A Novel Data Extraction Framework Using Natural Language Processing (DEFNLP) Techniques

Hussain, Tayyaba

dc.contributor.author	Hussain, Tayyaba
dc.date.accessioned	2023-08-04T06:27:36Z
dc.date.available	2023-08-04T06:27:36Z
dc.date.issued	2021
dc.identifier.issn	319554
dc.identifier.uri	http://10.250.8.41:8080/xmlui/handle/123456789/35606
dc.description	Supervisor: Dr. Muhammad Usman Akram	en_US
dc.description.abstract	Evidence derived from data is critical if governments are to address the many threats facing society, including pandemics, climate change, Alzheimer’s disease, and child hunger, as well as challenges such as increasing food production and maintaining biodiversity. Yet much of the information about the data needed to inform evidence and science is locked inside publications. A new dataset, Coleridge Initiative - Show US the Data, was recently introduced to discover how data is used for the public good. In this research, we present a general Data Extraction Framework Using Natural Language Processing Techniques (DEFNLP), which responds to the challenge posed to data scientists of showing how publicly funded data are used to serve science and society using Natural Language Processing (NLP) techniques and models. The proposed framework uses NLP libraries and techniques, such as spaCy named entity recognition (NER) and several Hugging Face Question Answering (QA) models, to predict the datasets used in publications after further processing and data and text mining. DEFNLP will enable government agencies and researchers to quickly find the information they need. Until now, this problem has not been addressed on a large dataset spanning numerous research areas. The proposed approach is domain independent and can therefore be applied to all kinds of case studies and scenarios where data is extracted. Our methodology sets a new state of the art on the Coleridge dataset, reaching a score of 0.654 and outperforming other frameworks. The framework is also efficient: each epoch took around 5 minutes on average on a CPU, with an output size of 3.27 kB.	en_US
dc.language.iso	en	en_US
dc.publisher	College of Electrical & Mechanical Engineering (CEME), NUST	en_US
dc.subject	Big Data, Data Extraction, Data Mining, Named Entity Recognition (NER), Natural Language Processing (NLP), Question Answering (QA) Modelling, Text Mining	en_US
dc.title	A Novel Data Extraction Framework Using Natural Language Processing (DEFNLP) Techniques	en_US
dc.type	Thesis	en_US