NUST Institutional Repository

A Novel Data Extraction Framework Using Natural Language Processing (DEFNLP) Techniques

dc.contributor.author Hussain, Tayyaba
dc.date.accessioned 2023-08-04T06:27:36Z
dc.date.available 2023-08-04T06:27:36Z
dc.date.issued 2021
dc.identifier.issn 319554
dc.identifier.uri http://10.250.8.41:8080/xmlui/handle/123456789/35606
dc.description Supervisor: Dr. Muhammad Usman Akram en_US
dc.description.abstract Evidence through data is critical if government is to address the many threats facing society, including pandemics, climate change, Alzheimer’s disease, and child hunger, and to meet challenges such as increasing food production and maintaining biodiversity. Yet much of the information about the data needed to inform evidence and science is locked inside publications. A new dataset, Coleridge Initiative - Show US the Data, was recently introduced to discover how data is used for the public good; it challenges data scientists to show how publicly funded data are used to serve science and society. In this research, we present a general Data Extraction Framework Using Natural Language Processing Techniques (DEFNLP) that takes up this challenge using Natural Language Processing (NLP) techniques and models. The proposed framework uses NLP libraries and techniques such as SpaCy NER and several Hugging Face Question Answering (QA) models to predict the datasets used in publications, after further processing and data and text mining. DEFNLP will enable government agencies and researchers to quickly find the information they need. Until now, this problem, which involves a large dataset spanning numerous research areas, has not been addressed. The proposed approach is domain independent and can therefore be applied to any case study or scenario in which data is extracted. Our methodology sets the state of the art on the Coleridge dataset with a score of 0.654, outperforming other frameworks. It is also fast and performs well: each epoch took around 5 minutes on average on a CPU, with an output size of 3.27 kB. en_US
dc.language.iso en en_US
dc.publisher College of Electrical & Mechanical Engineering (CEME), NUST en_US
dc.subject Keywords— Big Data, Data Extraction, Data Mining, Named Entity Recognition (NER), Natural Language Processing (NLP), Question Answering (QA) Modelling, Text Mining en_US
dc.title A Novel Data Extraction Framework Using Natural Language Processing (DEFNLP) Techniques en_US
dc.type Thesis en_US


This item appears in the following Collection(s)

  • MS [441]
