dc.description.abstract |
With an increasing growth rate, cancer is one of the major causes of death
worldwide. The rapidly increasing adverse effects of cancer on patients have
propelled the research community to explore avenues that could reduce
the time to diagnosis and treatment. One such pathway is the extraction of the physical
traits of cancer, known as phenotypes. Correlating genomics data with phenotypic information,
typically found in clinical notes, is vital for a comprehensive understanding of cancer.
However, the quantity and diversity of notes make manual extraction of phenotypes
a labor-intensive task. Furthermore, the unstructured nature of clinical notes
makes them difficult to process with generic data extraction tools. Rule-based techniques
have been employed previously to obtain this information; however, the use of rules
limits the scope of the model in terms of cancers and phenotypes covered. We
aimed to devise a model that tackles these limitations by using natural language
processing (NLP) techniques such as Named Entity Recognition (NER), making it independent
of rules and reducing the dependency of data extraction on medical practitioners.
We extend a cancer ontology to include 65 phenotypes for eight cancer types and
propose an NER-based multi-cancer, multi-phenotype extraction method for
unstructured clinical records. A qualitative
and quantitative comparative analysis has been carried out between spaCy NER and
BERT-based NER models, with BERT outperforming spaCy and achieving precision and recall
scores of 0.84 and 0.85, respectively. To cope with the large dataset, active
learning was also introduced, with an uncertainty sampling interpretation presented for
NER problems. The research highlights the benefit of employing active learning with
BERT to annotate a large dataset by manually labelling a small representative sample
of the data. |
en_US |