NUST Institutional Repository

Analysis of Multimodal Representation Learning Across Medical Images and Reports Generation Using Multiple Vision and Language Pre-Trained Models


dc.contributor.author Hassan, Ahmad
dc.date.accessioned 2023-08-09T11:16:09Z
dc.date.available 2023-08-09T11:16:09Z
dc.date.issued 2022
dc.identifier.other 318671
dc.identifier.uri http://10.250.8.41:8080/xmlui/handle/123456789/36066
dc.description Supervisor: Dr. Muhammad Usman Akram en_US
dc.description.abstract In medicine, medical images are visual representations of organs and their function. They are used to identify medical problems in humans, and highly trained professionals interpret them into reports, which consumes a large part of their day since each report takes around 4-6 minutes. These reports highlight the diseases present, and a well-summarized report helps less experienced readers understand the findings and supports better treatment. Automated summarization of radiology reports therefore has tremendous potential to improve the diagnostic workflow. Extracting a joint image-text embedding from chest X-rays and radiology reports to produce summarized reports along with findings/tags would significantly reduce the workload of doctors and help them treat patients. Because of the sensitivity of the task, existing methods are not sufficiently accurate, and limited data hampers model training, which makes generating summarized radiology reports exceedingly difficult. A novel approach is proposed to address this issue: pre-trained vision-and-language models such as VisualBERT, UNITER, and LXMERT are used to learn multimodal representations from chest X-rays (radiographs) and their reports. The pre-trained model classifies the findings/tags in the chest X-rays (CXR), and a Gated Recurrent Unit (GRU) decoder generates a summarized report based on them (a minimal sketch of this pipeline is given after the record fields below). The chest X-ray images and reports come from the publicly available Indiana University dataset. CNN-RNN-based methods for automatic report summarization and findings/tags classification also exist, but they mostly rely on text or images alone and achieve lower accuracy. The joint image-text embedding learned with the pre-trained models enables more accurate report generation and improves performance on both the thoracic-findings classification and report-summarization tasks. Experimental results on the Indiana University (IU) CXR dataset show that the proposed model attains state-of-the-art performance compared with existing baseline solutions. BLEU and ROUGE are used as evaluation metrics for the generated reports, and AUC for the findings/tags. Experiments were performed in multiple settings; the accuracy achieved for disease findings is about 98%, with a BLEU score of 0.35 and a ROUGE score of 0.65 for the summarized radiology report. en_US
dc.language.iso en en_US
dc.publisher College of Electrical & Mechanical Engineering (CEME), NUST en_US
dc.subject Keywords: Bottom-Up Top-Down (BUTD), Chest X-Rays (CXR), Convolutional Neural Network (CNN), Deep Learning, Gated Recurrent Units (GRU), Recall-Oriented Understudy for Gisting Evaluation (ROUGE), Visual Question Answering (VQA). en_US
dc.title Analysis of Multimodal Representation Learning Across Medical Images and Reports Generation Using Multiple Vision and Language Pre-Trained Models en_US
dc.type Thesis en_US
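
The encoder-decoder pipeline described in the abstract (a joint image-text embedding feeding a multi-label findings/tags classifier and a GRU report decoder) can be sketched as follows. This is a minimal illustrative sketch, not the thesis implementation: the class name, feature dimensions, vocabulary size, and number of findings are assumptions, and the actual work obtains the joint representation from pre-trained VisualBERT, UNITER, and LXMERT models on the IU CXR dataset.

import torch
import torch.nn as nn

class JointEmbeddingReportModel(nn.Module):
    """Joint image-text embedding -> findings classifier + GRU report decoder (illustrative)."""

    def __init__(self, visual_dim=2048, text_dim=768, joint_dim=512,
                 vocab_size=5000, num_findings=14):
        super().__init__()
        # Project pre-extracted CXR image features and report-text features
        # (e.g. pooled encoder outputs) into a shared joint embedding space.
        self.visual_proj = nn.Linear(visual_dim, joint_dim)
        self.text_proj = nn.Linear(text_dim, joint_dim)
        # Multi-label head for thoracic findings/tags (one logit per finding).
        self.findings_head = nn.Linear(joint_dim, num_findings)
        # GRU decoder that generates the summarized report token by token,
        # conditioned on the joint embedding through its initial hidden state.
        self.token_embed = nn.Embedding(vocab_size, joint_dim)
        self.decoder = nn.GRU(joint_dim, joint_dim, batch_first=True)
        self.vocab_head = nn.Linear(joint_dim, vocab_size)

    def forward(self, visual_feats, text_feats, report_tokens):
        # visual_feats: (B, visual_dim), text_feats: (B, text_dim),
        # report_tokens: (B, T) gold report token ids (teacher forcing).
        joint = torch.tanh(self.visual_proj(visual_feats) + self.text_proj(text_feats))
        findings_logits = self.findings_head(joint)              # (B, num_findings)
        h0 = joint.unsqueeze(0)                                  # (1, B, joint_dim)
        dec_out, _ = self.decoder(self.token_embed(report_tokens), h0)
        report_logits = self.vocab_head(dec_out)                 # (B, T, vocab_size)
        return findings_logits, report_logits

# Toy forward pass with random tensors, just to show the shapes involved.
model = JointEmbeddingReportModel()
visual = torch.randn(2, 2048)                 # pooled visual features for 2 CXR images
text = torch.randn(2, 768)                    # pooled text features for the paired reports
tokens = torch.randint(0, 5000, (2, 20))      # 20-token report snippets
findings_logits, report_logits = model(visual, text, tokens)
print(findings_logits.shape, report_logits.shape)   # torch.Size([2, 14]) torch.Size([2, 20, 5000])

In the thesis, the generated summaries are scored with BLEU and ROUGE and the findings/tags classification with AUC; the sketch above omits training and evaluation.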


This item appears in the following Collection(s)

  • MS [329]
