NUST Institutional Repository

Multi-modal Emotion Recognition Using Deep Learning Architectures

dc.contributor.author Hina, Iram
dc.date.accessioned 2023-08-09T09:43:29Z
dc.date.available 2023-08-09T09:43:29Z
dc.date.issued 2020
dc.identifier.other 00000171137
dc.identifier.uri http://10.250.8.41:8080/xmlui/handle/123456789/36007
dc.description Supervisor: Dr. Arslan Shaukat en_US
dc.description.abstract Emotions are an essential part of effective communication and play a vital role in decision making, behaviour learning, and everyday interaction. Speech and facial expressions are considered the primary sources of emotional information. The purpose of this research is to design an automated system that can recognize six basic human emotions, namely anger, disgust, fear, happiness, sadness, and surprise, for effective communication between humans and computers. In the proposed method, audio and visual features are extracted separately from emotional video recordings. A sequential deep convolutional neural network (CNN) is used together with a recurrent neural network (RNN) to classify these emotions. From the audio, features such as MFCCs are extracted and passed to a CNN for audio classification. For comparison, a pre-trained AlexNet deep CNN is fine-tuned with mel-spectrograms as input; the features obtained from the fine-tuned AlexNet give better recognition rates on the audio data. Visual features are extracted from video frames using a CNN and then fed to an RNN with an LSTM layer to model the temporal nature of the data. Multimodal emotion recognition is performed by fusing the audio and visual modalities through decision-level and score-level fusion, and SVM, random forest, KNN, and logistic regression classifiers are used to classify emotions from the fused audio-visual data. Experiments are performed on two audio-visual databases, RML and BAUM-1s. RML contains 720 video samples recorded by 8 actors, and BAUM-1s contains 544 video samples recorded by 31 actors of different ethnic and cultural backgrounds. Leave-One-Speaker-Out (LOSO) and Leave-One-Speaker-Group-Out (LOSGO) cross-validation are used to evaluate the model on RML and BAUM-1s, respectively. Competitive recognition rates are achieved on both datasets: 79.51% on RML and 61.68% on BAUM-1s, the latter improving on previous state-of-the-art results by 1.19%. en_US
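
Note: The abstract describes a CNN over MFCCs for audio, a CNN-LSTM over video frames for the visual stream, and score-level fusion of the two branches. The thesis code is not part of this record; the following is only a minimal Python/PyTorch sketch of that kind of pipeline. All layer sizes, class names (VisualCNNLSTM, AudioCNN, score_level_fusion), and input shapes are illustrative assumptions, not the author's implementation.

# Minimal sketch (not the thesis code): a CNN-LSTM visual branch and a small
# CNN audio branch over MFCCs, combined by score-level (softmax-average) fusion.
# All architecture choices below are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_EMOTIONS = 6  # anger, disgust, fear, happiness, sadness, surprise


class VisualCNNLSTM(nn.Module):
    """Per-frame CNN features fed to an LSTM to model temporal structure."""

    def __init__(self, num_classes=NUM_EMOTIONS, feat_dim=128, hidden=64):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d((4, 4)),
        )
        self.proj = nn.Linear(32 * 4 * 4, feat_dim)
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, num_classes)

    def forward(self, frames):                    # frames: (batch, time, 3, H, W)
        b, t = frames.shape[:2]
        x = self.cnn(frames.flatten(0, 1))        # per-frame features: (b*t, 32, 4, 4)
        x = self.proj(x.flatten(1)).view(b, t, -1)
        _, (h, _) = self.lstm(x)                  # last hidden state summarises the clip
        return self.fc(h[-1])                     # (batch, num_classes) logits


class AudioCNN(nn.Module):
    """Small CNN over an MFCC 'image' (coefficients x time frames)."""

    def __init__(self, num_classes=NUM_EMOTIONS):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 4)),
        )
        self.fc = nn.Linear(32 * 4 * 4, num_classes)

    def forward(self, mfcc):                      # mfcc: (batch, 1, n_mfcc, time)
        return self.fc(self.cnn(mfcc).flatten(1))


def score_level_fusion(audio_logits, visual_logits, w_audio=0.5):
    """Score-level fusion: weighted average of per-class posteriors."""
    return (w_audio * F.softmax(audio_logits, dim=1)
            + (1.0 - w_audio) * F.softmax(visual_logits, dim=1))


if __name__ == "__main__":
    frames = torch.randn(2, 8, 3, 64, 64)         # 2 clips, 8 RGB frames each
    mfcc = torch.randn(2, 1, 40, 100)             # 40 MFCCs x 100 time frames
    scores = score_level_fusion(AudioCNN()(mfcc), VisualCNNLSTM()(frames))
    print(scores.argmax(dim=1))                   # predicted emotion indices

A decision-level variant, as mentioned in the abstract, would instead take each branch's hard prediction (or per-class scores) and pass them to a second-stage classifier such as an SVM, random forest, KNN, or logistic regression.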
dc.language.iso en en_US
dc.publisher College of Electrical & Mechanical Engineering (CEME), NUST en_US
dc.subject Keywords: Audio-Visual Emotion Recognition, Multi-modal, Deep Convolution Neural Network, Deep Learning, Recurrent Neural Network, Long Short Term Memory, CNN-LSTM en_US
dc.title Multi-modal Emotion Recognition Using Deep Learning Architectures en_US
dc.type Thesis en_US


This item appears in the following Collection(s)

  • MS [441]
