Abstract:
Emotions are an essential part of natural human communication; they play a vital role in decision making, behavior, learning, and daily communication. Speech and facial expressions are considered the primary sources of emotional information. The purpose of this research is to design an automated system that can recognize six basic human emotions, namely anger, disgust, fear, happiness, sadness, and surprise, for effective communication between humans and computers. In the proposed method, audio and visual features have been extracted separately from videos containing emotional expressions. A sequential deep convolutional neural network (CNN) has been used along with a recurrent neural network (RNN) to classify these emotions. From the audio, Mel-frequency cepstral coefficient (MFCC) features have been extracted and passed to a CNN for audio classification.
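As a minimal sketch of this audio front end (assuming librosa; the 40-coefficient setting and the fixed-length padding are illustrative choices, not the paper's exact configuration):

    import numpy as np
    import librosa

    def extract_mfcc(path, sr=22050, n_mfcc=40, max_frames=300):
        # Load the clip's audio track and compute its MFCC matrix.
        y, _ = librosa.load(path, sr=sr)
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)  # (n_mfcc, frames)
        # Pad or truncate along time so every clip gives a fixed-size CNN input.
        if mfcc.shape[1] < max_frames:
            mfcc = np.pad(mfcc, ((0, 0), (0, max_frames - mfcc.shape[1])))
        else:
            mfcc = mfcc[:, :max_frames]
        return mfcc[np.newaxis, ...]  # add a channel axis for the CNN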
For comparison, fine-tuning has been performed on a pre-trained AlexNet deep CNN with mel-spectrograms as input. Features extracted from the fine-tuned AlexNet yield better recognition rates on the audio data.
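A sketch of this fine-tuning step, assuming PyTorch/torchvision; swapping the final fully connected layer for a six-class output is the standard transfer-learning recipe, and rendering mel-spectrograms as AlexNet-sized images is an assumption about the preprocessing:

    import torch.nn as nn
    from torchvision import models

    # Load AlexNet pre-trained on ImageNet and replace its output layer
    # so it predicts the six emotion classes instead of 1000 ImageNet classes.
    alexnet = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1)
    alexnet.classifier[6] = nn.Linear(4096, 6)
    # Mel-spectrograms are rendered as 3-channel 224x224 images to match
    # AlexNet's expected input; after fine-tuning, activations of the
    # penultimate layer serve as the audio feature vector.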
On the other hand, visual features have been extracted from the video frames using a CNN and then fed to an RNN with a long short-term memory (LSTM) layer to handle the temporal nature of the experimental data.
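A sketch of the visual branch under the same PyTorch assumption; the 4096-dimensional per-frame features and the 256-unit hidden state are illustrative, not the paper's reported sizes:

    import torch.nn as nn

    class VisualEmotionNet(nn.Module):
        """LSTM over a sequence of per-frame CNN features."""
        def __init__(self, feat_dim=4096, hidden=256, n_classes=6):
            super().__init__()
            self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
            self.fc = nn.Linear(hidden, n_classes)

        def forward(self, frame_feats):           # (batch, time, feat_dim)
            _, (h_n, _) = self.lstm(frame_feats)  # h_n: (1, batch, hidden)
            return self.fc(h_n[-1])               # classify the final state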
Multimodal emotion recognition has been performed by fusing the audio and visual modalities through decision-level and score-level fusion. SVM, random forest, KNN, and logistic regression classifiers were used to classify the emotions from the fused audio-visual data.
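Score-level fusion can be sketched as a weighted combination of each modality's class scores (assuming both classifiers expose per-class probabilities, e.g., via scikit-learn's predict_proba; the equal weighting is an assumption, not the paper's tuned setting):

    import numpy as np

    def score_level_fusion(p_audio, p_visual, w=0.5):
        # Weighted average of the per-class scores from the two modalities;
        # decision-level fusion would instead combine the hard labels.
        fused = w * p_audio + (1.0 - w) * p_visual
        return np.argmax(fused, axis=1)  # predicted emotion per sample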
Experiments have been performed on two audio-visual databases, namely RML and BAUM-1s. RML contains 720 video samples recorded by 8 actors, and BAUM-1s contains 544 video samples recorded by 31 actors from different ethnic and cultural backgrounds.
Leave-One-Speaker-Out (LOSO) and Leave-One-Speaker-Group-Out (LOSGO) cross-validation are used to evaluate the model on RML and BAUM-1s, respectively.
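The LOSO protocol can be sketched with scikit-learn's LeaveOneGroupOut, where groups carries each clip's speaker identity so every fold holds out one speaker entirely (LOSGO is analogous with speaker-group labels; the data here is a placeholder):

    import numpy as np
    from sklearn.model_selection import LeaveOneGroupOut
    from sklearn.svm import SVC

    # Placeholder data: X is the fused feature matrix, y the emotion labels,
    # and speaker_ids marks which of the 8 RML speakers produced each clip.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(40, 8))
    y = rng.integers(0, 6, size=40)
    speaker_ids = np.repeat(np.arange(8), 5)

    accs = []
    for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups=speaker_ids):
        clf = SVC().fit(X[train_idx], y[train_idx])  # one fold per held-out speaker
        accs.append(clf.score(X[test_idx], y[test_idx]))
    print(sum(accs) / len(accs))  # average recognition rate across speakers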
Competitive recognition rates are achieved on both datasets: 61.68% on BAUM-1s and 79.51% on RML. The result on BAUM-1s improves on previous state-of-the-art results by 1.19%.