Abstract:
Emotions are an essential part of natural human communication; they play a vital role in decision making, behavior, learning, and daily communication. Speech and facial expressions are considered the primary sources of emotional information. The purpose of this research is to design an automated system that can recognize six basic human emotions, namely anger, disgust, fear, happiness, sadness, and surprise, for effective communication between humans and computers. In the proposed method, audio and visual features have been extracted separately from videos containing emotional expressions. A sequential deep convolutional neural network (CNN) has been used along with a recurrent neural network (RNN) to classify these emotions. From the audio, Mel-frequency cepstral coefficient (MFCC) features have been extracted and passed to a CNN for audio classification.
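As a minimal sketch of this audio front end (assuming librosa; the 40-coefficient setting and the fixed-length padding are illustrative choices, not the paper's exact configuration):

    import numpy as np
    import librosa

    def extract_mfcc(path, sr=22050, n_mfcc=40, max_frames=300):
        # Load the clip's audio track and compute its MFCC matrix.
        y, _ = librosa.load(path, sr=sr)
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)  # (n_mfcc, frames)
        # Pad or truncate along time so every clip gives a fixed-size CNN input.
        if mfcc.shape[1] < max_frames:
            mfcc = np.pad(mfcc, ((0, 0), (0, max_frames - mfcc.shape[1])))
        else:
            mfcc = mfcc[:, :max_frames]
        return mfcc[np.newaxis, ...]  # add a channel axis for the CNN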
For comparison, fine-tuning has been performed on a pre-trained AlexNet deep CNN with mel-spectrograms as input. Features extracted from the fine-tuned AlexNet yield better recognition rates on the audio data.
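A sketch of this fine-tuning step, assuming PyTorch/torchvision; swapping the final fully connected layer for a six-class output is the standard transfer-learning recipe, and rendering mel-spectrograms as AlexNet-sized images is an assumption about the preprocessing:

    import torch.nn as nn
    from torchvision import models

    # Load AlexNet pre-trained on ImageNet and replace its output layer
    # so it predicts the six emotion classes instead of 1000 ImageNet classes.
    alexnet = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1)
    alexnet.classifier[6] = nn.Linear(4096, 6)
    # Mel-spectrograms are rendered as 3-channel 224x224 images to match
    # AlexNet's expected input; after fine-tuning, activations of the
    # penultimate layer serve as the audio feature vector.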
On the other hand, visual features have been extracted from the video frames using a CNN and then fed to an RNN with a long short-term memory (LSTM) layer to handle the temporal nature of the experimental data.
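A sketch of the visual branch under the same PyTorch assumption; the 4096-dimensional per-frame features and the 256-unit hidden state are illustrative, not the paper's reported sizes:

    import torch.nn as nn

    class VisualEmotionNet(nn.Module):
        """LSTM over a sequence of per-frame CNN features."""
        def __init__(self, feat_dim=4096, hidden=256, n_classes=6):
            super().__init__()
            self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
            self.fc = nn.Linear(hidden, n_classes)

        def forward(self, frame_feats):           # (batch, time, feat_dim)
            _, (h_n, _) = self.lstm(frame_feats)  # h_n: (1, batch, hidden)
            return self.fc(h_n[-1])               # classify the final state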
Multimodal emotion recognition has been performed by fusing the audio and visual modalities through decision-level and score-level fusion. SVM, random forest, KNN, and logistic regression classifiers were used to classify the emotions from the fused audio-visual data.
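Score-level fusion can be sketched as a weighted combination of each modality's class scores (assuming both classifiers expose per-class probabilities, e.g., via scikit-learn's predict_proba; the equal weighting is an assumption, not the paper's tuned setting):

    import numpy as np

    def score_level_fusion(p_audio, p_visual, w=0.5):
        # Weighted average of the per-class scores from the two modalities;
        # decision-level fusion would instead combine the hard labels.
        fused = w * p_audio + (1.0 - w) * p_visual
        return np.argmax(fused, axis=1)  # predicted emotion per sample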
Experiments have been performed on two audio-visual databases, namely RML and BAUM-1s. RML contains 720 video samples recorded by 8 actors, and BAUM-1s contains 544 video samples recorded by 31 actors from different ethnic and cultural backgrounds.
Leave-One-Speaker-Out (LOSO) and Leave-One-Speaker-Group-Out (LOSGO) cross-validation are used to evaluate the model on RML and BAUM-1s, respectively.
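The LOSO protocol can be sketched with scikit-learn's LeaveOneGroupOut, where groups carries each clip's speaker identity so every fold holds out one speaker entirely (LOSGO is analogous with speaker-group labels; the data here is a placeholder):

    import numpy as np
    from sklearn.model_selection import LeaveOneGroupOut
    from sklearn.svm import SVC

    # Placeholder data: X is the fused feature matrix, y the emotion labels,
    # and speaker_ids marks which of the 8 RML speakers produced each clip.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(40, 8))
    y = rng.integers(0, 6, size=40)
    speaker_ids = np.repeat(np.arange(8), 5)

    accs = []
    for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups=speaker_ids):
        clf = SVC().fit(X[train_idx], y[train_idx])  # one fold per held-out speaker
        accs.append(clf.score(X[test_idx], y[test_idx]))
    print(sum(accs) / len(accs))  # average recognition rate across speakers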
Competitive recognition rates are achieved on both datasets: 61.68% on BAUM-1s and 79.51% on RML. The result on BAUM-1s improves on previous state-of-the-art results by 1.19%.