Forensic Speaker Verification Using Speech Signals

Muhammad Mubeen

URI: http://10.250.8.41:8080/xmlui/handle/123456789/20269

Date: 2018

Abstract:

In recent years, owing to the perceived world security situation, governments and business organizations have required reliable methods to accurately identify individuals without overly infringing on their right to privacy or demanding significant compliance from the individual being recognized. Biometric verification systems have been used extensively for this purpose over the past few years; the most common include facial recognition, fingerprint verification, and voice-based (speaker) verification. With the availability of better computational resources and large datasets, deep learning has achieved remarkable results in the field of artificial intelligence. Its success in computer vision has prompted interest in applying deep learning to speech information processing as well, especially to the speaker verification task. The aim of this project is to use Deep Convolutional Neural Networks coupled with MFECs (Mel Frequency Energy Coefficients) to build a Speaker Verification (SV) system that is invariant to mimicry, noise, and channel degradation, is text-independent, and can decide on the legitimacy of a speaker using only short utterances. We propose a novel Deep Convolutional Neural Network (DCNN) to extract speaker-specific information (SI) from MFECs, a popular frequency-domain representation of speech signals. A two-stage learning strategy is adopted: unsupervised training for network initialization, followed by triplet-based learning of the network. In the second stage, a triplet loss function is used to discriminate between speakers on the basis of their intrinsic statistical patterns, which are distributed in the representations yielded by the deep network. This is achieved through triplet-wise comparison of these representations for similar and dissimilar speakers.
Finally, to test the network, several datasets were used, including TIMIT, NTIMIT, and the KING dataset. These datasets have been provided by NIST under commercial license. The results have been compiled and discussed, and they were very promising.
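
The triplet-based objective mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact formulation used in the thesis, so everything here is an assumption: the hinge form of the loss, the plain (non-squared) Euclidean distance, the margin value of 0.2, and the function names `euclidean_distance` and `triplet_loss`. The embeddings stand in for the representations produced by the network.

```python
import math
from typing import Sequence


def euclidean_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(
    anchor: Sequence[float],
    positive: Sequence[float],
    negative: Sequence[float],
    margin: float = 0.2,  # assumed value, not taken from the thesis
) -> float:
    """Hinge-style triplet loss (an assumed, commonly used form).

    anchor and positive are embeddings of the same speaker;
    negative is an embedding of a different speaker. The loss is zero
    once the negative is farther from the anchor than the positive
    by at least `margin`.
    """
    d_pos = euclidean_distance(anchor, positive)
    d_neg = euclidean_distance(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


if __name__ == "__main__":
    # Toy 2-D embeddings, purely illustrative.
    well_separated = triplet_loss([0.0, 0.0], [0.0, 0.0], [3.0, 4.0])
    badly_separated = triplet_loss([0.0, 0.0], [3.0, 4.0], [0.0, 0.0])
    print(well_separated, badly_separated)
```

Under this form, a triplet contributes to training only while the different-speaker embedding is not yet at least one margin farther from the anchor than the same-speaker embedding.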