Abstract:
Person authentication is a primary consideration wherever privacy must be protected.
Deep-learning-based authentication algorithms have many applications in this
field. Combining multiple modalities makes such a system more robust. In this research, a joint
multi-modal audio-visual deep-learning-based method is devised to authenticate
a person based on both their voice and their face. This two-step verification process works
by learning face-feature-based embeddings as well as voice-feature-based embeddings
to answer two questions: 1) whether the face presented matches an identity in a reference
database and 2) whether the voice matches any voice in the reference database. This strategy
can help protect important systems against impostor attempts using modalities that are
commonly available on consumer devices.
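
The two-step check described above can be illustrated with a minimal sketch. It assumes that face and voice embeddings have already been produced by trained networks and are compared by cosine similarity against a threshold; the function names, the dictionary-based reference database, and the 0.8 thresholds are all hypothetical choices for illustration, not details taken from this work.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_match(probe, gallery, threshold):
    """Return (identity, score) for the closest reference embedding.

    The identity is None when the best score falls below the threshold.
    """
    best_id, best_score = None, -1.0
    for identity, reference in gallery.items():
        score = cosine_similarity(probe, reference)
        if score > best_score:
            best_id, best_score = identity, score
    if best_score >= threshold:
        return best_id, best_score
    return None, best_score


def authenticate(face_emb, voice_emb, face_db, voice_db,
                 face_thr=0.8, voice_thr=0.8):
    """Two-step verification: the face must match, then the voice must match.

    Thresholds are assumed values; in practice they would be tuned on
    validation data.
    """
    # Step 1: does the presented face match an identity in the reference database?
    face_id, _ = best_match(face_emb, face_db, face_thr)
    if face_id is None:
        return False
    # Step 2: does the voice match any voice in the reference database?
    voice_id, _ = best_match(voice_emb, voice_db, voice_thr)
    return voice_id is not None
```

A stricter variant, also an assumption rather than something stated here, would additionally require the identity matched by voice to equal the identity matched by face.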