Abstract:
Meta AI’s new unsupervised speech recognition framework (the wav2vec family) is the latest
development in several years of work on speech recognition models, datasets, and training
techniques. The wav2vec model has changed how traditional ASR works: only a few hours of
spoken data are now required to obtain transcribed speech. Despite this, over 6,000 languages
cannot exploit the opportunity because they lack the required speech corpus, which should
contain 4-5 hours of speech on average.
Assembling such a corpus is a challenge, especially for a low-resource language. The current
approach is to record speech manually and then transcribe it, which is resource-intensive and
costly. The internet, however, holds a wealth of speech data. To capitalize on it, we use an
automated speech-collection process instead of manual recording. In this thesis, we propose a
pipeline that automatically fetches audio from free video/audio-sharing websites and segments
it into audio frames of the desired length.
The proposed pipeline is generic and can be applied to any low-resource language.
Furthermore, using the proposed pipeline, we generated speech data for the Pashto language.
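
As a minimal sketch of the segmentation step described above (an illustration only, not the
thesis's implementation: the pydub library, the 10-second frame length, and all file names
are assumptions):

```python
# Illustrative only: splits one downloaded audio file into fixed-length
# frames. pydub (which requires ffmpeg), the 10-second frame length, and
# the file names are assumptions, not details taken from the thesis.
from pydub import AudioSegment

FRAME_MS = 10_000  # hypothetical target frame length: 10 seconds


def segment_audio(path: str, out_prefix: str, frame_ms: int = FRAME_MS) -> int:
    """Split an audio file into fixed-length WAV frames; return the count."""
    audio = AudioSegment.from_file(path)  # pydub slices are in milliseconds
    count = 0
    for start in range(0, len(audio), frame_ms):
        frame = audio[start:start + frame_ms]
        if len(frame) < frame_ms:  # drop the short trailing remainder
            break
        frame.export(f"{out_prefix}_{count:05d}.wav", format="wav")
        count += 1
    return count


if __name__ == "__main__":
    n = segment_audio("downloaded_audio.mp3", "pashto_frame")
    print(f"wrote {n} frames")
```

Fixed-length slicing keeps the sketch simple; a real pipeline might instead cut at silence
boundaries to avoid splitting words mid-utterance.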