Abstract:
This thesis aims to establish a foundation for research and development on
the Burushaski language. We construct the first audio and text dataset of the language
for use in future research. Our final goals include the development of a
Latin-based script, a structured and clean audio dataset, a usable text corpus, and an
initial Automatic Speech Recognition (ASR) system using the Kaldi toolkit based on
the developed datasets for the Burushaski language.
The Burushaski language is a language isolate and is considered one of the most difficult
languages to learn and model. In this thesis, we present the first free, open-source
database of Burushaski audio and text collected from
speakers. Additionally, we present a continuous Burushaski speech recognition model
built with the Kaldi toolkit. From the continuous speech samples in the Burushaski
audio dataset, we extracted Mel-frequency cepstral coefficient (MFCC) features for the
ASR system. We report in detail on the performance of the ASR system for
both monophone and triphone models (tri1, tri2, and tri3) using an n-gram
language model, with word error rate (WER) as the performance metric.
We trained the system on a limited dataset and found
that the tri3 triphone model performs significantly better than the
monophone model. Among the triphone models, tri3 also performs much better than tri2,
and tri2 performs better than tri1.
We also present a detailed framework for designing and developing
ASR systems for other zero-resource languages. This framework can also be used
for dataset generation in any language.