Abstract:
In transfer learning, a model is pre-trained on a large unlabeled dataset and then fine-tuned on
downstream tasks. These pre-training and fine-tuning models are powerful and have produced the
best results on Natural Language Processing (NLP) tasks. These models are unidirectional, but
BERT introduced the first deeply bidirectional model, which reads the input from both
directions. BERT was pre-trained on the Wikipedia and BooksCorpus datasets and is fine-tuned with
an extra output layer. We present a replication study of BERT and provide a detailed analysis of the
effect of pre-training hyperparameters on downstream tasks. Because the BooksCorpus
dataset is not publicly available, we pre-trained BERT from scratch on
Wikipedia (2100M) and compare it with our model, which was trained on Wikipedia (531M). Our
model, Modified BERT (MBERT), achieves better results on GLUE (74.94), which consists of
eight tasks (all except STS-B), as well as on SQuAD v1.1 (57.40/69.50) and SQuAD v2.0 (56.19/59.38),
while reducing pre-training time from 53 hours to only 17 hours, using six times less computational
power, and training on a four times smaller dataset. We also present a detailed study of why MBERT
achieves these results on the SQuAD dataset.