Abstract:
This study focuses on feature selection, an aspect more significant to a successful ML
program than building the prediction model itself. A training dataset consisting of
positive and negative bacterial protein samples was annotated for various biological and
physicochemical features using publicly available bioinformatics tools.
The final set of best features was selected after the original dataset was pre-processed
through various feature selection techniques. These included removal of null,
duplicate, constant and quasi-constant features; application of Pearson’s correlation,
recursive feature elimination with cross-validation, and data transformation for skewed
data; followed by hyper-parameter optimization of the models. To test the performance of
the final feature set, eight ML models were built and cross-validated via stratified
5-fold cross-validation. Using an independent benchmarking dataset, our best-performing
model, the Random Forest Classifier, was compared with other publicly available ML-RV
tools. This model achieved higher accuracy than the best existing program to date,
Vaxign-ML.
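
The filter-style steps listed in the abstract (removal of null, duplicate, constant and quasi-constant features, followed by a Pearson correlation filter) can be illustrated with a minimal sketch. This is not the study's code: the function names, the toy feature table, the 0.99 quasi-constant share and the 0.9 correlation cutoff are all assumptions chosen for illustration, and the later steps (recursive feature elimination, skew transformation, hyper-parameter optimization) are left out.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y) if norm_x and norm_y else 0.0


def select_features(columns, quasi_constant_share=0.99, corr_cutoff=0.9):
    """Apply the filter steps in order; both thresholds are assumed values."""
    kept, seen = {}, set()
    for name, values in columns.items():
        # 1. Null features: drop any column with a missing value.
        if any(v is None for v in values):
            continue
        # 2. Duplicate features: drop columns identical to an earlier one.
        key = tuple(values)
        if key in seen:
            continue
        seen.add(key)
        # 3. Constant and quasi-constant features: one value dominates.
        top_share = max(values.count(v) for v in set(values)) / len(values)
        if top_share >= quasi_constant_share:
            continue
        # 4. Pearson filter: drop columns highly correlated with a kept one.
        if any(abs(pearson(values, other)) >= corr_cutoff for other in kept.values()):
            continue
        kept[name] = values
    return kept


# Hypothetical toy feature table (feature name -> one value per protein sample).
features = {
    "length": [100, 250, 180, 320, 90, 400],
    "length_copy": [100, 250, 180, 320, 90, 400],
    "mol_weight": [11.0, 27.4, 19.9, 35.1, 10.2, 44.3],
    "constant": [1, 1, 1, 1, 1, 1],
    "hydropathy": [0.3, -0.2, 0.5, -0.4, 0.1, -0.1],
    "has_null": [1, None, 3, 4, 5, 6],
}

selected = select_features(features)
print(list(selected))
```

In this toy table only `length` and `hydropathy` survive: `length_copy` is a duplicate, `mol_weight` is almost perfectly correlated with `length`, `constant` never varies, and `has_null` contains a missing value. Keeping the first feature of each correlated pair is one simple convention; other tie-breaking rules are equally plausible.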