Abstract:
The digital realm is increasingly threatened by botnet attacks, which have escalated
both in frequency and complexity. To address these threats, our research
focuses on the deployment of AI-enhanced systems for the detection and mitigation
of botnet activities. Prior research in this area has concentrated on
developing datasets, improving detection methods with machine learning
(ML) techniques, and conducting behaviour-based analysis. Our review
identifies critical limitations in existing datasets, including outdated data,
limited coverage of attack types, and a lack of verified
ground-truth data. To bridge these gaps, we introduce the BotLab-DS1 dataset,
which comprises 5,279 authentic botnet samples across 12 families and 3,000 benign
samples, providing a robust foundation for detection models. Our study
unfolds in three primary areas: an exhaustive evaluation of current datasets and
their limitations, the development of a comprehensive dataset creation strategy
based on feature engineering across static, behavioural, and network attributes,
and the application of diverse ML algorithms to botnet
detection. Empirical results demonstrate the BotLab-DS1 dataset’s effectiveness,
particularly when combined with the Random Forest algorithm, achieving
an accuracy of 98.6% and a precision of 99.0%. Gradient Boosting also shows
strong performance, with 96.34% accuracy and 96.0% precision. Furthermore,
we explore the application of Natural Language Processing (NLP)
techniques, such as Bag of Words, BERT, GloVe, and Word2Vec, to analyse behavioural
reports, enriching the ML feature sets used for botnet detection. This approach,
combined with the XGBoost classifier, achieves an
accuracy of 99.17% and a ROC AUC score of 0.9995. Our findings underscore the
pivotal role of NLP in improving feature extraction and the overall effectiveness
of ML algorithms in addressing botnet threats. This research contributes
a novel dataset, demonstrates the efficacy
of combining NLP with ML for botnet detection, and lays a foundation
for future work on dataset development and algorithmic analysis to combat
cyber threats more effectively.
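
As an illustration of the Bag of Words step and of the accuracy and precision metrics reported above, the following is a minimal sketch using only the Python standard library. It is not the pipeline used in this work: the report texts, the function names, and the tokenization rule are all hypothetical, and no classifier is trained. In practice the resulting count vectors would be passed to a classifier such as XGBoost.

```python
import re
from collections import Counter

def tokenize(report):
    # Assumed tokenization: lowercase alphanumeric runs (illustrative only).
    return re.findall(r"[a-z0-9_]+", report.lower())

def build_vocabulary(reports):
    # Map every distinct token to a fixed column index.
    tokens = sorted({tok for report in reports for tok in tokenize(report)})
    return {tok: idx for idx, tok in enumerate(tokens)}

def bag_of_words(report, vocab):
    # Count how often each vocabulary token occurs in one report.
    counts = Counter(tokenize(report))
    vector = [0] * len(vocab)
    for tok, n in counts.items():
        if tok in vocab:
            vector[vocab[tok]] = n
    return vector

def accuracy(y_true, y_pred):
    # Fraction of samples whose predicted label matches the true label.
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def precision(y_true, y_pred, positive=1):
    # Fraction of samples predicted positive that are truly positive.
    predicted_pos = [t for t, p in zip(y_true, y_pred) if p == positive]
    if not predicted_pos:
        return 0.0
    return sum(t == positive for t in predicted_pos) / len(predicted_pos)

# Hypothetical behavioural report snippets (not taken from BotLab-DS1).
reports = [
    "creates mutex connects irc server connects irc channel",
    "opens document saves file",
]
vocab = build_vocabulary(reports)
features = [bag_of_words(r, vocab) for r in reports]
```

Each row of `features` is one report expressed as token counts over a shared vocabulary, which is the form a tabular classifier expects. With label 1 denoting a botnet sample, `precision` measures how many samples flagged as botnet really are botnet, while `accuracy` counts correct decisions over both classes.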