Abstract:
Cybersecurity threats continue to rise in complexity and scale. This work proposes
robust feature extraction and machine learning techniques for the detection and
identification of malware, using a private, initially unlabelled dataset of MS-Office and
Portable Executable (PE) files. Robust feature extraction via the SCORE framework
was pivotal in ensuring the models' reliability and performance under adversarial
conditions. To address the challenge of class imbalance, SMOTE resampling was applied.
Multiple machine learning models, including K-Nearest Neighbours (KNN), Random
Forest (RF), Support Vector Machine (SVM), Multi-Layer Perceptron (MLP), and a
custom Convolutional Neural Network (CNN), were fine-tuned for both malware detection
(binary classification) and threat identification (multi-class classification). The models
were evaluated using several performance metrics. Additionally, K-fold and leave-one-out
cross-validation were employed to improve robustness, and resource usage and run time
were recorded. The research achieved state-of-the-art results, with significant success in
identifying obfuscated and adversarially modified malware. To further evaluate the
robustness of our models, we performed independent validation. This additional validation
provided strong evidence of the models’ generalization capabilities and resilience to unseen
malware samples.
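
As an illustration of the resampling step mentioned above, the following is a minimal sketch of SMOTE-style oversampling in plain Python. It is not the authors' implementation: the function name `smote_oversample`, its parameters, and the toy feature vectors are all hypothetical, and a real pipeline would more likely rely on an established library. The sketch shows only the core idea, that each synthetic sample is an interpolation between a minority-class sample and one of its nearest minority-class neighbours.

```python
import random


def smote_oversample(minority, n_new, k=5, seed=0):
    """Return n_new synthetic samples interpolated between minority points.

    Hypothetical helper for illustration; `minority` is a list of
    equal-length numeric feature vectors from the minority class.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot have more neighbours than other points
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by squared Euclidean distance to base.
        others = [j for j in range(len(minority)) if j != i]
        others.sort(
            key=lambda j: sum((a - b) ** 2 for a, b in zip(base, minority[j]))
        )
        # Pick one of the k nearest neighbours and interpolate towards it.
        neighbour = minority[rng.choice(others[:k])]
        gap = rng.random()
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


# Toy minority-class feature vectors (assumed values, not from the dataset).
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote_oversample(minority, n_new=6, k=2)
```

When oversampling is combined with cross-validation, it is usually applied to the training folds only, so that synthetic samples do not leak into the held-out fold.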