Abstract:
Amid mounting and increasingly sophisticated cyber threats, malware has emerged as a substantial
challenge in today's digital world. Traditional defences, which depend on static analysis and
signature-based tactics, frequently fail to detect and classify malware variants and zero-day
attacks because they are vulnerable to obfuscation and polymorphism. Behaviour-based malware
detection, by contrast, provides deeper insight into how malware executes and is more effective
in malware family classification. This paper introduces a distinctive framework capable of
correctly classifying known malware samples into their respective families. The research puts
forward a comprehensive strategy for classifying malware families, from data preprocessing to
feature selection, emphasising the pivotal role of machine learning in this process. The
methodology employed involves three crucial stages: extraction of labels and features,
representation of features, and finally, feature selection and classification. The study makes use of
the publicly accessible "Malware Analysis Datasets: Top-1000 PE Imports" by IEEE, centring on
the top 1000 imported functions drawn from 'pe_imports' elements. These elements are extracted
using Cuckoo Sandbox, a robust and distributed framework for malware analysis. The
process of assigning labels to malware families is conducted through VirusTotal, which harnesses
data from all available antivirus vendors, effectively mitigating potential issues related to label
completeness, consistency, accuracy, and coverage. The features selected for malware
classification are the API calls related to file, registry, network, process, and system
operations that are invoked during the execution of malware samples. Machine learning models,
particularly Random Forests and Decision Trees, play a key role in feature selection, identifying
'Classification' and 'Family' as essential features for malware detection. Their significance is
further validated with an information-entropy analysis that ranks features by Information Gain
Ratio.
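
As an illustration of the ranking step, the following minimal sketch computes the Information Gain Ratio for binary import features and sorts them by score. It is not the paper's implementation: the function names (entropy, gain_ratio, rank_features), the toy samples, and the family labels are all assumptions made for this example, and the three Windows API names are chosen only for illustration rather than taken from the dataset.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    total = len(values)
    return -sum((n / total) * log2(n / total) for n in Counter(values).values())


def gain_ratio(feature_values, labels):
    """Information gain of a discrete feature divided by its split information."""
    total = len(labels)
    # Group the class labels by the value the feature takes for each sample.
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    info_gain = entropy(labels) - conditional
    split_info = entropy(feature_values)
    # A feature with a single value carries no split information; score it as 0.
    return info_gain / split_info if split_info > 0 else 0.0


def rank_features(samples, labels):
    """Return (feature, gain ratio) pairs sorted from most to least informative."""
    names = list(samples[0].keys())
    scores = {name: gain_ratio([s[name] for s in samples], labels) for name in names}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical toy data: 1 means the sample imports the function, 0 means it does not.
toy_samples = [
    {"CreateFileW": 1, "RegSetValueExW": 1, "GetTickCount": 1},
    {"CreateFileW": 1, "RegSetValueExW": 0, "GetTickCount": 1},
    {"CreateFileW": 0, "RegSetValueExW": 1, "GetTickCount": 1},
    {"CreateFileW": 0, "RegSetValueExW": 0, "GetTickCount": 1},
]
toy_labels = ["family_a", "family_a", "family_b", "family_b"]

for name, score in rank_features(toy_samples, toy_labels):
    print(f"{name}: {score:.3f}")
```

Dividing the information gain by the split information penalises features that fragment the data into many small groups, which is the usual motivation for preferring the gain ratio over raw information gain when ranking features.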