Malware Detection through Activity logs and Apply Machine Learning to detect new Malware

Kazmi, Hasnain Taqi

URI: http://10.250.8.41:8080/xmlui/handle/123456789/37858

Date: 2023

Abstract:

Amid the mounting and progressively intricate cyber threats, malware has emerged as a substantial challenge in today's digital world. Traditional defences, dependent on static analysis and signature-based tactics, frequently fail to detect and classify variants of malware and zero-day attacks due to their vulnerability to obfuscation and polymorphism. However, behaviour-based malware detection, providing a deeper insight into the behaviour of malware execution, is more efficacious in malware family classification. This paper introduces a distinctive framework capable of correctly classifying familiar malware samples into their respective families. The research puts forward a comprehensive strategy for classifying malware families, from data preprocessing to feature selection, emphasising the pivotal role of machine learning in this process. The methodology employed involves three crucial stages: extraction of labels and features, representation of features, and finally, feature selection and classification. The study makes use of the publicly accessible "Malware Analysis Datasets: Top-1000 PE Imports" by IEEE, centring on the top 1000 imported functions culled from 'pe_imports' elements. These elements are detected utilising Cuckoo Sandbox, a robust and distributed framework for malware examination. The process of assigning labels to malware families is conducted through VirusTotal, which harnesses data from all available antivirus vendors, effectively mitigating potential issues related to label completeness, consistency, accuracy, and coverage. The features selected for malware classification revolve around the API calls tied to file, registry, network, process, and system, which are invoked during the execution of malware samples. Machine learning models, particularly Random Forests and Decision Trees, play a key role in feature selection, identifying 'Classification' and 'Family' as essential features for malware detection.
Their significance is further validated through Information Entropy, which utilises the Information Gain Ratio to rank features.
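
To illustrate the feature-ranking step mentioned in the abstract, the following is a minimal sketch of how an Information Gain Ratio could be computed for binary import features. It is not the thesis's implementation: the function names, the toy feature matrix, the three API names picked as columns, and the family labels are all assumptions made for illustration only. The gain ratio is taken here in its usual form, the information gain of a feature about the class label divided by the entropy of the feature itself.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    total = len(values)
    counts = Counter(values)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def gain_ratio(feature, labels):
    """Information gain of `feature` about `labels`, divided by the feature's entropy."""
    total = len(labels)
    # Group the labels by the value the feature takes for each sample.
    groups = {}
    for value, label in zip(feature, labels):
        groups.setdefault(value, []).append(label)
    # Entropy of the labels that remains once the feature value is known.
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    info_gain = entropy(labels) - conditional
    split_info = entropy(feature)
    # A feature with a single value carries no information; avoid dividing by zero.
    return info_gain / split_info if split_info > 0 else 0.0


def rank_features(columns, labels):
    """Return (name, gain ratio) pairs sorted from most to least informative."""
    scores = {name: gain_ratio(col, labels) for name, col in columns.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy, made-up data (assumption): 1 means the sample imported the API, 0 means it did not.
toy_columns = {
    "CreateFileW":   [1, 1, 0, 0, 1, 0],
    "RegSetValueEx": [1, 0, 1, 0, 1, 0],
    "Sleep":         [1, 1, 1, 1, 1, 1],
}
# Hypothetical family labels, one per sample.
toy_labels = ["trojan", "trojan", "worm", "worm", "trojan", "worm"]

ranking = rank_features(toy_columns, toy_labels)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

In this toy example the import that separates the two hypothetical families perfectly ranks first, while an import present in every sample ranks last, which is the behaviour one would expect from an entropy-based ranking.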