Malware Detection through Activity logs and Apply Machine Learning to detect new Malware

Kazmi, Hasnain Taqi

dc.contributor.author	Kazmi, Hasnain Taqi
dc.date.accessioned	2023-08-30T04:29:39Z
dc.date.available	2023-08-30T04:29:39Z
dc.date.issued	2023
dc.identifier.other	320246
dc.identifier.uri	http://10.250.8.41:8080/xmlui/handle/123456789/37858
dc.description	Supervisor: Dr. Sidra Sultana	en_US
dc.description.abstract	Amid mounting and increasingly intricate cyber threats, malware has emerged as a substantial challenge in today's digital world. Traditional defences, which depend on static analysis and signature-based tactics, frequently fail to detect and classify malware variants and zero-day attacks because they are vulnerable to obfuscation and polymorphism. Behaviour-based malware detection, by contrast, provides deeper insight into how malware executes and is more effective for malware family classification. This paper introduces a framework capable of correctly classifying known malware samples into their respective families. The research puts forward a comprehensive strategy for classifying malware families, from data preprocessing to feature selection, emphasising the pivotal role of machine learning in this process. The methodology involves three stages: extraction of labels and features, representation of features, and feature selection and classification. The study uses the publicly accessible "Malware Analysis Datasets: Top-1000 PE Imports" by IEEE, centring on the top 1000 imported functions taken from 'pe_imports' elements. These elements are extracted using Cuckoo Sandbox, a distributed framework for malware analysis. Malware family labels are assigned through VirusTotal, which draws on data from all available antivirus vendors, mitigating potential issues related to label completeness, consistency, accuracy, and coverage. The features selected for malware classification revolve around the API calls tied to file, registry, network, process, and system activity, which are invoked during the execution of malware samples. Machine learning models, particularly Random Forests and Decision Trees, play a key role in feature selection, identifying 'Classification' and 'Family' as essential features for malware detection. Their significance is further validated through Information Entropy, which uses the Information Gain Ratio to rank features.	en_US
dc.language.iso	en	en_US
dc.publisher	School of Electrical Engineering and Computer Sciences (SEECS), NUST	en_US
dc.title	Malware Detection through Activity logs and Apply Machine Learning to detect new Malware	en_US
dc.type	Thesis	en_US
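
The abstract mentions ranking features by Information Gain Ratio but this record gives no implementation details. The following is a minimal sketch of that ranking step only, assuming each feature is a binary indicator of whether a sample imports a given API function. The function names (`entropy`, `gain_ratio`, `rank_features`), the four toy samples, and the family labels are all hypothetical illustrations and are not taken from the thesis or its dataset.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    total = len(values)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(values).values())


def gain_ratio(feature_values, labels):
    """Information gain of a feature divided by its split information."""
    total = len(labels)
    # Group the labels by the value the feature takes.
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    info_gain = entropy(labels) - conditional
    split_info = entropy(feature_values)
    # A feature with a single value carries no split information.
    return info_gain / split_info if split_info > 0 else 0.0


def rank_features(rows, labels, names):
    """Return (feature name, gain ratio) pairs, highest ratio first."""
    scores = {name: gain_ratio([row[i] for row in rows], labels)
              for i, name in enumerate(names)}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy data (made up): 1 means the sample imports the function.
names = ["CreateFileW", "RegSetValueExW", "InternetOpenA"]
rows = [
    [1, 1, 1],
    [1, 0, 1],
    [0, 1, 1],
    [0, 0, 0],
]
labels = ["trojan", "trojan", "worm", "worm"]

ranked = rank_features(rows, labels, names)
for name, score in ranked:
    print(f"{name}: {score:.3f}")
```

In this toy data the first feature separates the two families perfectly, so it ranks highest, while the second is independent of the label and ranks last. Dividing by split information is what distinguishes the gain ratio from plain information gain: it penalises features that split the data into many small groups.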