SECURE AND SCALABLE MALICIOUS URL DETECTION USING MACHINE LEARNING AND SERVERLESS COMPUTING

Shaheen, Sikandar

URI: http://10.250.8.41:8080/xmlui/handle/123456789/47295

Date: 2024-10-18

Abstract:

Malicious URLs have become a major threat vector on the Internet, with attackers using them to launch phishing campaigns, distribute malware, and exfiltrate data. Because malicious URLs are modeled after legitimate ones, users find them difficult to recognize as fraudulent. A single attack can target an organization with thousands of malicious URLs, resulting in data loss, financial loss, and reputational damage. Worse, cyber criminals revise their tactics quickly, which makes identification difficult and calls for effective countermeasures. This thesis presents a novel hybrid model that combines machine learning and serverless computing to tackle the challenge of detecting malicious URLs. In this research, I use a well-balanced dataset of 48,000 URLs drawn from reputable sources such as PhishTank and VirusTotal. Training on both benign and malicious URLs from such a varied dataset helps the model learn better and generalize well across various cyber threats. Through the feature extraction process, I selected 54 features (25 from the URL strings and 29 from the content of the respective web pages) for identifying malicious URLs. Multiple machine learning models were tested and evaluated, including Decision Tree, Random Forest, and AdaBoost. In the end, XGBoost emerged as the standout performer, achieving an accuracy of 98.14%. Testing showed that the hybrid model achieves a high detection rate for malicious URLs and that the serverless architecture significantly reduces processing time. Serverless computing also provided a sandbox-like environment for securely extracting features from webpage content.
This research demonstrates the efficiency of my hybrid model in precisely identifying malicious URLs and showcases the potential of merging machine learning with serverless computing to bolster our defenses against evolving cyber threats.
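To make the URL-string side of the feature extraction more concrete, the following is a minimal sketch of how lexical features might be computed from a URL. The abstract does not list the 25 URL features actually used, so the function name `extract_url_features` and every feature below are illustrative assumptions rather than the thesis's feature set. The sketch uses only the Python standard library and never fetches the page, so it does not cover the 29 content-based features.

```python
import ipaddress
from urllib.parse import parse_qs, urlparse


def extract_url_features(url: str) -> dict:
    """Compute a few lexical features from a URL string.

    Hypothetical feature set for illustration only; the thesis's
    25 URL-string features are not enumerated in the abstract.
    """
    parsed = urlparse(url)
    host = parsed.hostname or ""

    # A raw IP address in place of a domain name is a common
    # heuristic signal in URL-based detection.
    try:
        ipaddress.ip_address(host)
        host_is_ip = True
    except ValueError:
        host_is_ip = False

    # Count non-empty path segments, e.g. "/a/b/c.html" has depth 3.
    path_depth = len([seg for seg in parsed.path.split("/") if seg])

    return {
        "url_length": len(url),
        "host_length": len(host),
        "num_dots": url.count("."),
        "num_hyphens": url.count("-"),
        "num_digits": sum(ch.isdigit() for ch in url),
        "has_at_symbol": "@" in url,
        "uses_https": parsed.scheme == "https",
        "host_is_ip": host_is_ip,
        "path_depth": path_depth,
        "num_query_params": len(parse_qs(parsed.query)),
    }


if __name__ == "__main__":
    for name, value in extract_url_features(
        "http://192.168.0.1/login.php?user=a&id=2"
    ).items():
        print(f"{name}: {value}")
```

A dictionary like this could be turned into one row of a feature matrix and passed to a classifier such as those named in the abstract; how the thesis encodes and combines its features is not stated here.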