Abstract:
Malicious URLs have become a major threat vector on the Internet, with attackers
using them to launch phishing campaigns, distribute malware, or
even exfiltrate data. Because malicious URLs are modeled after legitimate ones,
they can be difficult for users to recognize as fraudulent. A single attack can target
an organization with thousands of malicious URLs, resulting in data loss, financial loss,
and reputational damage. Worse, cyber criminals revise their tactics quickly,
which makes identification difficult and calls for effective countermeasures.
This thesis presents a novel hybrid model that combines machine learning with serverless
computing to tackle the challenge of detecting malicious URLs. In this research, I
use a balanced dataset of 48,000 URLs collected from reputable sources
such as PhishTank and VirusTotal. Training on both benign and malicious URLs
from such a varied dataset helps the model learn
better and generalize well across various cyber threats. Through feature
extraction, I selected 54 features (25 from the URL strings and 29 from
the content of the corresponding web pages) to identify malicious URLs. Multiple
machine learning models were trained and evaluated, including Decision Tree, Random
Forest, and AdaBoost. In the end, XGBoost emerged as the best performer, achieving
an accuracy of 98.14%. Testing showed that the hybrid model achieves
a high detection rate for malicious URLs and that the serverless architecture
significantly reduces processing time. Serverless computing
also provided a sandbox-like environment for securely extracting features from web page
content. This research demonstrates the effectiveness of my hybrid model in accurately
identifying malicious URLs and shows the potential of combining machine learning
with serverless computing to strengthen defenses against evolving cyber threats.
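
As an illustration of what URL-string feature extraction can look like, the following is a minimal sketch in Python. The function name extract_url_features and the eight features it computes are assumptions chosen for illustration; they are not the 25 URL features selected in this thesis, which the abstract does not enumerate.

```python
import re
from urllib.parse import urlparse


def extract_url_features(url: str) -> dict:
    """Compute a few simple lexical features from a URL string.

    Hypothetical feature set for illustration only; it is not the
    feature list used in the thesis.
    """
    parsed = urlparse(url)
    host = parsed.hostname or ""
    return {
        # Overall length of the URL and of its host part
        "url_length": len(url),
        "host_length": len(host),
        # Number of dots in the host (a rough proxy for subdomain depth)
        "num_dots": host.count("."),
        # Number of digit characters anywhere in the URL
        "num_digits": sum(ch.isdigit() for ch in url),
        # An "@" can hide the real host behind a fake user-info prefix
        "has_at_symbol": "@" in url,
        "uses_https": parsed.scheme == "https",
        # Host written as a dotted IPv4 address instead of a domain name
        "host_is_ipv4": bool(re.fullmatch(r"(\d{1,3}\.){3}\d{1,3}", host)),
        # Number of non-empty path segments
        "path_depth": len([seg for seg in parsed.path.split("/") if seg]),
    }


if __name__ == "__main__":
    print(extract_url_features("http://192.168.0.1/login/verify.php"))
```

A dictionary like this, computed for every URL in the dataset, could then be converted into a numeric feature matrix and passed to a classifier such as those named above.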