Abstract:
Phishing attacks, which use social engineering to unlawfully obtain sensitive data such as user
credentials and personal information, are on the rise, and this growth highlights the need for more
advanced detection methods. Traditional phishing detection strategies tend to work best on smaller
datasets and often incur high computational cost because they rely on numerous features, which
limits their scalability in machine learning applications.
This research introduces an approach that employs five well-known machine learning
algorithms: Logistic Regression, Random Forest, Gradient Boosting, XGBoost, and LightGBM.
The goal is to create a general framework for analyzing large-scale phishing data. An extensive
dataset of 274,131 URL entries has been compiled from sources such as Kaggle, PhishTank,
and OpenPhish. This dataset covers a wide range of URL categories, including Benign,
Defacement, Phishing, Malware, and Spam, offering a broad foundation for the detection model.
The data were thoroughly preprocessed to correct common issues such as incorrect formats,
duplicates, broken links, and domain-only URLs, ensuring the dataset's quality for machine
learning. A key aspect of this approach is its use of a relatively small feature set even on
large datasets, which addresses a major limitation of previous methods.
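
The cleaning steps named above can be illustrated with a minimal sketch. It is not the authors'
pipeline: the function name clean_urls, the choice to accept only http and https, and the
defaulting of a missing scheme to http are assumptions made for illustration. Checking for
broken links is left out because it requires network access.

```python
from urllib.parse import urlsplit


def clean_urls(raw_urls):
    """Return well-formed, deduplicated URLs that carry more than a bare domain.

    Hypothetical helper for illustration; the rules below are assumptions.
    """
    seen = set()
    cleaned = []
    for raw in raw_urls:
        url = raw.strip()
        # Assumption: a missing scheme means http, so the host can be parsed.
        if "://" not in url:
            url = "http://" + url
        try:
            parts = urlsplit(url)
        except ValueError:
            continue  # malformed URL, e.g. an unclosed IPv6 bracket
        if parts.scheme not in ("http", "https") or not parts.hostname:
            continue  # incorrect format
        if parts.path in ("", "/") and not parts.query:
            continue  # domain-only URL
        # Scheme and host are case-insensitive; path and query are not.
        key = (parts.scheme, parts.hostname, parts.path, parts.query)
        if key in seen:
            continue  # duplicate
        seen.add(key)
        cleaned.append(url)
    return cleaned
```

Calling clean_urls on a list of raw strings returns the surviving URLs in their original order,
so label columns can be filtered alongside them.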
The processed data then underwent feature extraction, optimization, and evaluation within the
proposed machine learning framework. The results show that the proposed approach outperforms
existing techniques in detection accuracy, handling of large data volumes, and efficiency of
feature use. In particular, Random Forest, Gradient Boosting, XGBoost, and LightGBM achieve up
to 98% accuracy in identifying phishing URLs within the 274,131-URL dataset.
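
To make the idea of a small feature set concrete, the sketch below computes a handful of lexical
features from a URL string. The abstract does not list the features actually used, so every
feature here, and the function name lexical_features, is a hypothetical stand-in.

```python
from urllib.parse import urlsplit


def lexical_features(url):
    """Compute a few cheap lexical features from one URL (illustrative only)."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    return {
        "url_length": len(url),
        "host_length": len(host),
        "num_dots": host.count("."),
        "num_digits": sum(ch.isdigit() for ch in url),
        # Assumed set of characters often inspected in URL-based detection.
        "num_special": sum(ch in "@-_?=&%" for ch in url),
        "uses_https": int(parts.scheme == "https"),
    }
```

Each dictionary can be turned into one row of a numeric feature matrix, which any of the five
classifiers named above could then consume.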