ENHANCING PHISHING DETECTION THROUGH MACHINE LEARNING

Zain ul Abidin

URI: http://10.250.8.41:8080/xmlui/handle/123456789/42622

Date: 2024

Abstract:

Phishing attacks, which use social engineering to unlawfully obtain sensitive data such as user credentials and personal information, are on the rise, and this increase highlights the need for more advanced detection methods. Traditional phishing detection strategies tend to work best on smaller datasets and often carry high computational costs because they rely on numerous features, which limits their scalability in machine learning applications. This research introduces a method that employs five well-known machine learning algorithms: Logistic Regression, Random Forest, Gradient Boosting, XGBoost, and LightGBM. The goal is to create a general framework for analyzing large-scale phishing data. A dataset of 274,131 URL entries was compiled from sources such as Kaggle, PhishTank, and OpenPhish. It covers a wide range of URL categories, including Benign, Defacement, Phishing, Malware, and Spam, and offers a broad foundation for the detection model. The data was thoroughly preprocessed to correct common issues such as incorrect formats, duplicates, broken links, and domain-only URLs, ensuring the dataset's quality for machine learning. A key aspect of the approach is its use of a relatively small set of features, even with larger datasets, which addresses a major limitation of previous methods. The processed data then underwent feature extraction, optimization, and evaluation within the proposed machine learning frameworks. The results show that the proposed approach outperforms existing techniques in detection accuracy, handling of large data volumes, and efficiency of feature use. In experiments, Random Forest, Gradient Boosting, XGBoost, and LightGBM achieved up to 98% accuracy in identifying phishing URLs within the 274,131-URL dataset.
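The preprocessing and small-feature-set ideas described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the function names (`clean_urls`, `extract_features`), the filtering rules, and the seven lexical features are assumptions chosen for illustration, and the abstract does not state which features were actually used. Checking for broken links would require network requests and is left out here.

```python
import ipaddress
from urllib.parse import urlsplit


def clean_urls(raw_urls):
    """Drop malformed, duplicate, and domain-only URLs (illustrative rules only)."""
    seen = set()
    cleaned = []
    for raw in raw_urls:
        url = raw.strip()
        try:
            parts = urlsplit(url)
        except ValueError:
            continue  # unparsable URL, treated as an incorrect format
        # Incorrect format: require an http(s) scheme and a host part.
        if parts.scheme not in ("http", "https") or not parts.netloc:
            continue
        # Domain-only URL: nothing after the host except an optional "/".
        if parts.path in ("", "/") and not parts.query:
            continue
        # Exact duplicate of a URL already kept.
        if url in seen:
            continue
        seen.add(url)
        cleaned.append(url)
    return cleaned


def _host_is_ip(host):
    """Return True if the host is a literal IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False


def extract_features(url):
    """Compute a small, hypothetical set of lexical features for one URL."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    return {
        "url_length": len(url),
        "host_length": len(host),
        "num_dots": host.count("."),
        "num_digits": sum(ch.isdigit() for ch in url),
        "num_special": sum(url.count(ch) for ch in "@-_?=&%"),
        "uses_https": int(parts.scheme == "https"),
        "host_is_ip": int(_host_is_ip(host)),
    }
```

Feature dictionaries of this kind could be stacked into a numeric matrix and passed to any of the five classifiers named in the abstract. Keeping the feature set small and purely lexical means each URL is processed without fetching the page, which is one plausible way to keep the cost manageable on a dataset of this size.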