Abstract:
Phishing refers to cyber-attacks that use social engineering techniques to steal sensitive
information, such as user credentials, personal data, medical records, account details, or
business details, from individuals or groups of people. The frequency of phishing attacks has
recently increased substantially, necessitating the development of more robust phishing
detection methods. Most existing phishing detection approaches are reliable only on small
datasets. In addition, they depend on a large number of features, which makes them
computationally expensive and limits their scalability with machine learning models. This
study presents a generalized approach to large-scale phishing data using conventional machine
learning algorithms. We collected a large phishing URL dataset of 100,000 records from Alexa,
PhishTank, and OpenPhish and employed supervised machine learning models for phishing
detection. The collected data comprised 50,000 phishing URLs and 50,000 benign URLs, yielding
a balanced dataset for the detection model. We cleaned our experimental data
through multiple pre-processing stages to remove incorrect, corrupted, incorrectly formatted,
duplicate, broken, and domain-only URLs. The approach requires only a limited number of
features, even for a larger dataset. Features are then extracted from the pre-processed data,
optimized, and evaluated with the proposed machine learning models. The proposed approach
outperformed existing techniques in terms of detection accuracy, dataset size, and the limited
number of phishing URL attributes required. In our experiments, Random Forest achieved the
best detection accuracy, 97.44%, on our proposed dataset of 100,000 URLs.
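
The cleaning stage described above can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names `clean_urls` and `is_domain_only`, the restriction to `http`/`https` schemes, and the definition of a domain-only URL (no path, query, or fragment) are assumptions made for illustration, using only the Python standard library.

```python
from urllib.parse import SplitResult, urlsplit


def is_domain_only(parts: SplitResult) -> bool:
    """Assumed definition: nothing beyond the host (no path, query, fragment)."""
    return parts.path in ("", "/") and not parts.query and not parts.fragment


def clean_urls(urls):
    """Return well-formed, unique, non-domain-only URLs in first-seen order."""
    seen = set()
    cleaned = []
    for raw in urls:
        url = raw.strip()
        try:
            parts = urlsplit(url)
        except ValueError:
            continue  # corrupted URL, e.g. an unbalanced "[" in the host
        # Assumption: only http(s) URLs with a host count as correctly formatted.
        if parts.scheme not in ("http", "https") or not parts.netloc:
            continue
        if is_domain_only(parts):
            continue  # domain-only URLs carry no path-level features
        if url in seen:
            continue  # exact duplicate
        seen.add(url)
        cleaned.append(url)
    return cleaned
```

A real pipeline would likely also normalize case and trailing slashes before de-duplication and check whether each URL still resolves; those steps are omitted here.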