Abstract:
A phishing attack is an instance of social engineering in which the perpetrator
deceives the user to gain access to sensitive information and/or personal data without
authorization. This attack vector has become prevalent in recent years and
can result in substantial financial damage, as well as identity
theft, data loss, and long-term damage to an organization's reputation. In prior efforts
to counter this attack vector, researchers employed machine learning-based
approaches that rely on lexical analysis of URLs and make use of datasets
containing website URLs. However, these approaches are effective only on small
datasets and are unable to detect new phishing URLs. This research has
optimized an existing anti-phishing methodology to function on a larger dataset of
phishing website URLs. To this end, a dataset of 150,000 URLs was collected for
experimentation, and a set of optimized lexical features was incorporated. To obtain the
optimal set of features, a feature significance scheme based on Random Forest,
implemented in Python, was employed to reduce the number of lexical features from 70 to 15.
In the experiments, nine machine learning classification algorithms, including
Random Forest, Support Vector Machine, and Logistic Regression, were
evaluated. Precision, Recall, F1 Score, and Accuracy were measured and
compared against the benchmark study. The experiments show that the
proposed methodology achieved higher detection accuracy than the
benchmark approach on the larger phishing dataset (150,000 URLs), with the kNN classifier
achieving the best detection accuracy of 99.98%.
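
To illustrate what lexical analysis of a URL can look like in practice, the following is a minimal sketch using only the Python standard library. The function name `lexical_features`, the specific features computed, and the example URL are illustrative assumptions; they are not the 70 features (or the selected 15) used in this study.

```python
from urllib.parse import urlparse


def lexical_features(url: str) -> dict:
    """Compute a few illustrative lexical features from a URL string.

    The feature set here is hypothetical and chosen for illustration only.
    """
    # Add a scheme if missing so that urlparse can locate the host part.
    parsed = urlparse(url if "://" in url else "http://" + url)
    host = parsed.hostname or ""
    return {
        "url_length": len(url),
        "host_length": len(host),
        "path_length": len(parsed.path),
        "num_dots": url.count("."),
        "num_hyphens": url.count("-"),
        "num_digits": sum(ch.isdigit() for ch in url),
        "has_at_symbol": int("@" in url),
        # Rough count of host labels beyond a two-label domain.
        "num_subdomains": max(host.count(".") - 1, 0),
    }


# Hypothetical example URL, not taken from the dataset.
features = lexical_features(
    "http://login.example-bank.com.verify.example/account?id=123"
)
```

Feature vectors of this kind could then be ranked with a Random Forest importance score and passed to classifiers such as kNN, in the spirit of the pipeline summarized above.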