NUST Institutional Repository

DETECTING WEB URL PHISHING: A LIGHTWEIGHT LEXICAL-BASED MACHINE LEARNING METHOD

dc.contributor.author Javaid, Muhammad Nouman
dc.date.accessioned 2023-08-23T08:53:39Z
dc.date.available 2023-08-23T08:53:39Z
dc.date.issued 2023
dc.identifier.other 330691
dc.identifier.uri http://10.250.8.41:8080/xmlui/handle/123456789/37267
dc.description Supervisor: Dr. Mehdi Hussain en_US
dc.description.abstract Phishing is a class of cyber-attack that uses social engineering techniques to steal sensitive information, such as user credentials, personal data, medical records, account details, or business details, from individuals or groups of people. The frequency of phishing attacks has recently increased substantially, necessitating the development of more robust phishing detection methods. Most existing phishing detection approaches are found to be reliable only for small datasets. In addition, they depend on a large number of features, which demands heavy computation and limits their scalability with machine learning models. This study uses conventional machine learning algorithms to present a generalized approach to large-scale phishing data. We collected a larger phishing URL dataset of 100,000 records from Alexa, PhishTank, and OpenPhish and employed supervised machine learning models for phishing detection. The collected data comprised 50,000 phishing URLs and 50,000 benign URLs, yielding a balanced dataset for the detection model. We cleaned the experimental data through multiple stages of pre-processing to remove incorrect, corrupted, incorrectly formatted, duplicate, broken, and domain-only URLs. The approach requires only a small number of features, even for the larger dataset. Features are then extracted from the pre-processed data, optimized, and evaluated with the proposed machine learning models. The proposed approach outperformed existing techniques in detection accuracy while handling a larger dataset and using a limited number of phishing URL attributes. The experiments show that Random Forest achieves the best detection accuracy, 97.44%, on our proposed dataset of 100,000 URLs. en_US
dc.language.iso en en_US
dc.publisher School of Electrical Engineering and Computer Sciences (SEECS), NUST en_US
dc.subject URL, Phishing, Benign, Websites, Anti-Phishing, Machine Learning en_US
dc.title DETECTING WEB URL PHISHING: A LIGHTWEIGHT LEXICAL-BASED MACHINE LEARNING METHOD en_US
dc.type Thesis en_US
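
The abstract describes cleaning the URL list (dropping duplicate, malformed, and domain-only entries) and then extracting a small set of lexical features. The record does not list the thesis's actual cleaning rules or feature set, so the following is only a minimal Python sketch under assumed choices: the helper names `is_usable`, `clean`, and `lexical_features`, and every feature in the returned dictionary, are hypothetical illustrations rather than the author's method.

```python
from urllib.parse import urlsplit


def is_usable(url: str) -> bool:
    """Assumed cleaning rule: keep http(s) URLs that have a host and a path or query."""
    try:
        parts = urlsplit(url.strip())
    except ValueError:
        # urlsplit raises ValueError for some malformed inputs (e.g. bad IPv6 brackets)
        return False
    if parts.scheme not in ("http", "https") or not parts.hostname:
        return False
    # A domain-only URL has no path beyond "/" and no query string
    return parts.path not in ("", "/") or bool(parts.query)


def clean(urls):
    """Drop exact duplicates and unusable URLs, preserving first-seen order."""
    seen = set()
    kept = []
    for raw in urls:
        url = raw.strip()
        if url in seen or not is_usable(url):
            continue
        seen.add(url)
        kept.append(url)
    return kept


def lexical_features(url: str) -> dict:
    """Illustrative lexical features computed from the URL string alone (assumed set)."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    return {
        "url_length": len(url),
        "host_length": len(host),
        "num_dots_in_host": host.count("."),
        "num_digits": sum(c.isdigit() for c in url),
        "num_hyphens": url.count("-"),
        "has_at_symbol": "@" in url,
        "path_depth": parts.path.count("/"),
        "uses_https": parts.scheme == "https",
    }


if __name__ == "__main__":
    sample = [
        "http://example.com/login.php?id=1",
        "http://example.com/login.php?id=1",  # duplicate
        "http://example.com",                 # domain-only
        "not a url",                          # malformed
        "https://a-b.example.org/x/y",
    ]
    for u in clean(sample):
        print(u, lexical_features(u))
```

Feature dictionaries of this kind could be turned into a numeric matrix and passed to a supervised classifier such as a Random Forest, which is the model family the abstract reports as most accurate. Because the features are computed from the URL string alone, no page fetch or DNS lookup is needed.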


This item appears in the following Collection(s)

  • MS [146]
