Abstract:
Uniform Resource Locators (URLs) have been a foundation of the Web since
its origin. They are the reference point for any resource in cyberspace.
According to the Verizon 2016 DBIR, attacks on web applications are the
single biggest source of data loss, and they account for over 40% of
incidents resulting in data breaches. The main challenges are as follows.
Firstly, URLs
are sometimes hidden, shortened, or encoded, so humans cannot readily
judge whether they are legitimate. While this is typical for URLs,
attackers use it to their advantage. Secondly, the automation of attacks
using domain
generation algorithms (DGAs) and exploit toolkits has led to the need for
an automated and proactive protection system for malicious URL detection.
Thirdly, attackers can manipulate users and redirect them to an intended
URL, without the need for a click, through a variety of attacks such as
phishing and drive-by download. Existing malicious website
detection techniques are limited in terms of accuracy and time, which are
inversely related and are difficult to achieve together at a good level.
Also, few research works focus on specific attack types such as domain
typosquatting.
The aim of this work is the development of a system, URL-Analyzer, for
malicious URL detection with a good trade-off between accuracy and time.
This novel
contribution is based on Natural Language Processing (NLP) techniques and
a breadth of features: (i) URL lexical, (ii) host-based, (iii) social
reputation, and (iv) time-based features, for the detection of phishing,
malware, drive-by download, and typosquatting. The input to the
URL-Analyzer consists of known benign and malicious URL datasets. To
detect signs of maliciousness in the URL, a static analysis technique is
used for feature extraction and machine learning is employed for
classification. The classification accuracy achieved
ranges from 97% to 98.5%, with an average feature extraction time of
1.98 ms, 8 s, and 34.25 s per URL for the URL-Analyzer's three developed
modes: (i) offline, (ii) online, and (iii) partially online, respectively.
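
To make the notion of URL lexical features concrete, the following is a
minimal sketch of static feature extraction from the URL string alone,
with no network access. It is an illustration only and not the
URL-Analyzer implementation: the function name lexical_features and the
particular features chosen are assumptions made for this example.

```python
import ipaddress
from urllib.parse import parse_qsl, urlsplit


def lexical_features(url: str) -> dict:
    """Compute a few illustrative lexical features from a URL string.

    Hypothetical helper: the feature set is an assumption for this
    sketch, not the feature set used by URL-Analyzer.
    """
    # urlsplit only fills in the host when a scheme is present, so a
    # default scheme is prepended to scheme-less input.
    parts = urlsplit(url if "://" in url else "http://" + url)
    host = parts.hostname or ""

    # A raw IP address in place of a domain name is a common lexical cue.
    try:
        ipaddress.ip_address(host)
        host_is_ip = True
    except ValueError:
        host_is_ip = False

    return {
        "url_length": len(url),
        "host_length": len(host),
        "num_dots_in_host": host.count("."),
        "num_hyphens_in_host": host.count("-"),
        "num_digits": sum(ch.isdigit() for ch in url),
        "has_at_symbol": "@" in url,
        "host_is_ip": host_is_ip,
        "path_depth": len([seg for seg in parts.path.split("/") if seg]),
        "num_query_params": len(parse_qsl(parts.query)),
    }


if __name__ == "__main__":
    for example in ("http://192.168.0.1/login.php?user=a&x=1",
                    "www.paypa1-secure.example.com/a/b"):
        print(example, lexical_features(example))
```

A dictionary of this kind, computed for every URL in the benign and
malicious datasets, could then be turned into a numeric vector and passed
to a machine learning classifier.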