Abstract:
In this thesis, a real-time URL based filtering system capable of URL filtering
and blocking from domain level to sub folder level is proposed. It can be used
as a standalone solution and can be integrated at any network level. The
modular approach provides scalability by stacking hardware boxes to meet
the current and future bandwidth requirements. The total delay caused by
our proposed system is less then 1ms which is highly desirable for real-time
filtering systems.
The core component of real-time URL based filtering system is URL
lookup and storage engine. In this work, hash tables are used to develop the
URL lookup and storage engine. The performance of hash table is dependent
upon hash function used for implementation of hash table. To gain the space
and time advantage using Hash Tables for real-time application over other
constant space-set data structures following approaches are used: 1) The use
of hash functions for compressing variable length URL strings to fixed size
integers and dynamic allocation of memory 2) To compensate the collision
problem associated with hash functions by performing statistical analysis
and implementation of various non-cryptographic hash functions to identify
their random nature and hash table resize operation depending upon the
load factor. The performance analysis is performed mainly using statistical
ii
iii
studies on the sequences generated using five widely used non-cryptographic
hash functions: 1) CRC, 2) Adler, 3) FNV, 4) DJBX33A, and 5) Murmur.
The comparative analysis of tested non-cryptographic hash functions shows
that the Adler hash function is not suitable for hash table implementation,
whereas, the rest of non-cryptographic hash functions exhibit similar and bet-
ter randomizing features which make them an attractive choice for hash table
implementation. The results of these statistical studies have been verified by
the implementation of hash table using these non-cryptographic hash func-
tions. The implementation results show that the average number of probes
for Adler based hash table varies between 1.25 and 2.75 for different load
factors and hash table sizes; whereas, for the rest of non-cryptographic hash
functions the average number of probes in a hash table is 1, which is highly
desirable for real-time network applications. Thus proving that 1) CRC, 2)
FNV, 3) DJBX33A, and 4) Murmur non-cryptographic hash functions are
good choices for hash table based implementation for real-time storage and
lookup of uniform resource locators.