Abstract:
Code cloning refers to the duplication of source code. It occurs when code is copied and pasted
into another section of a program, with or without minor modification. It is the most common way of
reusing source code in software development. Several studies suggest that roughly 20-50 percent
of the code in large software systems is cloned. If a bug is identified in one segment of code, all
similar segments need to be checked for the same bug. Consequently, cloning
may lead to bug propagation that significantly affects maintenance cost. To address
this problem, Code Clone Detection (CCD) has emerged as an active area of research. Several tools and
techniques have been introduced so far for detecting code clones in various programming
languages. However, most of them are unable to detect the most difficult type of clones,
semantic or Type 4 clones. The few tools and techniques that can detect these clones rely on traditional
methods, which detect Type 4 clones with low accuracy. In the literature we find only a few (3 or 4)
studies that attempt to detect all types of clones, including Type 4 clones, with good results
(accuracy, execution), but their capabilities are limited to Java code because the compilers or
parsers used by these approaches work for Java code only. Current approaches are therefore
inadequate for detecting semantic clones along with the other three types of clones
(Type 1, Type 2 and Type 3) with good results in other programming languages (e.g. C/C++).
In this research work we attempt to improve the detection accuracy for semantic or Type 4 clones without
compromising the accuracy for the other three types of clones in C programs. For this purpose, we
conduct an experiment using 2 datasets (Krawitz and Roy et al.). Instead of manually
defining features for code clone detection, our framework automatically extracts features by
analyzing abstract syntax trees (ASTs) of source code. Afterwards, a supervised learning based
classification model is used, and we conduct 2 sets of experiments for code clone detection. Each set
consists of pair instance features built using linear combination. The classification model is trained and
tested using different types of validation. Furthermore, to check the effectiveness of the proposed
framework when non-clones occur in the dataset, we manually add some non-clones and repeat the
whole process.
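As an illustration only, the idea of building pair instance features by linear combination can be sketched as follows. This is a minimal sketch under stated assumptions, not the framework's actual implementation: the node-type vocabulary NODE_TYPES, the helper names to_vector and pair_features, and the weights a and b are all hypothetical, and the per-fragment counts are written by hand rather than extracted from a real C parser.

```python
from typing import Dict, List

# Hypothetical vocabulary of AST node types (an assumption for illustration;
# the actual feature set used by the framework is not specified here).
NODE_TYPES = ["FuncDef", "For", "While", "If", "Assignment", "BinaryOp", "Return"]


def to_vector(counts: Dict[str, int]) -> List[int]:
    """Turn a mapping of node type -> count into a fixed-order feature vector."""
    return [counts.get(node_type, 0) for node_type in NODE_TYPES]


def pair_features(x: List[int], y: List[int], a: float = 1.0, b: float = -1.0) -> List[float]:
    """Combine two fragment vectors element-wise as a*x + b*y.

    With the default weights (a=1, b=-1) this is the element-wise difference;
    with a=1, b=1 it is the element-wise sum. Both weights are assumptions.
    """
    if len(x) != len(y):
        raise ValueError("feature vectors must have the same length")
    return [a * xi + b * yi for xi, yi in zip(x, y)]


# Hand-written node counts for two hypothetical C functions that compute the
# same sum, one with a for loop and one with a while loop.
frag_for = to_vector({"FuncDef": 1, "For": 1, "Assignment": 2, "BinaryOp": 2, "Return": 1})
frag_while = to_vector({"FuncDef": 1, "While": 1, "Assignment": 3, "BinaryOp": 2, "Return": 1})

# One pair instance: the combined vector would be labelled clone / non-clone
# and passed to a supervised classifier.
pair_diff = pair_features(frag_for, frag_while)
pair_sum = pair_features(frag_for, frag_while, a=1.0, b=1.0)
```

In such a setup, each pair vector together with its clone or non-clone label would form one training instance for the classifier; the choice of weights determines which relationship between the two fragments the features expose.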
The performance of our framework is compared with state-of-the-art and popular code clone
detection approaches that are used in several recent studies. Results indicate that the proposed
framework is superior in the detection of Type 4 clones and comparable in finding Type 1 clones.
However, our framework does not give acceptable results in finding Type 2 and Type 3 clones.
Therefore, we perform some extended experiments and obtain valuable results on all types of clones