Code Clone Detection in C Language Programs using Supervised Learning

Ain, Qurat Ul

URI: http://10.250.8.41:8080/xmlui/handle/123456789/35937

Date: 2019

Abstract:

Code cloning is the duplication of source code. It results from copying and pasting a code fragment into another section of code, with or without minor modification, and it is the most common way of reusing source code in software development. Several studies suggest that 20 to 50 percent of the code in large software systems is cloned. If a bug is identified in one code segment, all similar segments need to be checked for the same bug. Consequently, cloning may lead to bug propagation that significantly affects maintenance cost. Because of this problem, Code Clone Detection (CCD) has become an active area of research. Several tools and techniques have been introduced so far for detecting code clones in various programming languages. However, most of them cannot detect the most difficult type of clones: semantic, or Type 4, clones. The few tools and techniques that can detect these clones rely on traditional methods that detect Type 4 clones with low accuracy. In the literature we find only a few (3 or 4) studies that attempt to detect all types of clones, including Type 4 clones, with good results (accuracy, execution), but their capabilities are limited to Java code because the compilers or parsers used by these approaches work for Java code only. Current approaches are therefore inadequate for detecting semantic clones together with the other three types of clones (Type 1, Type 2 and Type 3) with good results in other programming languages (e.g. C/C++). In this research work we attempt to improve the detection accuracy for semantic, or Type 4, clones without compromising the accuracy for the other three types of clones in C programs. For this purpose, we conduct an experiment using 2 datasets (Krawitz and Roy et al.). Rather than relying on manually defined features for code clone detection, our framework automatically extracts features by analyzing the abstract syntax trees (ASTs) of source code.
Afterwards, a supervised-learning-based classification model is used, and we conduct 2 sets of experiments for code clone detection. Each set consists of pair-instance features formed using linear combination. The classification model is trained and tested using different types of validation. Furthermore, to check the effectiveness of the proposed framework when non-clones occur in the dataset, we manually add some non-clones and repeat the whole process. The performance of our framework is compared with state-of-the-art and popular code clone detection approaches used in several recent studies. Results indicate that the proposed framework is superior in detecting Type 4 clones and comparable in finding Type 1 clones. However, our framework does not give acceptable results in finding Type 2 and Type 3 clones. Therefore, we perform some extended experiments and obtain valuable results on all types of clones.
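
As an illustration of what a pair-instance feature built by linear combination might look like, the following minimal Python sketch combines the feature vectors of two code fragments into a single vector that a clone/non-clone classifier could consume. The abstract does not state which combination or which features the thesis uses, so the function name `pair_feature`, the weights, and the AST node-type counts below are assumptions made for illustration only.

```python
from typing import List, Sequence


def pair_feature(a: Sequence[float], b: Sequence[float],
                 w_a: float = 1.0, w_b: float = -1.0) -> List[float]:
    """Combine two per-fragment feature vectors as w_a * a + w_b * b.

    The default weights (1, -1) give the element-wise difference;
    this choice is an assumption, not taken from the thesis.
    """
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    return [w_a * x + w_b * y for x, y in zip(a, b)]


# Hypothetical AST node-type counts per fragment:
# [if statements, loops, function calls, assignments]
fragment_1 = [2, 1, 3, 5]
fragment_2 = [2, 1, 4, 5]

print(pair_feature(fragment_1, fragment_2))  # [0.0, 0.0, -1.0, 0.0]
```

In this sketch, a pair vector close to zero means the two fragments have similar structural counts. Each such vector, labelled as clone or non-clone, would form one training instance for the classification model.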