dc.description.abstract |
Over the last decade, extensive research has sought to improve software defect prediction methods, with the goals of greater cost-effectiveness, reduced effort, and reduced time, in order to achieve good quality control and deliver bug-free software to the end user. Several machine learning techniques have been applied to improve software defect prediction. This research aims to demonstrate the positive effects of data sampling, feature subset selection, and ensemble learning on defect prediction classification. Along with data sampling of defect datasets and feature subset selection, an ensemble model algorithm is proposed that is robust to both feature redundancy and data imbalance. We carefully combine a variety of strong learning algorithms into ensemble learning models and use data sampling techniques with effective feature subset selection to address these issues and nullify their effects on defect prediction classification performance. Forward and backward feature selection revealed that only a few features contribute to a high area under the curve (AUC). On the tested datasets, the genetic forward selection method outperformed other feature selection techniques such as correlation-based feature selection and Info Gain attribute selection. This suggests that the features used are extremely unbalanced. However, ensemble learners such as the proposed algorithm, random forests, and the average probability ensemble are less affected by poor features than support vector machines (SVM). In addition, the proposed model combined with genetic forward selection achieved area under the receiver operating characteristic curve (AUC) values of almost 1 on the NASA datasets. This research shows that software defect prediction requires well-balanced datasets for training. Features must also be selected in a way that ensures optimized classification of defective components.
Moreover, addressing the above-mentioned data issues in combination with the proposed model resulted in exceptional performance, leading to nearly perfect methods for quality control. |
en_US |