DESIGN AND IMPLEMENTATION OF DISEASE RECOGNITION SYSTEM USING MINIMUM SPANNING TREES

MUKHTAR, REEMA

DSpace Home
→
E-Theses
→
CEME
→
Computer Software Engineering
→
MS
→
View Item

DESIGN AND IMPLEMENTATION OF DISEASE RECOGNITION SYSTEM USING MINIMUM SPANNING TREES

MUKHTAR, REEMA

URI: http://10.250.8.41:8080/xmlui/handle/123456789/37471

Date: 2009

Abstract:

Accurate and reliable Pattern Recognition is critical in many fields such as medicine, biometrics, character recognition, speech recognition, bioinformatics etc. Pattern recognition attempts to instill in computers some of the cognitive abilities of humans. Based on some already known data samples, the system is trained and it is then able to recognize and classify objects by using the information it has learnt. The samples are represented by different features. In real world, these features may be hundreds and thousands in number. Some of these features may be redundant which may not provide any help in classifying the objects. When these data samples are gathered, noise may be added in those feature values which can actually play a negative role by classifying incorrectly. Also, the process of training and recognition will take a lot of time when all these features are used. Thus feature subset selection i.e. the process of selecting a smaller subset of features from the entire feature set, so that maximum accuracy can be achieved in classification in a reasonable amount of time, is an important area of research. This thesis describes feature subset selection and its implementation by using Minimum Spanning Trees. Graphs are built on the sample training data with the nodes of the graph equal to the number of data points. The edges of the graph are constructed by calculating the euclidean distance between samples using some features. Minimum spanning trees are then built on the graph for different feature subsets. These trees are then evaluated through a criterion function to determine the best spanning tree which will result in the best accuracy. The criterion function chooses such spanning trees which have dense clusters of samples of one class distinctively separated from clusters of other classes. This ensures good accuracy for recognition through the nearest neighbor method. For determining the recognition and classification performance of the system, three data sets from the field of medicine are used. Maximum classification accuracy of 96% is achieved using these data sets. The main phases of the implementation are training, feature subset selection, recognition and classification. For feature subset selection, minimum spanning trees are used to select the best feature subset that provides good accuracy. For recognition and classification, k-nearest neighbor approach is used in which the user can specify the desired value for ‘k’. The True Positive and False Positive rates are then calculated to assess the accuracy of the system.