Abstract:
Accurate and reliable Pattern Recognition is critical in many fields such as medicine, biometrics,
character recognition, speech recognition, bioinformatics etc. Pattern recognition attempts to
instill in computers some of the cognitive abilities of humans. Based on some already known data
samples, the system is trained and it is then able to recognize and classify objects by using the
information it has learnt. The samples are represented by different features. In real world, these
features may be hundreds and thousands in number. Some of these features may be redundant
which may not provide any help in classifying the objects. When these data samples are gathered,
noise may be added in those feature values which can actually play a negative role by classifying
incorrectly. Also, the process of training and recognition will take a lot of time when all these
features are used. Thus feature subset selection i.e. the process of selecting a smaller subset of
features from the entire feature set, so that maximum accuracy can be achieved in classification
in a reasonable amount of time, is an important area of research.
This thesis describes feature subset selection and its implementation by using Minimum
Spanning Trees. Graphs are built on the sample training data with the nodes of the graph equal
to the number of data points. The edges of the graph are constructed by calculating the euclidean
distance between samples using some features. Minimum spanning trees are then built on the
graph for different feature subsets. These trees are then evaluated through a criterion function to
determine the best spanning tree which will result in the best accuracy. The criterion function
chooses such spanning trees which have dense clusters of samples of one class distinctively
separated from clusters of other classes. This ensures good accuracy for recognition through the
nearest neighbor method. For determining the recognition and classification performance of the
system, three data sets from the field of medicine are used. Maximum classification accuracy of
96% is achieved using these data sets.
The main phases of the implementation are training, feature subset selection, recognition and
classification. For feature subset selection, minimum spanning trees are used to select the best
feature subset that provides good accuracy. For recognition and classification, k-nearest
neighbor approach is used in which the user can specify the desired value for ‘k’. The True
Positive and False Positive rates are then calculated to assess the accuracy of the system.