Abstract:
Genome data analysis and clustering are of great importance in disease detection and prevention. A genome-wide association study extracts markers or probes from the DNA of a sample population to study genetic variation with respect to a particular disease. These studies help identify genetic variations that may cause diseases such as heart disease, cancer, and mental illness. Genome data usually have a very small population size 𝑛 and a large number of markers 𝑝; this limits their study, since most existing algorithms work well only on 𝑝 << 𝑛 datasets. Analyzing and processing genome data therefore requires selecting prominent genes and using robust algorithms that can cope with the curse of dimensionality. Clustering and visualization of genomic data are equally important in bioinformatics applications, since biologists are interested in identifying patterns in the data that can give information about undiscovered diseases. Principal component analysis (PCA) is often used to represent a high-dimensional dataset with fewer components that can be visualized graphically. Classical PCA fails when the number of attributes exceeds the sample size, because the sample covariance matrix is then singular. We propose using a shrinkage estimate of the covariance matrix in principal component analysis rather than the sample covariance matrix. The shrinkage estimate has properties which ensure that it does not give negative eigenvalues for 𝑝 >> 𝑛 data, is not sensitive to outliers, and can be computed for data with missing values. Since genome data have thousands of attributes, not all attributes are useful in a classification problem. In this study we also discuss a full-scale hybrid feature selection method, which performs filtering and wrapping to select the genes that contribute most to the classification of genome data. Overfitting in the feature selection process is a challenging issue.
Using the full-scale hybrid feature selection approach ensures that we do not introduce overfitting into our classification model. We achieve this by also taking into consideration those features that did not give maximum classification performance, while assigning them a lower weight in the feature selection process. To evaluate our model, we used multiple experimental setups to tune the classification and feature selection process. To assess the performance of the proposed algorithm, we evaluated it in terms of classification and clustering on publicly available, well-studied genome data, and we find that it gives better performance in most cases.
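
To make the idea of PCA on a shrinkage covariance estimate concrete, the following is a minimal sketch under stated assumptions, not the estimator used in this study. It assumes a scaled-identity shrinkage target and a fixed, hand-picked shrinkage intensity `lam`; the function names `shrinkage_covariance` and `pca_shrinkage` are hypothetical, and the sketch does not handle outliers or missing values.

```python
import numpy as np


def shrinkage_covariance(X, lam=0.2):
    """Shrink the sample covariance toward a scaled identity matrix.

    Assumption: fixed intensity `lam` in (0, 1] and target mu * I, where
    mu is the average sample variance. Practical estimators usually
    choose the intensity from the data.
    """
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    Xc = X - X.mean(axis=0)                # center each attribute
    S = Xc.T @ Xc / (n - 1)                # sample covariance (singular if p >= n)
    mu = np.trace(S) / p                   # average variance
    return (1.0 - lam) * S + lam * mu * np.eye(p)


def pca_shrinkage(X, n_components=2, lam=0.2):
    """Project X onto the leading eigenvectors of the shrunk covariance."""
    X = np.asarray(X, dtype=float)
    cov = shrinkage_covariance(X, lam)
    eigvals, eigvecs = np.linalg.eigh(cov)           # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_components] # indices of the largest ones
    components = eigvecs[:, order]
    scores = (X - X.mean(axis=0)) @ components
    return scores, eigvals[order]


# Synthetic p >> n example: 10 samples, 50 attributes.
rng = np.random.default_rng(0)
X = rng.normal(size=(10, 50))
scores, top_eigvals = pca_shrinkage(X, n_components=2, lam=0.2)
```

With this target, every eigenvalue of the shrunk matrix is at least `lam * mu`, so the estimate stays positive definite even when the sample covariance is rank deficient. The two columns of `scores` can then be plotted to visualize the samples.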