Abstract:
Genomic sequence analysis is one of the most active topics in bioinformatics research. Gene
finding (exon-intron classification) in eukaryotes has received considerable attention
from analysts worldwide. Many tools and methods have been used and
proposed to date, based on a number of underlying techniques such as Markovian models,
statistical analysis tools, and digital signal processing methods. In order to be processed
by digital signal processing (DSP) techniques, a symbolic DNA sequence must first be converted to
a numerical sequence using a mapping that should preferably preserve all
properties of the DNA sequence and/or the nucleotides. Towards this end, we present
a comprehensive investigation of DNA symbolic-to-numeric representations developed
thus far. These mapping schemes are partitioned based on the logic of assigning values
to the nucleotides. We compare and discuss their strengths and weaknesses in the context of
a discrete Fourier transform-based gene and exon prediction technique known as the spectral
content (SC) measure. We report evaluation results obtained using standard measures
such as sensitivity, specificity, and receiver operating characteristic (ROC) curves when
tested on standard datasets (e.g., Burset and Guigo(96), HMR195, and Guigo2000). We
also report, for the first time, an SC measure-based analysis of DNA walks.
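
As a rough illustration of the pipeline outlined above, the following Python sketch maps a DNA string to four binary indicator sequences and evaluates the total DFT power at frequency k = N/3, where a period-3 peak is commonly associated with coding regions. The binary indicator mapping is an assumption made for this example; the abstract does not single out any one representation. All function names are hypothetical, and the window handling is deliberately simplified.

```python
import cmath


def indicator_sequences(dna):
    """Map a DNA string to four binary indicator sequences, one per nucleotide.

    Assumption: a binary indicator mapping is used purely for illustration.
    """
    dna = dna.upper()
    return {base: [1.0 if c == base else 0.0 for c in dna] for base in "ACGT"}


def dft_coefficient(x, k):
    """Return the k-th DFT coefficient of the real sequence x (direct sum)."""
    n = len(x)
    return sum(v * cmath.exp(-2j * cmath.pi * k * i / n) for i, v in enumerate(x))


def spectral_content(dna, k):
    """Sum the squared DFT magnitudes at frequency k over the four indicators."""
    seqs = indicator_sequences(dna)
    return sum(abs(dft_coefficient(x, k)) ** 2 for x in seqs.values())


def period3_score(window):
    """Spectral content at k = N/3 for a window whose length is a multiple of 3."""
    n = len(window)
    if n == 0 or n % 3 != 0:
        raise ValueError("window length must be a positive multiple of 3")
    return spectral_content(window, n // 3)
```

In practice such a score would be computed over a sliding window along the sequence and thresholded to flag candidate exons; swapping `indicator_sequences` for another symbolic-to-numeric mapping is the only change needed to compare representations under the same measure.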