dc.description.abstract |
Since the late 1990s, advances in sequencing techniques have accelerated the
pace of the gradually evolving field of genomics. In 1995, the genome of a living
organism was sequenced for the first time. This showed that examining the overall
potential of an organism's genome can extend the possibilities for identifying
a gene or a combination of genes. Together with advances in computing
technologies, this idea led to the development of advanced next-generation
computational sequencing techniques and made it possible to sequence the
genomes and proteomes of commercially significant organisms and pathogens in an
optimized time frame and at minimal cost. In this thesis, we present the
implementation of the Pan-genome and Comparative Genome analysis Pipeline tool
using high performance parallel computing techniques. The aim of developing this
high performance and scalable pipeline is to reduce the time cost of calculating
the pan-genomes from the given dataset of protein sequences of bacterial strains.
The pipeline can also perform pan-genome analysis on unpublished datasets.
The Pan-genome and Comparative Genome analysis Pipeline (PanCGP) uses a
divide-and-conquer approach to parallelize the whole analysis process. A data
decomposition technique breaks the problem into smaller chunks, which are
processed separately on the available processing resources. Since each data
chunk is processed independently, this technique eliminates communication
overhead during parallel processing and makes the pipeline embarrassingly
parallel. PanCGP is able to scale on shared memory architectures and distributed
memory architectures, as well as on hybrid architectures, seamlessly. The
scalability of the pipeline depends on the size of the input dataset and on the
number of available computing resources. MPJ Express has been used to run the
pipeline on both shared and distributed memory architectures. The modular nature
of PanCGP makes it highly customizable and easy to extend with additional
functional modules. A dataset of 38 strains of Helicobacter pylori has
been given as input to both PanCGP and the CMG-biotools pan-genome analysis
pipeline. The resulting time cost of PanCGP is much lower (roughly one sixth)
than that of the CMG-biotools benchmarks. When PanCGP incorporates the same
tools and tool versions, the best result accuracy is achieved. The accuracy of
results may vary depending on the versions of the tools incorporated. The software
package can be downloaded from https://github.com/TechnologyCell/PanCGP
and https://sourceforge.net/projects/pancgp/. |
en_US |