PanCGP: Pangenome and Comparative Genome analysis Pipeline

Muhammad Ahsan

dc.contributor.author	Muhammad Ahsan
dc.date.accessioned	2021-12-04T12:59:12Z
dc.date.available	2021-12-04T12:59:12Z
dc.date.issued	2016
dc.identifier.uri	http://10.250.8.41:8080/xmlui/handle/123456789/27867
dc.description.abstract	Since the late 1990s, advances in sequencing techniques have accelerated the pace of the gradually evolving field of genomics. In 1995, the genome of a living organism was sequenced for the first time. This showed that examining the overall potential of an organism's genome can extend the possibilities for identifying a gene or a combination of genes. Together with advances in computing technologies, this idea led to the development of next-generation computational sequencing techniques and made it possible to obtain the genome and proteome sequences of commercially significant organisms and pathogens in an optimized time frame and at minimal cost. In this thesis, we present the implementation of the Pan-genome and Comparative Genome analysis Pipeline (PanCGP), a tool built with high-performance parallel computing techniques. The aim of this high-performance, scalable pipeline is to reduce the time needed to compute pan-genomes from a given dataset of protein sequences of bacterial strains. The pipeline can also run pan-genome analysis on unpublished datasets. PanCGP uses a divide-and-conquer approach to parallelize the whole analysis process. Data decomposition breaks the problem into smaller chunks, which are processed separately over the available processing resources. Because each data chunk is processed independently, this technique eliminates communication overhead during parallel processing and makes the pipeline embarrassingly parallel. PanCGP scales seamlessly on shared-memory, distributed-memory, and hybrid architectures. The scalability of the pipeline depends on the size of the input dataset and on the number of available computing resources. 
MPJ Express is used to run the pipeline seamlessly on both shared- and distributed-memory architectures. The modular nature of PanCGP makes it highly customizable and easy to extend with additional functional modules. A dataset of 38 strains of Helicobacter pylori was given as input to both PanCGP and the CMG-biotools pan-genome analysis pipeline. The resulting run time of PanCGP is much lower (roughly one sixth) than that of the CMG-biotools benchmarks. When the same tools and tool versions are incorporated in PanCGP, the best result accuracy is achieved. The accuracy of results may vary depending on the versions of the incorporated tools. The software package can be downloaded from https://github.com/TechnologyCell/PanCGP and https://sourceforge.net/projects/pancgp/.	en_US
dc.publisher	RCMS, National University of Sciences and Technology	en_US
dc.subject	PanCGP: Pangenome and Comparative Genome analysis Pipeline	en_US
dc.title	PanCGP: Pangenome and Comparative Genome analysis Pipeline	en_US
dc.type	Thesis	en_US
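
The data decomposition described in the abstract can be illustrated with a minimal sketch. This is not PanCGP's code: the thesis uses MPJ Express, whereas this example uses Python's standard library on a single machine, and the names split_into_chunks, process_chunk, and run_decomposed are hypothetical. The per-chunk work (measuring sequence lengths) is only a placeholder for a real analysis step. The sketch aims to show why independent chunks need no communication between workers.

```python
from concurrent.futures import ProcessPoolExecutor


def split_into_chunks(items, n_chunks):
    """Split items into at most n_chunks contiguous, near-equal chunks."""
    n_chunks = max(1, min(n_chunks, len(items)))
    size, extra = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        # The first `extra` chunks take one additional item each.
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def process_chunk(chunk):
    """Placeholder per-chunk work (hypothetical): length of each sequence."""
    return [len(seq) for seq in chunk]


def run_decomposed(sequences, n_workers, executor_cls=ProcessPoolExecutor):
    """Process each chunk in its own worker and merge results in input order."""
    chunks = split_into_chunks(sequences, n_workers)
    with executor_cls(max_workers=n_workers) as pool:
        # Each chunk is handled in isolation; workers never exchange data,
        # which is what makes the job embarrassingly parallel.
        results = pool.map(process_chunk, chunks)
        return [value for part in results for value in part]


if __name__ == "__main__":
    # Made-up toy sequences, not data from the thesis.
    demo = ["MKT", "MSLLA", "MA", "MKKLLV", "M"]
    print(run_decomposed(demo, n_workers=2))
```

Because the only coordination points are the initial split and the final merge, adding workers does not add communication cost; the achievable speedup is then bounded by the number of chunks and by the slowest chunk.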