NUST Institutional Repository

PanCGP: Pangenome and Comparative Genome analysis Pipeline

dc.contributor.author Muhammad Ahsan
dc.date.accessioned 2021-12-04T12:59:12Z
dc.date.available 2021-12-04T12:59:12Z
dc.date.issued 2016
dc.identifier.uri http://10.250.8.41:8080/xmlui/handle/123456789/27867
dc.description.abstract Since the late 1990s, advances in sequencing techniques have quickened the pace of the gradually evolving field of genomics. In 1995, the genome of a living organism was sequenced for the first time. This showed that examining the overall potential of an organism's genome can extend the possibilities for identifying a gene or a combination of genes. Together with advances in computing technologies, this idea led to the development of advanced next-generation computational sequencing techniques and made it possible to obtain the genome/proteome sequences of commercially significant organisms and pathogens in an optimized time frame and at minimal cost. In this thesis, we present the implementation of the Pan-genome and Comparative Genome analysis Pipeline tool using high-performance parallel computing techniques. The aim of developing this high-performance, scalable pipeline is to reduce the time cost of calculating pan-genomes from a given dataset of protein sequences of bacterial strains. The pipeline can also perform pan-genome analysis on unpublished datasets. The Pan-genome and Comparative Genome analysis Pipeline (PanCGP) uses a divide-and-conquer approach to parallelize the whole analysis process. A data decomposition technique breaks the problem into smaller chunks and processes them separately over the available processing resources. Since each data chunk is processed separately, this technique eliminates communication overhead during parallel processing and makes the pipeline embarrassingly parallel. PanCGP scales seamlessly on shared-memory, distributed-memory, and hybrid architectures. The scalability of the pipeline depends on the size of the input dataset as well as the number of available computing resources.
MPJ Express has been used to run the pipeline on shared- as well as distributed-memory architectures in a seamless manner. The modular nature of PanCGP makes it highly customizable and easy to extend with additional functional modules. A dataset of 38 strains of Helicobacter pylori was given as input to both PanCGP and the CMG-biotools pan-genome analysis pipeline. The resulting time cost of PanCGP is much lower (about one sixth) than that of the CMG-biotools benchmarks. When the same tools and tool versions are incorporated in the PanCGP pipeline, the best result accuracy is achieved. The accuracy of results may vary depending on the versions of the tools incorporated. The software package can be downloaded from https://github.com/TechnologyCell/PanCGP and https://sourceforge.net/projects/pancgp/. en_US
dc.publisher RCMS, National University of Sciences and Technology en_US
dc.subject PanCGP: Pangenome and Comparative Genome analysis Pipeline en_US
dc.title PanCGP: Pangenome and Comparative Genome analysis Pipeline en_US
dc.type Thesis en_US
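
The data decomposition idea described in the abstract can be illustrated with a minimal sketch. This is not PanCGP's code: the function names (`split_into_chunks`, `process_chunk`, `run_parallel`) and the per-chunk task (measuring sequence lengths) are hypothetical stand-ins, and Python threads are used here only for brevity where the thesis uses MPJ Express. The point it shows is that each chunk is processed without talking to any other chunk, so the only coordination is the final merge.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(items, n_chunks):
    """Split items into at most n_chunks contiguous, near-equal chunks."""
    n_chunks = max(1, min(n_chunks, len(items)))
    size, extra = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        # The first `extra` chunks take one additional item each.
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def process_chunk(chunk):
    """Hypothetical per-chunk task: map each sequence ID to its length.

    A real pipeline would run its analysis tools here; this placeholder
    only demonstrates that a chunk needs no data from other chunks.
    """
    return {seq_id: len(seq) for seq_id, seq in chunk}


def run_parallel(records, n_workers=4):
    """Process independent chunks concurrently, then merge the results."""
    chunks = split_into_chunks(records, n_workers)
    merged = {}
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # No communication between workers: each gets one chunk only.
        for partial in pool.map(process_chunk, chunks):
            merged.update(partial)
    return merged


if __name__ == "__main__":
    # Made-up (ID, protein sequence) pairs for illustration only.
    records = [("s1_p1", "MKT"), ("s1_p2", "MKTLL"), ("s2_p1", "MA")]
    print(run_parallel(records, n_workers=2))
```

Because the chunks share nothing, the same structure carries over to separate processes or cluster nodes by swapping the executor for another worker mechanism; only the split and merge steps touch the whole dataset.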


This item appears in the following Collection(s)

  • MS [234]
