Abstract:
Tuberculosis (TB) has been the leading infectious disease killer globally since 2014, when it surpassed HIV. The pathogen Mycobacterium tuberculosis (Mtb) contains ~4,000 genes, which account for almost 90% of the genome. Many global comparative studies of Mtb whole-genome sequences have been conducted to elucidate the core, accessory, and strain-specific genome. The main aim of these studies was to characterize the generality and individuality of strains and their gene content, along with their evolutionary trends. However, a noticeable gap remains: detailed pangenome analysis has not been performed on several Asian strains, including those from Pakistan. Here we used a pangenomic analysis of 40 Mtb genomes to address this gap. The analysis was carried out in EDGAR, a web-based platform that has become one of the most established software tools in comparative genomics. The genomes were specifically selected from Asian strains and retrieved from the National Center for Biotechnology Information (NCBI) for variation and evolution studies. The core genome comprised 2809 genes (49.2%), the dispensable genome 2196 genes (38.5%), and the singleton genome 704 genes (12.8%). The translated CDS include membrane and repair proteins as well as conserved hypothetical proteins. We also identified strain-specific genes for the 40 Mtb strains by comparing them with Mtb H37Rv in EDGAR. We generated a pan vs. core development plot to show the evolutionary trend and variation history among the Mtb strains; the pan and core gene counts followed opposite trends. A phylogenetic tree was constructed using the multiple sequence alignment tool MUSCLE and the PHYLIP package built into EDGAR to infer intra-species evolutionary relationships and variation. Furthermore, we identified the common core virulence genes and the unique genes of the Pakistani strains.
To identify the common core virulence genes, genes were retrieved from Mtb H37Rv, the Virulence Factor Database (VFDB), and the Database of Essential Genes (DEG). These genes were loaded into the RAST server to determine the sequence similarity of the local strains to the reference Mtb H37Rv. EDGAR was used to find the strain-specific genes of the selected genomes. We identified 72 strain-specific genes for Mtb SWL PK and 100 for Mtb MNPK. The 40 Mtb strains were further functionally annotated with KofamKOALA to gain better insight into the biological, cellular, and metabolic processes involved in disease pathogenicity. This study suggests that variation in gene content can yield potential biomarkers for many sequenced Mtb strains from different locations.
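
The core, dispensable, and singleton categories and the pan vs. core development plot described above can be illustrated with a minimal Python sketch. It does not reproduce EDGAR's orthology computation. It assumes that ortholog groups have already been assigned, so that each strain is represented as a set of comparable gene identifiers. The function names (`partition_pangenome`, `pan_core_curve`) and the toy strains and genes are hypothetical and are not taken from the study.

```python
from typing import Dict, List, Optional, Set, Tuple


def partition_pangenome(genomes: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Split the pangenome into core, dispensable, and singleton genes.

    core        = present in every genome
    singleton   = present in exactly one genome
    dispensable = everything in between
    """
    n = len(genomes)
    if n < 2:
        raise ValueError("at least two genomes are needed")
    # Count in how many genomes each gene (ortholog group) occurs.
    counts: Dict[str, int] = {}
    for genes in genomes.values():
        for gene in genes:
            counts[gene] = counts.get(gene, 0) + 1
    core = {g for g, c in counts.items() if c == n}
    singleton = {g for g, c in counts.items() if c == 1}
    dispensable = set(counts) - core - singleton
    return {"core": core, "dispensable": dispensable, "singleton": singleton}


def pan_core_curve(
    genomes: Dict[str, Set[str]], order: List[str]
) -> List[Tuple[int, int]]:
    """Return (pan size, core size) after adding each genome in `order`.

    The pan genome is the running union and can only grow; the core
    genome is the running intersection and can only shrink.
    """
    pan: Set[str] = set()
    core: Optional[Set[str]] = None
    curve: List[Tuple[int, int]] = []
    for name in order:
        genes = genomes[name]
        pan |= genes
        core = set(genes) if core is None else core & genes
        curve.append((len(pan), len(core)))
    return curve


# Hypothetical toy data: three strains, six ortholog groups.
toy = {
    "strain_A": {"g1", "g2", "g3", "g4"},
    "strain_B": {"g1", "g2", "g3", "g5"},
    "strain_C": {"g1", "g2", "g6"},
}
parts = partition_pangenome(toy)
curve = pan_core_curve(toy, ["strain_A", "strain_B", "strain_C"])
total = sum(len(genes) for genes in parts.values())
for label, genes in parts.items():
    print(f"{label}: {len(genes)} genes ({100 * len(genes) / total:.1f}%)")
print(curve)
```

In this toy example the pan genome grows with each added strain while the core genome shrinks, which is the kind of opposite trend a pan vs. core development plot displays. The order in which genomes are added affects the intermediate points of the curve but not its final point.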