Computational Methods for Microbial and Microbiome Sequence Analysis

Objective

Project Summary: This project will support our work on computational methods for microbial sequence analysis, including gene finding, whole-genome alignment, genome assembly, and metagenomic sequence analysis. Over the years we have developed multiple systems to solve problems in these areas, some of which are very widely used. These tools need continued updates and improvements to keep pace with changes in sequencing technology, changes in experimental design, and the ever-growing number of sequenced genomes. One of these systems is Glimmer, a computational method for finding genes in bacteria, viruses, archaea, and simple eukaryotes. Glimmer is highly accurate, finding over 99% of the genes in most prokaryotic genomes. It has been used by thousands of scientists around the world and in the majority of published bacterial genome sequencing projects over the past decade. Collectively, the three main publications describing Glimmer have been cited over 4,700 times, including >700 citations in 2016-17 alone. Usage of Glimmer has increased in recent years due to the explosion in next-generation sequencing projects, which are particularly cost-effective for bacterial genomes. A second system, MUMmer, is an efficient whole-genome aligner that is used to compare genomes to one another and to compare genome assemblies to detect changes, both large and small. MUMmer and its components, especially Nucmer, have been widely used and incorporated in other systems, including multi-genome aligners and several genome assembly packages. The three main publications describing MUMmer have been cited over 3,600 times, including >750 citations in 2016-17. In recent years we have focused our efforts on developing methods for the analysis of metagenomics data, producing several newer tools, including Kraken and Centrifuge. Both of these systems attempt to assign a species identifier to every read in a metagenomics data set.
Because the Kraken algorithm is not only accurate but far faster than earlier methods, it was rapidly adopted by many labs soon after its release, and its usage continues to grow. The even newer and more space-efficient Centrifuge system has also been highly successful and was recently incorporated into the analysis package of one of the new third-generation sequencing companies. We continue to work on improving the performance of both algorithms, and this project will allow us to extend them to handle the newest long-read data that is increasingly being used for metagenomics experiments. Finally, a new direction of the lab is the use of metagenomic shotgun sequencing to diagnose infections, for which we are not only modifying our algorithms, but also building customized genome databases where we rigorously screen the genomes to identify and remove contaminants and low-complexity sequences that create false positives. As we have done for many years, we will release all of the software and data generated by this project for free under an open source license, allowing other scientists to use, modify, and redistribute them without restrictions of any kind.

Investigators

Steven Salzberg

Institution

Johns Hopkins University

Start date

2019

End date

2024

Funding Source

National Institute of General Medical Sciences

Project number

1R35GM130151-01

Accession number

130151
