Development and Application of Bioinformatic Approaches for Foodborne Pathogen Detection, Subtyping and Genomic Epidemiology Investigation

Objective

The proposed research aims to explore the application of high-throughput sequencing to various aspects of the public health microbiology of major foodborne pathogens, especially Salmonella enterica. Specific objectives are:
1. Develop a culture-independent, metagenomics-inspired solution for detecting and characterizing Salmonella in food samples to the serotype level and beyond.
2. Develop bioinformatics tools for high-throughput sequencing-based pathogen subtyping and characterization.
3. Build robust genealogies and dissect population structures for major Salmonella serotypes.
4. Probe the emergence, spread and establishment of selected lineages of major Salmonella serotypes and explore how short-term evolution may signal or affect such events.

More information

Objective 1

High-efficiency immunomagnetic separation (IMS). We will implement and validate a newly developed centrifugal microfluidic system, designed by Dr. Peter Hesketh at the Georgia Institute of Technology, that allows highly effective capture of bacterial cells via efficient mixing of magnetic beads through the food sample.

We will use 1) freshly harvested cantaloupes from a local farm, without any processing, to represent the farm and harvest phases of application with high levels of naturally occurring microflora; and 2) pre-packaged, pre-washed Romaine lettuce from a retail source to represent the retail and consumer phases with low levels of microflora. Both items have been implicated in recent Salmonella outbreaks and are subjects of our previous research. Salmonella cells will be inoculated on the surface of the fresh produce. We will apply the CD microfluidics to capture and separate Salmonella from produce rinse liquids.

The performance of our system will also be compared to that of traditional IMS.

Multiple displacement amplification (MDA). Following IMS, we will use MDA to amplify the often trace amounts of target pathogen DNA to facilitate subsequent genome sequencing. MDA has been shown to support whole genome sequencing (WGS) from a single E. coli cell (11), and the combined use of IMS and MDA has led to WGS of a difficult-to-culture pathogen from clinical samples (12). Commercially available MDA kits will be evaluated along with the modified IMS system to investigate how efficiently their combined use generates Salmonella DNA for subsequent sequencing.

High-throughput sequencing and bioinformatics analysis. Following IMS and MDA, the DNA sample will be sequenced on an Illumina MiSeq instrument. The raw sequencing reads will be analyzed to determine serotypes and other subtypes (see Objective 2).

Validation.
The validation of the entire workflow will be performed with 25 g aliquots of fresh produce (cantaloupe rind and lettuce leaves) at 3 spike levels of inoculum (i.e., fractional, 1 log higher and blank).

Objective 2

High-throughput sequencing-based determination of Salmonella serotype. We will develop a bioinformatics solution (termed "SeqSero") for high-throughput sequencing-based serotyping. The pipeline will be designed to accept both genome assemblies and sequencing reads as input through a web interface. These data will be processed by a pipeline implemented on a cloud server to determine the O and H antigens by comparison against individual curated databases of alleles of the genes responsible for serotype. The serotype will then be called according to the Kauffmann-White scheme. Databases of serotype determinants will be periodically updated to include novel sequences.

We will build individual databases for the three genetic determinants of Salmonella serotype: the two flagellin structural genes fliC and fljB (encoding H antigens) and the rfb gene cluster (encoding genes responsible for the O antigen). We will try to include all the currently available sequences from our previous studies (13, 14), the literature (15) and GenBank. All the databases will be periodically updated with new sequences.

Target sequences will be extracted by locating conserved sequences bordering the rfb region, or by in silico PCR to amplify fliC and fljB using primers flanking variable regions within the genes. Target sequences will be compared to the serotype determinant databases through BLAST (16).

Input reads will be directly mapped to the sequences in each database using BWA (17) based on sequence similarity.

Closely related fliC and fljB alleles that remain indistinguishable after read mapping will be subject to finer-scale differentiation targeting signature SNPs or indels.

High-throughput sequencing-based identification and subtyping of Shiga toxin-producing E. coli (STEC).
Using a similar approach, we will develop another pipeline that allows: 1) O and H antigen determination for major STEC serotypes; 2) detection and subtype identification of major virulence factors including stx1, stx2, eae, espP and O island 122 (OI-122); and 3) seven-gene multilocus sequence typing (MLST) of E. coli with the full allele databases from http://www.mlst.net/.

Objective 3

Isolates. For WGS, we will select 100-150 isolates from major serotypes (Newport, Heidelberg, Infantis, Javiana, Saintpaul, Montevideo, Oranienburg, Thompson and Muenchen) to represent the aforementioned diversities according to our current knowledge.

Genome-wide detection of single nucleotide polymorphisms (SNPs). Streamlined SNP detection will be performed using our published method (18) by mapping sequencing reads to fully assembled reference genomes.

Phylogenetic analyses. As we previously described (18), recombination events and highly homoplastic sites indicative of non-neutral evolution, horizontal gene transfer, or ambiguous SNP calls will be detected and excluded from phylogenetic reconstruction. Maximum-likelihood (ML) trees based on the remaining core genome SNPs will be built and used to test for a temporal signal based on the isolation year of each strain.

Objective 4

Age dating of individual lineages. Bayesian phylogenetic analyses will be performed using the latest version of BEAST (7) to establish a temporal framework for reconstructing phylogenetic relationships among the isolates and to estimate parameters describing the evolutionary dynamics of the populations, as we previously described (18).

Geographical distribution and transmission. Lineages that display clustering of isolates from a particular geographical location will be identified. Serotypes featuring geographically structured populations will be subject to phylogeographical analysis implemented in BEAST (7). In a genealogy (phylogenetic tree), every branch will be assigned a geographical source.
Together with the estimated age of each internal node, this will resolve major geographical transmissions by inferring the time of their occurrence and direction of movement.

Lineages associated with specific ecological niches or environments (e.g. animal hosts) will be identified. Clustered regularly interspaced short palindromic repeats (CRISPRs) will be extracted from these lineages to study their potential utility as environmental markers. CRISPRs, which are commonly found in bacteria and archaea, carry spacers that originate from phages and plasmids and may therefore bear the ecological signature of a particular habitat. Our recent study (manuscript under review) shows that CRISPRs alone could resolve the major lineages of SE, including the ones with different ecological backgrounds.

Population dynamics. Temporal changes in effective population sizes and fluctuations in the numbers of lineages over time will be modeled to study the general population dynamics of major serotypes in recent history.

In-depth evolutionary and comparative genomics studies. Based on the aforementioned analyses, we will select 3-5 serotypes whose sampled populations show interesting features such as emerging lineages or sublineages, dynamic geographical dispersion, potential niche adaptation, and rapid expansion or diversification. We will assemble and annotate their genomes; compare gene content between lineages; and extend the evolutionary analyses to pan-genomes by including accessory genes (in contrast with core genes, which are shared by every member of a population). Accessory genes afford selective advantages (e.g. antimicrobial resistance and virulence factors) and account for most of the genetic diversity within recently emerged pathogens (19).
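
The contrast between core and accessory genes described above can be illustrated with a minimal Python sketch. It assumes each annotated genome has already been reduced to a set of gene-family identifiers. The function name, isolate labels and gene identifiers are hypothetical placeholders, not part of the proposed pipeline, and a real analysis would first need to cluster orthologous genes.

```python
from typing import Dict, Set, Tuple


def partition_pan_genome(
    gene_content: Dict[str, Set[str]]
) -> Tuple[Set[str], Set[str]]:
    """Split a pan-genome into core and accessory gene sets.

    gene_content maps a genome label to the set of gene-family
    identifiers found in that genome (hypothetical input format).
    """
    if not gene_content:
        return set(), set()
    gene_sets = list(gene_content.values())
    # Pan-genome: every gene seen in at least one genome.
    pan = set().union(*gene_sets)
    # Core genome: genes shared by every genome in the population.
    core = set.intersection(*gene_sets)
    # Accessory genome: genes present in some but not all genomes.
    accessory = pan - core
    return core, accessory


# Placeholder data: three isolates and four gene families.
example = {
    "isolate_A": {"gene1", "gene2", "gene3"},
    "isolate_B": {"gene1", "gene2", "gene4"},
    "isolate_C": {"gene1", "gene2"},
}
core, accessory = partition_pan_genome(example)
print("core:", sorted(core))
print("accessory:", sorted(accessory))
```

Under this strict definition a gene counts as core only if it is present in every genome. A threshold (for example, presence in nearly all genomes) is a common relaxation when assemblies are incomplete, but that choice is an assumption beyond what the project description states.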

Investigators

Deng, Xi

Institution

University of Georgia

Start date

2015

End date

2020

Funding Source

Nat'l. Inst. of Food and Agriculture

Project number

GEO00744

Accession number

1006141