Removing Blooming Bacterial Sequences from American Gut Data

By Amnon Amir, PhD

When someone in the United States submits their sample to the American Gut Project (AGP), the sample swabs are shipped back to the Knight lab via USPS mail. We observed that this sometimes leads to growth (blooming) of certain bacteria within the swab in some samples, thereby altering the composition of the microbial community. Evidence for this came when we noticed in several fecal samples from self-reported healthy individuals upwards of 80–90% Gammaproteobacteria (a bacterial class commonly associated with disease states such as IBD and usually at very low abundance in typical fecal samples).

To overcome the bloom problem, we decided to remove all sequences that we ascertained as belonging to blooming bacteria. The process we used is divided into two parts:

(1) identification of candidate blooming bacteria

(2) removal of bloom sequences from AGP data.

Although this filtering was performed on AGP fecal sample reads (16S rRNA V4), it can be applied to any other samples that may contain blooming bacteria. We recommend that groups perform bloom filtering on all samples that will be compared against the AGP data, even if the samples are stored under conditions which do not allow bacterial bloom. This avoids a filtering bias that could otherwise occur, because bacteria removed from the AGP dataset by the bloom filter may be present (albeit at lower levels) in samples from other studies. Our process is described in detail below:


(1) Identification of candidate bloom sequences:

In order to identify bacteria that grow during shipment, we used three criteria:

  1. Comparison of samples which were shipped fresh (i.e., AGP samples) to samples that were frozen immediately after collection (i.e., samples from the Personal Genome Project (PGP), the Mayo fecal stability study, and the Microbiome Quality Control project (MBQC)). We tested which bacterial sequences were differentially abundant in shipped fresh vs. immediately frozen fecal samples (with the underlying assumption that freezing preserves the actual state of the bacterial community, whereas in some shipment conditions certain bacteria will grow and therefore change their frequency). This comparison was done at the sequence level (100% identity rather than 97% OTUs) using a novel de-novo OTU picking algorithm (called deblurring, which is still under development). This approach enabled the identification of the exact 16S rRNA gene sequence associated with each blooming bacterium.
  2. Examining bacterial growth using the Mayo fecal stability study. In this study, subsamples of the same fecal material were either immediately frozen or left at room temperature for up to 4 days. By comparing bacterial abundance between these sample types, we identified bacteria that showed pronounced growth when left at room temperature for 4 days.
  3. Examining sequences that were identified using the previous American Gut bloom removal filter. This previous filter was based on the observation that typical (immediately frozen) human fecal samples have a low abundance of Gammaproteobacteria (usually <3%), whereas some American Gut samples (which were shipped via mail) contained a small number of very high-frequency Gammaproteobacteria. Based on this observation, the previous American Gut filtering removed dominant Gammaproteobacteria sequences which were above 3% in cumulative frequency in the AGP samples (for details see HERE). While this filter was based on some assumptions that were not directly validated (i.e., that blooming occurs only in Gammaproteobacteria and that the high-frequency Gammaproteobacteria are blooming bacteria), it performed relatively well: the American Gut samples containing the designated blooming bacteria did not resemble other AGP fecal samples on unweighted PCoA before filtering, but did resemble them after filtering.
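
The comparison in the first criterion can be illustrated with a minimal sketch. The source does not state which statistical test was used, so the log2 fold change of mean relative abundance below is an assumption chosen only for illustration, and the function names, the pseudocount, and the toy counts are all hypothetical.

```python
import math
from statistics import mean


def relative_abundance(counts):
    """Convert raw read counts per sequence into fractions of the sample."""
    total = sum(counts.values())
    return {seq: n / total for seq, n in counts.items()}


def log2_fold_change(shipped_samples, frozen_samples, pseudocount=1e-6):
    """Compare mean relative abundance of each sequence between two groups.

    Each sample is a dict mapping an exact sequence (or its label) to a
    read count. A positive value means the sequence is more abundant in
    the shipped samples, as expected for a candidate bloom sequence.
    The pseudocount (an assumption) avoids division by zero.
    """
    shipped = [relative_abundance(s) for s in shipped_samples]
    frozen = [relative_abundance(s) for s in frozen_samples]
    all_seqs = set().union(*shipped, *frozen)
    result = {}
    for seq in all_seqs:
        mean_shipped = mean(s.get(seq, 0.0) for s in shipped)
        mean_frozen = mean(s.get(seq, 0.0) for s in frozen)
        result[seq] = math.log2(
            (mean_shipped + pseudocount) / (mean_frozen + pseudocount)
        )
    return result


# Toy data: one shipped and one frozen sample (made-up counts).
shipped = [{"candidate_bloom": 80, "commensal": 20}]
frozen = [{"candidate_bloom": 1, "commensal": 99}]
changes = log2_fold_change(shipped, frozen)
```

A real analysis would also need a significance test and a correction for multiple comparisons; this sketch only ranks sequences by effect size.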

By combining these three criteria, we came up with a list of ten 150bp 16S rRNA V4 sequences belonging to bacteria that potentially grow during shipment. This list includes eight Gammaproteobacterial sequences and one sequence identified as Bacillus. Additionally, we identified one sequence classified as Pseudomonas, which is a local lab contaminant (it is present at high levels in blanks). Notably, Pseudomonas and several similar Gammaproteobacteria, including E. coli, were also identified as common lab contaminants (Tanner et al., AEM, 1998).

The list of the V4 150bp region of these bacteria can be found HERE.

A few notes:

  1. There is a balance between identifying more candidate bloom sequences (which leads to a cleaner dataset) and identifying fewer bloom sequences (which removes fewer sequences from the study). We opted to focus on the sequences that showed a consistent and non-negligible change during shipment, using a combination of the three criteria described above.
  2. Because our data is based on the 150bp V4 region, we can only positively identify the blooming bacteria by this sequence. Applying this list to other variable regions or longer reads requires the use of a database (e.g. Greengenes) for prediction, but this may be inaccurate.
  3. Identification of blooming bacteria was done only on fecal samples. Other sample types may have other bacteria and/or different growth conditions, which may lead to different bloom sequences.


(2) Removal of bloom sequences:

To correct for the effect of shipment, all sequences that are affected by the shipment need to be removed prior to further analysis. This includes removal of all sequences identical to the bloom sequences identified in step 1, as well as removal of read errors that can originate from the bloom sequences during the PCR/sequencing steps (the number of read error sequences may increase in frequency due to the increase in frequency of the bloom sequences). At a coverage of ~10k reads/sample, we can get non-negligible read errors up to a Hamming distance of 4; therefore all sequences with at least 97% similarity to the bloom sequences (on V4 150bp) need to be removed.
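
The link between a Hamming distance of 4 and the 97% threshold is simple arithmetic on a 150bp read: 4 mismatches leave 146/150 (about 97.3%) identity, while 5 mismatches leave 145/150 (about 96.7%). The short sketch below checks this; the constant and function names are illustrative and not part of the original pipeline.

```python
READ_LENGTH = 150       # V4 read length used in the text
MIN_SIMILARITY = 0.97   # similarity threshold used in the text


def similarity_at_distance(distance, length=READ_LENGTH):
    """Fraction of identical positions for a given number of mismatches."""
    return (length - distance) / length


def max_mismatches(length=READ_LENGTH, min_similarity=MIN_SIMILARITY):
    """Largest Hamming distance that still meets the similarity threshold."""
    distance = 0
    while (distance < length
           and similarity_at_distance(distance + 1, length) >= min_similarity):
        distance += 1
    return distance


# Show the similarity for 0 to 5 mismatches on a 150bp read.
for d in range(6):
    print(d, round(similarity_at_distance(d), 4))
```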

Prior to OTU picking, the demultiplexed sequence data are scanned and all reads with at least 97% similarity to one of the bloom sequences are discarded. For the sequence similarity detection, we used SortMeRNA 2.0, an open-source (LGPL) pairwise sequence alignment tool with OTU assignment extensions developed in the lab (to be included in the coming QIIME release 1.9.0). The remaining reads, which are < 97% similar to the bloom sequences, are used for the downstream analysis.
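
As a rough illustration of the filtering logic only, the sketch below drops reads that are at least 97% identical to any bloom sequence. It is not the SortMeRNA-based procedure described above: it assumes all reads and bloom sequences have already been trimmed to the same length, compares them position by position (so substitutions are counted but insertions and deletions are not handled), and uses hypothetical function names and placeholder sequences.

```python
def identity(seq_a, seq_b):
    """Fraction of matching positions between two equal-length sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    matches = sum(a == b for a, b in zip(seq_a, seq_b))
    return matches / len(seq_a)


def remove_bloom_reads(reads, bloom_seqs, min_similarity=0.97):
    """Return the (read_id, sequence) pairs that do not match any bloom sequence.

    A read is discarded when its identity to at least one bloom sequence
    is greater than or equal to min_similarity.
    """
    kept = []
    for read_id, seq in reads:
        is_bloom = any(identity(seq, bloom) >= min_similarity
                       for bloom in bloom_seqs)
        if not is_bloom:
            kept.append((read_id, seq))
    return kept


# Placeholder 150bp sequence standing in for a real bloom sequence.
bloom = "ACGT" * 37 + "AC"
near_bloom = "TTTT" + bloom[4:]   # 3 mismatches: treated as a read error
unrelated = "T" * 150             # far below the threshold
reads = [("r1", bloom), ("r2", near_bloom), ("r3", unrelated)]
kept = remove_bloom_reads(reads, [bloom])
```

For real data, an alignment-based tool is the safer choice, since sequencing errors can include indels that shift every downstream position.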

In a typical American Gut run, this results in approximately 25% of reads being filtered out of fecal samples (compared to only ~10% in the PGP immediately frozen fecal sample set). As can be seen in the PCoA plot, using this filtering approach leads to lower study-based clustering among the American Gut and PGP samples (B) compared to unfiltered reads (A), as well as lower variability expressed in PC1 and PC2.



Amnon Amir, PhD is a post-doc in the Knight lab. His interests include the metabolic activity of oral bacteria through time, as well as improving methodologies for analyzing and visualizing microbiome data.