- Data & Archives
- Policy & Conservation
- Education & Opportunities
- News & Events
You are here
Harvard Forest Data Archive
Metaproteomic Analysis of Sarracenia Purpurea Pitcher Fluid at Harvard Forest 2012-2017Related Publications
- hf295-01: analysis data files
- Lead: Amanda Northrop, Rachel Brooks, Aaron Ellison, Bryan Ballif, Nicholas Gotelli
- Contact: Information Manager
- Start date: 2012
- End date: 2017
- Status: completed
- Location: Tom Swamp Tract (Harvard Forest)
- Latitude: +42.50
- Longitude: -72.20
- Elevation: 240 meter
- Taxa: Sarracenia purpurea (northern pitcher plant)
- Release date: 2017
- EML file: knb-lter-hfr.295.4
- DOI: digital object identifier
- EDI: data package
- DataONE: data package
- Related links:
- Study type: short-term measurement, modeling
- Research topic: ecological informatics and modelling; watershed ecology
- LTER core area: organic matter, disturbance
- Keywords: bacteria, carnivorous plants, community composition, detritus, ecosystems, genetics, microbes, organic matter, thresholds
Aquatic ecosystem enrichment can lead to distinct and irreversible changes to undesirable states. Understanding changes in active microbial community function and composition following organic-matter loading in enriched ecosystems can help identify biomarkers of such state changes. In a field experiment, we enriched replicate aquatic ecosystems in the pitchers of the northern pitcher plant, Sarracenia purpurea. Shotgun metaproteomics using a custom metagenomic database identified proteins, molecular pathways, and contributing microbial taxa that differentiated control ecosystems from those that were enriched. The number of microbial taxa contributing to protein expression was comparable between treatments; however, taxonomic evenness was higher in controls. Functionally active bacterial composition differed significantly among treatments and was more divergent in control pitchers than enriched pitchers. Aerobic and facultative anaerobic bacteria contributed most to identified proteins in control and enriched ecosystems, respectively. The molecular pathways and contributing taxa in enriched pitcher ecosystems were similar to those found in larger enriched aquatic ecosystems and are consistent with microbial processes occurring at the base of detrital food webs. Detectable differences between protein profiles of enriched and control ecosystems suggest that a time series of environmental proteomics data may identify protein biomarkers of impending state changes to enriched states.
The field experiment was conducted in Tom Swamp, a nutrient-poor fen located at the northern end of Harvard Pond (42.51 N, −72.21 W) at Harvard Forest, Worcester County, Massachusetts. Newly opened pitchers were identified and randomly assigned to an ambient control or detritus-enriched treatment (Appendix S1). Previous work by Peterson et al. (2008) using culture-independent methods revealed that newly opened pitchers are sterile and impermeable to bacteria, so we are reasonably sure that our experimental pitchers did not harbor diverse bacterial communities prior to the start of the experiment. Detritus-enriched pitchers received 1 mg ml-1 d-1 of oven-dried, finely ground wasps (Dolichovespula maculata) (Appendix S1), which have elemental ratios (C:N, 5.99:1, N:P:K, 10.7:1.75:1.01) similar to those of Sarracenia’s natural ant prey (C:N, 5.9:1; N:P:K, 12.1:1.52:0.93) (Farnsworth and Ellison 2008). Proteomic analysis of the ground wasp (not reported here) failed to identify microbial proteins, so we are confident that microbial contribution to enriched pitchers from the wasps was minimal. Enrichment treatments were applied for 14 consecutive days; all pitchers were otherwise unmanipulated. Pitcher fluid was sampled on the first and last days of the experiment, filtered to remove microbes greater than 30 μm, pelleted, and stored at −80 °C until processed.
Protein Extraction, SDS-Page, and Mass Spectrometry
Six of ten replicate microbial pellets from each treatment yielded enough protein for analysis via tandem mass spectrometry. All replicates were analyzed separately using SDS-PAGE and Coomassie staining (Fig. 1, Appendix 1: Fig. S1a, and Appendix 1: Fig. S1b). All six of the enriched pitchers and five of the six control pitchers had visible protein staining levels and were chosen for mass spectrometry. Proteins were subjected to a tryptic digest (Appendix S1) and to LC-MS/MS as previously described (Cheerathodi and Ballif 2011) using a linear ion trap mass spectrometer (Thermo Electron, Waltham, MA, USA). MS/MS spectra were matched to peptides in a custom protein database using SEQUEST software as described below.
Custom Metagenomic Databases
We generated a custom protein database from a six-frame forward and reverse translation of a metagenomic database constructed from microbial communities of three previously collected pitchers that had captured diverse amounts of prey. Pitchers were collected from Molly Bog, an ombrotrophic bog located in Morristown, VT (44.50 N, -72.64 W) on 18 August 2008 and transported in a cooler directly from the field to the University of Vermont. Microbial pellets were obtained immediately as described above. DNA was extracted, prepared, and sent for library construction, sequencing, and assembly to Genome Quebéc (Montréal, QC, Canada) with the 454 GS-FLX Titanium Sequencing System (Roche). Contigs were assembled de-novo with Roche’s Newbler assembler v2.3 (release 091027_1459) using default parameters (minimum Read Length = 20; overlap Seed Step = 12; overlap Seed Length =16; overlap Min Seed Count = 1; overlap Seed Hit Limit = 70; overlap Min Match Length = 40; overlap Min Match Identity = 90; overlap Match Ident Score = 2; overlap Match Diff Score = -3; overlap Match Unique Thresh = 12; map Min Contig Depth = 1; all Contig Thresh = 100), with the exception of minimum read length (20 bp) and overlap Hit Position Limit (1,000,000). The assembled contigs were imported into MG-RAST 4.0.2 to assess functional and taxonomic potential (Meyer et al. 2008). Taxonomic assignments were visualized using the Krona plugin and the following cutoffs were applied to both taxonomic and subsystem functional category assignments: minimum identity = 60%, e-value of 1 x 10-5 or less, and a minimum alignment length of 15 bp (Appendix 1: Fig. S3). We calculated Hurlbert’s probability of an interspecific encounter (PIE) to estimate the evenness of bacterial classes in the metagenome (Hurlbert 1971). KEGG pathways (level 2 and level 3) were assigned to contigs using the KEGG database via MG-RAST.
A metaproteomic database was created with a six-frame forward and reverse translation of the assembled metagenome using open-source Ruby software. Sequences with greater than 100 amino acids (n=184,128) in length were retained. A decoy database was constructed by reversing the retained sequences and concatenating them to the forward database to allow for an estimation of the false discovery rate as has been described (Elias and Gygi 2007).
Protein Orthologue Identification
Peptide and protein identifications were made via a SEQUEST search of the tandem mass spectral data against the custom pitcher-plant microbial community protein database described above. The number of protein hits varied substantially among replicates, so to have enough proteins for treatment comparisons, peptides and proteins from the five control samples and six enriched samples were pooled after LC-MS/MS and the SEQUEST search into a single control and a single enriched sample dataset. The doubly- and triply-charged peptide ions were further considered and each dataset was filtered by first adjusting the cutoffs for XCorr and ΔCn until the false discovery rate was less than 10%. The final filters were: Xcorr ≥ 3.0 for doubly-charged ions, Xcorr ≥ 3.3 for triply-charged ions and unique Δcorr ≥ 0.15. The resulting list of protein hits for each treatment was then ranked by unique number of peptides and the top 220 proteins from each treatment were selected so that the false discovery rate for control and enriched treatments were 6.6% and 0%, respectively.
In the list of control peptides, a protein hit from the decoy database was represented by 25 total peptides; therefore, we suspected that this hit was a true positive not represented in our target database. However, a BLAST search of the full amino acid sequence did not yield an identical match, so we cannot definitively claim it is a true positive; therefore, we removed this peptide from our top 220 list of control peptides. With this peptide removed, the false discovery rate for the control treatment was 4.3%.
All peptide hits were pooled within treatments and mapped back to their source sequences in the custom protein database. Those source sequences were imported in fasta file format into blast2go v.2.8.0 for identification and annotation using the following configuration settings: blastp program, Blast Expect Value of 1.0E-3, 10 Blast Hits, Annotation CutOff greater than 55, GO Weight greater than 5.
Analysis of the Top Proteins Shared Between Treatments
A randomization test was done using R Studio (v. 0.98.1059) to test the hypothesis that there was a single common protein pool for both the control and enriched treatments and that the number of observed shared proteins between treatments reflects chance effects resulting from random draws from this single protein pool. We conducted an additional simulation in R to determine the likelihood of a Type I error in our randomization test.
Comparison of the Top 20 Proteins from Each Treatment
We downloaded the sequence by annotation file from the blast2go search for each treatment to get the protein names associated with each protein hit (sequence description in blast2go). Each of the top 220 identified proteins in each treatment, ordered by the number of total peptides associated with the protein hit, was matched to a protein name using R software. If multiple protein hits within a treatment matched a single protein name, the protein names were merged in silico and the total peptides representing them were summed. Protein names were ranked in order of the abundance of total peptides for each treatment.
To determine the taxonomic composition of the microbes contributing to identified proteins in our treatments, we conducted a BLAST homology search of the metagenomic sequence data for protein hits. All peptides from the top 220 identified proteins in each treatment were mapped back to their contigs of origin to obtain nucleotide sequences. Because contigs were at least 500 base pairs in length, we felt confident that a BLAST search of the nucleotide sequences would yield correct taxonomic identifications at course taxonomic levels and acknowledge that ambiguity can remain in the taxonomic identification from a metacommunity at genus and species levels. The top BLAST hit was retained for each nucleotide sequence associated with an identified protein and linked to a bacterial class. For each bacterial class identified, a 2×2 contingency table was created with treatments as columns and the number of peptides associated and not associated with the taxon as rows. A chi-square test was then used to determine if the abundance of the bacterial class was significantly different between treatments. All P values were adjusted using the Benjamini-Hochberg method (Benjamini and Hochberg 1995). Species composition was visualized using Krona (Ondov et al. 2011). In addition to the BLAST homology search, we used Unipept (Mesuere 2016) to map tryptic peptides to the UniprotKB database and retrieve the least common taxonomic ancestor (= most derived shared taxonomic node) associated with each peptide for pooled replicates. We calculated Hurlbert’s Probability of an Interspecific Encounter (PIE) to estimate the evenness of bacterial classes contributing to expressed proteins in control and enriched pitchers (Hurlbert 1971).
Functional pathways at two levels, associated with each identified protein from each treatment, were retrieved using the KEGG (Kyoto Encyclopedia of Genes and Genomes) (Kanehisa et al. 2014) mapping function of blast2go v.2.8.0. Each pathway was weighted by the total number of peptides associated with protein hits, or the number of spectral counts, mapping to that pathway. For each pathway identified, a 2×2 contingency table was created with treatments as columns and the number of peptides associated and not associated with the pathway as rows. A chi-square test was done to determine if each pathway was significantly over- or under-represented in enriched pitchers relative to controls. All P values were adjusted using the Benjamini-Hochberg method (Benjamini and Hochberg 1995).
To determine whether bacteria contributing to expressed proteins in control and enriched ecosystems differed in their O2 requirements, we mapped each bacterial species identified in our BLAST search to its O2 requirement using data from the Integrated Microbial Genomes database (IMG) (Timinskas et al. 2014, Reddy et al. 2015). The IMG database contains 6 classes of O2 requirements: aerobe, anaerobe, facultative, microaerophillic, obligate aerobe, and obligate anaerobe. The latter three categories make up less than 7% of the database. We merged any species classified as obligate aerobes or obligate anaerobes into the aerobe and anaerobe classes, respectively.
Analysis of Unpooled Data
In addition to analyzing pooled data, we used ordination and permutation analyses to determine the effect of enrichment on microbial community protein expression, taxonomic contribution to expressed proteins at the class and family levels, and KEGG pathways. We tested the similarity within and among replicates of control and enriched microbial communities using ADONIS, a nonparametric permutation test in the ‘vegan’ package (v. 2.4.1) in R (Oksanen et al. 2016). We used a multivariate homogeneity of group dispersions test (betadisper function in the ‘vegan’ package) to determine if the composition of contributing microbial taxa was more divergent in control replicates than in enriched replicates. The permutation tests used 999 permutations and were done using total peptide counts associated with protein identifications, microbial classes, microbial families, and KEGG pathways. To visualize the similarities among replicate ecosystems, we used the ‘vegan’ package function metaMDSto perform non-metric multidimensional scaling (NMDS) ordination using Bray-Curtis distances. Data were square-root transformed and standardized using Wisconsin double standardization. To determine which taxa contributed the most to Bray-Curtis dissimilarity of taxa contributing to protein expression between the treatments, we did a similarity percentages test using the simper function in the ‘vegan’ package.
Please see ReadMe.txt (in zipped package) for details on individual data files.
This dataset is released to the public under Creative Commons CC0 1.0 (No Rights Reserved). Please keep the dataset creators informed of any plans to use the dataset. Consultation with the original investigators is strongly encouraged. Publications and data products that make use of the dataset should include proper acknowledgement.
Northrop A, Brooks R, Ellison A, Ballif B, Gotelli N. 2017. Metaproteomic Analysis of Sarracenia Purpurea Pitcher Fluid at Harvard Forest 2012-2017. Harvard Forest Data Archive: HF295 (v.4). Environmental Data Initiative: https://doi.org/10.6073/pasta/366cbb1f4f25328cca1b2d9ad38dbb28.
hf295-01: analysis data files
- Compression: zip
- Format: fasta, html, Microsoft Excel, R Markdown, R script, text
- Type: document, script, table