Data Archive

HF262

Soil Bacteria and Archaea in Macrosystems Biodiversity Project at Harvard Forest 2012

Data

hf262-01: hf16s rRNA bacteria archaea

Overview

Lead: Jizhong Zhou, Robert Waide, James Brown
Investigators: Ye Deng
Contact: Information Manager
Start date: 2012
End date: 2012
Status: complete
Location: Harvard Forest
Latitude: +42.53780 to +42.54054 degrees
Longitude: -72.17899 to -72.17329 degrees
Elevation: 352 to 363 meters
Datum: WGS84
Taxa:
Release date: 2023
Language: English
EML file: knb-lter-hfr.262.4
DOI: digital object identifier
EDI: data package
DataONE: data package
Related links:

Study type: short-term measurements
Research topic: biodiversity studies; international research projects; regional studies
LTER core area: population studies, organic matter movement
Keywords: abundance, bacteria, biodiversity, genetics, microbes, soil, soil organic matter
Abstract:
Patterns of biodiversity, such as the increase toward the tropics and the peaked curve during ecological succession, are fundamental phenomena for ecology. Such patterns have multiple, interacting causes, but temperature emerges as a dominant factor across organisms from microbes to trees and mammals, and across terrestrial, marine, and freshwater environments. However, there is little consensus on the underlying mechanisms, even as global temperatures increase and the need to predict their effects becomes more pressing.

The purpose of this project is to generate and test theory for how temperature impacts biodiversity through its effect on biochemical processes and metabolic rate. A combination of standardized surveys in the field and controlled experiments in the field and laboratory measure diversity of three taxa -- trees, invertebrates, and microbes -- and key biogeochemical processes of decomposition in seven forests distributed along a geographic gradient of increasing temperature from cold temperate to warm tropical.

This field experiment focused on soil microbes. DNA was extracted and purified from soil cores from an array of 21 1-m2 subplots. The V4 region of the 16S rRNA genes of bacteria and archaea was amplified and sequenced using Illumina MiSeq by the University of Oklahoma Institute for Environmental Genomics as part of a macrosystems biodiversity and latitude project supported by the National Science Foundation under Cooperative Agreement DEB#1065836.
Methods:
Overview

A nested sampling design was implemented to survey the background pools of regional taxonomic diversity at six forest sites across America along a latitudinal gradient of increasing temperature: Niwot, Andrews, Harvard, Coweeta, Luquillo, and Barro Colorado Island. At each site, we first located a central subplot and then laid out 1-m2 subplots in four directions at distances of 1 m, 10 m, 50 m, 100 m, and 200 m from the central subplot. In each 1-m2 subplot, 9 soil cores were collected and pooled to form one soil sample. This design yielded 21 samples per site, for a total of 126 samples.
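
The sample count follows from the layout: one central subplot plus five distances in each of four directions. A minimal Python sketch of that arithmetic is given here for illustration. The label pattern (hc, h1n, ..., h200w) is taken from the column names of the data table; the names subplot_labels, DISTANCES_M, DIRECTIONS, and SITES are hypothetical and not part of the original protocol.

```python
# Illustrative reconstruction of the nested layout (not the authors' code).
DISTANCES_M = [1, 10, 50, 100, 200]          # distances from the central subplot
DIRECTIONS = ["n", "e", "s", "w"]            # four directions
SITES = ["Niwot", "Andrews", "Harvard", "Coweeta",
         "Luquillo", "Barro Colorado Island"]


def subplot_labels(site_prefix="h"):
    """Return the central subplot label plus one label per distance and direction."""
    labels = [f"{site_prefix}c"]
    for dist in DISTANCES_M:
        for direction in DIRECTIONS:
            labels.append(f"{site_prefix}{dist}{direction}")
    return labels


labels = subplot_labels()                       # "h" is the Harvard Forest prefix
samples_per_site = len(labels)                  # 1 + 5 * 4
total_samples = samples_per_site * len(SITES)   # samples per site times six sites
print(samples_per_site, total_samples)
```

The labels produced for the default prefix correspond to the 21 sample columns documented under Detailed Metadata.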

To determine the biodiversity of microbial communities, three genes targeting different taxonomic groups with different taxonomic resolutions were sequenced. The first of these is the 16S rRNA gene (V3-V4 regions), used to determine the biodiversity of bacteria and archaea. All of the target genes were amplified and sequenced to great depth using Illumina MiSeq. An average of 64K sequence reads were obtained for the 16S gene. The number of OTUs (Operational Taxonomic Units) obtained varied considerably (1.6-5.8 times) depending on the sequence similarity threshold. This sequencing effort appears to be reasonably sufficient to estimate the diversity of the microbial communities examined, at least for the dominant populations, as indicated by rarefaction analysis: the rarefaction curves approached saturation at different cutoffs for the three target genes.
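
For readers unfamiliar with rarefaction, the sketch below computes the expected number of OTUs in a random subsample of a given depth using the analytical (hypergeometric) rarefaction formula. The source does not state which rarefaction method or software was used, so this is an illustration of the general idea only; the function name rarefied_richness and the toy counts are invented.

```python
from math import comb


def rarefied_richness(counts, depth):
    """Expected number of OTUs observed when `depth` reads are drawn without replacement.

    counts: read counts per OTU in one sample.
    """
    total = sum(counts)
    if not 0 < depth <= total:
        raise ValueError("depth must be between 1 and the total read count")
    denom = comb(total, depth)
    # For each OTU: 1 minus the probability that none of its reads is drawn.
    return sum(1 - comb(total - c, depth) / denom for c in counts if c > 0)


# Toy sample with five OTUs and 100 reads (invented for illustration).
toy_counts = [50, 30, 15, 4, 1]
curve = {depth: rarefied_richness(toy_counts, depth) for depth in (1, 10, 50, 100)}
for depth, expected in curve.items():
    print(depth, round(expected, 2))
```

A curve that flattens as depth increases suggests that further sequencing would add few new OTUs; in this toy case the expected richness rises from about 1 at depth 1 to all 5 OTUs at the full depth of 100 reads.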

Amplicon Sequencing Analysis Protocols

1. Raw data. The raw data include three files: R1.fastq, R2.fastq, and I1.fastq. The first two are read files containing the actual sequences in FASTQ format, and the third is the index file, which contains the barcode sequence for each corresponding read in the first two files. For this example, the files were trimmed to include only 10000 entries each.

2. Upload. All the files need to be uploaded to our sequencing analysis server. Files larger than 2 GB need to be uploaded through an FTP server.

3. Remove PhiX sequences. The PhiX sequences are added to increase the diversity of the nucleotides in each position and enhance the sequencing accuracy. These sequences are removed here using BLAST against the PhiX genome sequences. (Output: Galaxy8-10)

4. Split sequences by barcode. A barcode list needs to be provided to extract the barcodes from the index file so that each sequence is assigned to the sample its barcode represents. After this step, the sample name (barcode name) is attached to every sequence ID, linked by “--”. (Output: Galaxy13, 14)

5. Remove primers (optional). This step is needed only if the primers are included in the reads, which depends on the sequencing strategy. If performed, the primers are removed from the beginning of the reads within a certain range. (Output: Galaxy16,17)

6. Join paired-end reads. A program called FLASH is used to join the forward and reverse reads. (Output: Galaxy20)

7. Quality trim. Quality trimming is carried out with Btrim. (Output: Galaxy24)

8. Extract FASTA from FASTQ. The FASTQ format contains both the sequences and their quality scores. For further analysis, only the sequences in FASTA format are needed, though users can still obtain the quality files here if they want. (Output: Galaxy26)

9. Remove sequences containing N. N is an undetermined base, which may indicate that the sequence after it is unreliable. Users can choose to remove such sequences completely or to trim them at the position of the N and keep the remaining part of the sequence. (Output: Galaxy27)

10. Filter sequences by length. Sequences that are too short (this usually doesn’t happen for Illumina sequences after joining paired-end reads) greatly affect the clustering results when generating OTUs and introduce errors. Sequences that are too long usually indicate sequencing errors. (Output: Galaxy29)

11. Remove potential chimeras using UCHIME. UCHIME is used to remove chimeras, which are commonly created during PCR amplification of DNA samples. The UCHIME algorithm has two modes: de novo and reference database. The de novo mode uses abundance information to detect chimeras, on the assumption that chimeras are less abundant than their parents because they must have undergone fewer rounds of amplification. We usually use the reference database mode to save computational time. The Greengenes 16S dataset is used as the reference for 16S sequencing analysis. To save more computational time, all identical sequences are removed from the dataset, and a redundancy map containing the IDs of all identical sequences is generated for future use. (Output: Galaxy32,33)

12. Correct frame shifts for protein-encoding sequences (optional). When the sequences are from protein-encoding genes, correct open reading frames (ORFs) must be ensured to allow accurate translation and alignment in later analysis. A program called FrameBot (developed by RDP, not published yet) is used to perform this process. A reference file of protein sequences needs to be provided, and frame shift-corrected protein and DNA sequences are generated.

13. Resample (optional). Resample the sequences so that each sample has the same number of sequences.

14. Pick OTUs. This step uses clustering methods to form OTUs (operational taxonomic units) based on sequence similarity. Users can currently choose from three methods: CD-HIT, UCLUST, and McClust. McClust is a complete linkage algorithm that provides more accurate clustering results, but it is very time- and space-consuming. For large sequence datasets, CD-HIT and UCLUST are recommended. For the millions of sequences generated by the Illumina platform, UCLUST should be chosen to do the clustering. The clusters generated in this step are treated as OTUs for further analysis. (Output: Galaxy35)

15. Generate OTU table. In the OTU table, rows represent OTUs and columns represent samples; each value is the number of sequences belonging to the corresponding OTU in the corresponding sample. The redundancy map from the UCHIME step is also needed to add back the identical sequences. OTU tables w/o singlets are generated, along with their corresponding representative sequences. (Output: Galaxy36-39)
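
Several of the steps above (9, 10, the removal of identical sequences in step 11, and step 15) are simple enough to sketch in plain Python. The code below is an illustration under stated assumptions, not the Galaxy pipeline itself: all function names are hypothetical, the sequence ID layout "<read>--<sample>" is assumed (the protocol only says the sample name is linked by "--"), trimming at N is assumed to keep the part before the N, and clustering is replaced by a placeholder in which every unique sequence is its own OTU.

```python
from collections import defaultdict


def handle_n(seq, mode="remove"):
    """Step 9: drop a read containing N, or keep only the part before the first N."""
    if "N" not in seq:
        return seq
    return None if mode == "remove" else seq[:seq.index("N")]


def length_ok(seq, min_len, max_len):
    """Step 10: keep reads whose length lies within [min_len, max_len]."""
    return min_len <= len(seq) <= max_len


def dereplicate(reads):
    """Step 11 (pre-processing): collapse identical sequences and build a redundancy map."""
    first_id = {}                    # sequence -> ID of its first occurrence
    redundancy = defaultdict(list)   # representative ID -> IDs of all identical reads
    for read_id, seq in reads.items():
        rep = first_id.setdefault(seq, read_id)
        redundancy[rep].append(read_id)
    unique = {rep: reads[rep] for rep in redundancy}
    return unique, dict(redundancy)


def sample_of(read_id):
    """Assumed ID layout '<read>--<sample>'; the source does not state the order."""
    return read_id.rsplit("--", 1)[1]


def otu_table(clusters, redundancy):
    """Step 15: rows are OTUs, columns are samples, values are read counts."""
    table = {}
    for otu, representatives in clusters.items():
        row = defaultdict(int)
        for rep in representatives:
            for read_id in redundancy[rep]:   # add back the identical reads
                row[sample_of(read_id)] += 1
        table[otu] = dict(row)
    return table


# Toy reads, invented for illustration.
raw = {
    "r1--h1n": "ACGTACGTAC",
    "r2--h1n": "ACGTACGTAC",
    "r3--hc": "ACGTACGTAC",
    "r4--hc": "ACGTTCGTAC",
    "r5--hc": "ACGNACGTAC",   # contains N, removed in step 9
    "r6--h1n": "ACG",         # too short, removed in step 10
}
clean = {}
for read_id, seq in raw.items():
    seq = handle_n(seq, mode="remove")
    if seq is not None and length_ok(seq, min_len=8, max_len=12):
        clean[read_id] = seq

unique, redundancy = dereplicate(clean)
# Placeholder for step 14: real clustering (CD-HIT, UCLUST, McClust) is not reproduced.
clusters = {f"OTU_{i}": [rep] for i, rep in enumerate(unique, start=1)}
table = otu_table(clusters, redundancy)
print(table)
```

The redundancy map is what allows the table to count all reads, even though only one copy of each identical sequence is carried through chimera checking and clustering.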

Taxa

Some taxa remain unclassified. For those OTUs classified to family, the taxa are as follows: Enterobacteriaceae, Gemmatimonadaceae, Spirochaetaceae, Planctomycetaceae, Hyphomicrobiaceae, Rhodospirillaceae, Acetobacteraceae, Rickettsiaceae, Sphingomonadaceae, Ktedonobacteraceae, Polyangiaceae, Nocardioidaceae, Myxococcaceae, Opitutaceae, Legionellaceae, Bradyrhizobiaceae, Cystobacteraceae, Phaselicystidaceae, Acidimicrobiaceae, Rhodocyclaceae, Coxiellaceae, Bdellovibrionaceae, Parachlamydiaceae, Conexibacteraceae, Xanthomonadaceae, Ruminococcaceae, Thermoleophilaceae, Acidimicrobineae_incertae_sedis, Nocardiaceae, Micromonosporaceae, Solirubrobacteraceae, Nannocystaceae, Xanthobacteraceae, Pseudonocardiaceae, Oxalobacteraceae, Neisseriaceae, Chitinophagaceae, Puniceicoccaceae, Burkholderiaceae, Methylococcaceae, Streptosporangineae_incertae_sedis, Cytophagaceae, Catenulisporaceae, Chthonomonadaceae, Aurantimonadaceae, Caulobacteraceae, Thermomonosporaceae, Sinobacteraceae, Pseudomonadaceae, Comamonadaceae, Kofleriaceae, Rhodobacteraceae, Anaerolineaceae, Actinospicaceae, Bacillaceae 2, Beijerinckiaceae, Holophagaceae, Streptosporangiaceae, Verrucomicrobiaceae, Nitrospiraceae, Pasteuriaceae, Geobacteraceae, Peptococcaceae 2, Microbacteriaceae, Armatimonadaceae, Geodermatophilaceae, Simkaniaceae, Cryomorphaceae, Caldilineaceae, Nitrosomonadaceae, Rhodobiaceae, Clostridiaceae 1, Chloroplast, Sphaerobacteraceae, Hydrogenophilaceae, Thiotrichales_incertae_sedis, Oceanospirillaceae, Family II, Family I, Leptotrichiaceae, Rikenellaceae, Haliangiaceae, Methylocystaceae, Sphingobacteriaceae, Intrasporangiaceae, Alcaligenaceae, Erythrobacteraceae, Phyllobacteriaceae, Family XIII, Alteromonadaceae, Gracilibacteraceae, Campylobacteraceae, Bacteriovoracaceae, Sporichthyaceae, Iamiaceae, Saprospiraceae, Demequinaceae, Ectothiorhodospiraceae, Nakamurellaceae, Syntrophaceae, Chromatiaceae, Desulfobacteraceae, Flammeovirgaceae, Micrococcaceae, Planococcaceae, Rhizobiaceae, 
Methylobacteriaceae, Burkholderiales_incertae_sedis, Acidothermaceae, Phycisphaeraceae, Cellulomonadaceae, Streptomycetaceae, Kineosporiaceae, Syntrophobacteraceae, Rhizobiales_incertae_sedis, Thermotogaceae, Veillonellaceae, Porphyromonadaceae, Mycobacteriaceae, Peptococcaceae 1, Moraxellaceae, Lachnospiraceae, Methylophilaceae, Shewanellaceae, Trueperaceae, Mycoplasmataceae, Flavobacteriaceae, Beutenbergiaceae, Bacillales_Incertae Sedis XII, Cryptosporangiaceae, Rhodothermaceae, Desulfohalobiaceae, Family IX, Celerinatantimonadaceae, Leptospiraceae, Prevotellaceae, Rubrobacteraceae, Alicyclobacillaceae, Herpetosiphonaceae, Bacillaceae 1, Thermaceae, Fervidicoccaceae, Hyphomonadaceae, Natranaerobiaceae, Tsukamurellaceae, Syntrophorhabdaceae, Marinilabiaceae, Enterococcaceae, Jiangellaceae, Chlamydiaceae, Halomonadaceae, Ignavibacteriaceae, Cyclobacteriaceae, Deinococcaceae, Brevibacteriaceae, Brucellaceae, Thermoanaerobacteraceae, Corynebacteriaceae, Desulfovibrionaceae, Methanomicrobiaceae, Propionibacteriaceae, Family V, Coriobacteriaceae, Sanguibacteraceae, Thermosporotrichaceae, Thiotrichaceae, Anaplasmataceae, Dermacoccaceae, Leuconostocaceae, Elusimicrobiaceae, Paenibacillaceae 1, Thermoplasmatales_incertae_sedis, Clostridiales_Incertae Sedis XVIII, Oceanospirillales_incertae_sedis, Actinomycetaceae, Desulfuromonadaceae, Victivallaceae, Methanobacteriaceae, Thermoactinomycetaceae 1, Listeriaceae, Alcanivoracaceae, Erysipelotrichaceae, Bacteroidaceae, Family IV, Desulfobulbaceae, Thermogemmatisporaceae, Synergistaceae, Family XI, Promicromonosporaceae, Pseudomonadales_incertae_sedis, Carnobacteriaceae, Euzebyaceae, Thermofilaceae, Patulibacteraceae, Lactobacillaceae, Methanocorpusculaceae, Methanocellaceae, Deferribacteraceae, Glycomycetaceae, Chloroflexaceae, Eubacteriaceae, Aeromonadaceae, Helicobacteraceae, Methanomicrobiales_incertae_sedis, Syntrophomonadaceae, Jonesiaceae, Halobacteriaceae, Piscirickettsiaceae, Staphylococcaceae, Cohaesibacteraceae, 
Desulfarculaceae, Halothiobacillaceae, Ruaniaceae, Streptococcaceae, Thermomicrobiaceae, Paenibacillaceae 2, Cardiobacteriaceae, Psychromonadaceae, Vibrionaceae, Colwelliaceae, Hahellaceae, Sutterellaceae, Actinopolysporaceae, Francisellaceae, Peptostreptococcaceae, Fusobacteriaceae, Clostridiales_Incertae Sedis XI, Clostridiales_Incertae Sedis XIII, Acidaminococcaceae, Bacteroidales_incertae_sedis, Clostridiaceae 4, Entomoplasmataceae, Spiroplasmataceae, Clostridiaceae 2, Acholeplasmataceae, Thermoactinomycetaceae 2, Family VIII, Bogoriellaceae, Pasteurellaceae, Methanosarcinaceae, Idiomarinaceae, Nocardiopsaceae, Bacillales_Incertae Sedis XI, Chlorobiaceae, Pseudoalteromonadaceae, Bacillales_incertae_sedis, Dietziaceae, Clostridiaceae 3, Waddliaceae, Moritellaceae, Desulfurellaceae, Methanosaetaceae, Dermabacteraceae, Saccharospirillaceae, Hydrogenothermaceae, Segniliparaceae, Succinivibrionaceae.
Organization: Harvard Forest. 324 North Main Street, Petersham, MA 01366, USA. Phone (978) 724-3302. Fax (978) 724-3595.
Project: The Harvard Forest Long-Term Ecological Research (LTER) program examines ecological dynamics in the New England region resulting from natural disturbances, environmental change, and human impacts.
Funding: National Science Foundation LTER grants: DEB-8811764, DEB-9411975, DEB-0080592, DEB-0620443, DEB-1237491, DEB-1832210.
Use: This dataset is released to the public under Creative Commons CC0 1.0 (No Rights Reserved). Please keep the dataset creators informed of any plans to use the dataset. Consultation with the original investigators is strongly encouraged. Publications and data products that make use of the dataset should include proper acknowledgement.
License: Creative Commons Zero v1.0 Universal (CC0-1.0)
Citation: Zhou J, Waide R, Brown J. 2023. Soil Bacteria and Archaea in Macrosystems Biodiversity Project at Harvard Forest 2012. Harvard Forest Data Archive: HF262 (v.4). Environmental Data Initiative: https://doi.org/10.6073/pasta/9c0625930a4086f4df39d549e98a9443.

Detailed Metadata

hf262-01: hf16s rRNA bacteria archaea

otu: the ID of the Operational Taxonomic Unit (OTU)
domain: classified taxonomic domain of the OTU (bacteria or archaea)
phylum: classified taxonomic phylum of the OTU
class: classified taxonomic class of the OTU
order: classified taxonomic order of the OTU
family: classified taxonomic family of the OTU
genus: classified taxonomic genus of the OTU
h100e: number of OTUs of a specific classification found in the soil sampled at the H100E location (Harvard Forest, 100 meters east of the central subplot) (unit: number / missing value: NA)
h100n: number of OTUs of a specific classification found in the soil sampled at the H100N location (Harvard Forest, 100 meters north of the central subplot) (unit: number / missing value: NA)
h100s: number of OTUs of a specific classification found in the soil sampled at the H100S location (Harvard Forest, 100 meters south of the central subplot) (unit: number / missing value: NA)
h100w: number of OTUs of a specific classification found in the soil sampled at the H100W location (Harvard Forest, 100 meters west of the central subplot) (unit: number / missing value: NA)
h10e: number of OTUs of a specific classification found in the soil sampled at the H10E location (Harvard Forest, 10 meters east of the central subplot) (unit: number / missing value: NA)
h10n: number of OTUs of a specific classification found in the soil sampled at the H10N location (Harvard Forest, 10 meters north of the central subplot) (unit: number / missing value: NA)
h10s: number of OTUs of a specific classification found in the soil sampled at the H10S location (Harvard Forest, 10 meters south of the central subplot) (unit: number / missing value: NA)
h10w: number of OTUs of a specific classification found in the soil sampled at the H10W location (Harvard Forest, 10 meters west of the central subplot) (unit: number / missing value: NA)
h1e: number of OTUs of a specific classification found in the soil sampled at the H1E location (Harvard Forest, 1 meter east of the central subplot) (unit: number / missing value: NA)
h1n: number of OTUs of a specific classification found in the soil sampled at the H1N location (Harvard Forest, 1 meter north of the central subplot) (unit: number / missing value: NA)
h1s: number of OTUs of a specific classification found in the soil sampled at the H1S location (Harvard Forest, 1 meter south of the central subplot) (unit: number / missing value: NA)
h1w: number of OTUs of a specific classification found in the soil sampled at the H1W location (Harvard Forest, 1 meter west of the central subplot) (unit: number / missing value: NA)
h200e: number of OTUs of a specific classification found in the soil sampled at the H200E location (Harvard Forest, 200 meters east of the central subplot) (unit: number / missing value: NA)
h200n: number of OTUs of a specific classification found in the soil sampled at the H200N location (Harvard Forest, 200 meters north of the central subplot) (unit: number / missing value: NA)
h200s: number of OTUs of a specific classification found in the soil sampled at the H200S location (Harvard Forest, 200 meters south of the central subplot) (unit: number / missing value: NA)
h200w: number of OTUs of a specific classification found in the soil sampled at the H200W location (Harvard Forest, 200 meters west of the central subplot) (unit: number / missing value: NA)
h50e: number of OTUs of a specific classification found in the soil sampled at the H50E location (Harvard Forest, 50 meters east of the central subplot) (unit: number / missing value: NA)
h50n: number of OTUs of a specific classification found in the soil sampled at the H50N location (Harvard Forest, 50 meters north of the central subplot) (unit: number / missing value: NA)
h50s: number of OTUs of a specific classification found in the soil sampled at the H50S location (Harvard Forest, 50 meters south of the central subplot) (unit: number / missing value: NA)
h50w: number of OTUs of a specific classification found in the soil sampled at the H50W location (Harvard Forest, 50 meters west of the central subplot) (unit: number / missing value: NA)
hc: number of OTUs of a specific classification found in the soil sampled at the HC location (Harvard Forest, central subplot) (unit: number / missing value: NA)
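
As a starting point for working with this table, the sketch below reads it with Python's standard csv module and summarizes each sample column. It assumes the file is comma-separated with a header row using the variable names listed above, that counts are stored as numbers, and that NA marks missing values as documented. The function name summarize and the demo rows are invented for illustration and are not real data from this dataset.

```python
import csv
import io

# Taxonomy and ID columns as listed in the metadata; all other columns are samples.
TAXON_COLUMNS = ["otu", "domain", "phylum", "class", "order", "family", "genus"]


def summarize(handle):
    """Return (totals, detected): per-sample sum of values and number of nonzero rows."""
    reader = csv.DictReader(handle)
    sample_columns = [c for c in reader.fieldnames if c not in TAXON_COLUMNS]
    totals = dict.fromkeys(sample_columns, 0)
    detected = dict.fromkeys(sample_columns, 0)
    for row in reader:
        for col in sample_columns:
            value = row[col]
            if value in ("NA", ""):      # documented missing-value code
                continue
            count = int(float(value))    # assumes numeric counts
            totals[col] += count
            if count > 0:
                detected[col] += 1
    return totals, detected


# Invented demo rows with two sample columns, only to exercise the function.
demo = io.StringIO(
    "otu,domain,phylum,class,order,family,genus,hc,h1n\n"
    "OTU_1,Bacteria,Proteobacteria,Alphaproteobacteria,Rhizobiales,"
    "Bradyrhizobiaceae,Bradyrhizobium,12,0\n"
    "OTU_2,Bacteria,Acidobacteria,NA,NA,NA,NA,3,NA\n"
)
totals, detected = summarize(demo)
print(totals)
print(detected)
```

To run this on the actual data file, pass an open file handle instead of the demo buffer, for example `with open(path, newline="") as handle: summarize(handle)`, where path points to a local copy of hf262-01.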