
Gut-to-soil axis tutorial đŸ’©đŸŒ±

Background

In this tutorial you’ll learn an end-to-end microbiome marker gene data science workflow, building on data presented in Meilander et al. (2024): Upcycling Human Excrement: The Gut Microbiome to Soil Microbiome Axis. The data used here is a subset (a single sequencing run) of that generated for the paper, specifically selected so that this tutorial can be run quickly on a personal computer. The full data set for the paper can be found in the paper’s Artifact Archive. In the final step (not yet written, as of 11 March 2025), you’ll learn how to adapt the workflow for use in analyzing your own data using Provenance Replay.

The data used in this tutorial was generated using the Earth Microbiome Project protocol. Specifically, the hypervariable region 4 (V4) of the 16S rRNA gene was amplified using the F515-R806 primers - a broad-coverage primer pair for Bacteria that also amplifies some Archaea. Paired-end sequencing was performed on an Illumina MiSeq. Full details are presented in Meilander et al. (2024).

Sample metadata

Before starting the analysis, explore the sample metadata to familiarize yourself with the samples used in this study. The following command will download the sample metadata as tab-separated text and save it in the file sample-metadata.tsv. This sample-metadata.tsv file is used throughout the rest of the tutorial.

wget -O 'sample-metadata.tsv' \
  'https://gut-to-soil-tutorial.readthedocs.io/en/latest/data/gut-to-soil/sample-metadata.tsv'

QIIME 2’s metadata plugin provides a Visualizer called tabulate that generates a convenient view of a sample metadata file. Let’s run this, and then we’ll look at the result. Here’s the first QIIME 2 command that you should run in this tutorial:

qiime metadata tabulate \
  --m-input-file sample-metadata.tsv \
  --o-visualization sample-metadata.qzv

This will generate a QIIME 2 Visualization. Visualizations can be viewed by loading them with QIIME 2 View: navigate to QIIME 2 View in your web browser, and drag and drop the visualization that was created to view it.
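
If QIIME 2 is installed on a computer with a graphical environment, a convenient alternative (assuming a local installation) is to open the visualization directly from the command line:

qiime tools view sample-metadata.qzv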

Access already-imported QIIME 2 data

This tutorial begins with paired-end read sequencing data that has already been demultiplexed and imported into a QIIME 2 Artifact. Because sequence data can be delivered to you in many different forms, it’s not possible to cover the varieties here. Instead we refer you to How to import and export data to learn how to import your data. If you want to learn why importing is necessary, refer to Why importing is necessary.

wget -O 'demux.qza' \
  'https://gut-to-soil-tutorial.readthedocs.io/en/latest/data/gut-to-soil/demux.qza'
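
For illustration only: if your own demultiplexed paired-end reads had been delivered in Casava 1.8 format, importing them yourself might look roughly like the following (the input directory name here is hypothetical, and your data may require a different --type or --input-format; see the importing guide linked above):

qiime tools import \
  --type 'SampleData[PairedEndSequencesWithQuality]' \
  --input-path my-demultiplexed-reads/ \
  --input-format CasavaOneEightSingleLanePerSampleDirFmt \
  --output-path demux.qza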

Summarize demultiplexed sequences

When you have demultiplexed sequence data, the next step is typically to generate a visual summary of it. This allows you to determine how many sequences were obtained per sample, and also to get a summary of the distribution of sequence qualities at each position in your sequence data.

qiime demux summarize \
  --i-data demux.qza \
  --o-visualization demux.qzv

Upstream data analysis

Generally, the term “upstream” is used to refer to data analysis steps that take place before the feature table has been constructed, and “downstream” is used to refer to data analysis steps that take place after the feature table has been constructed. Let’s jump into our upstream analysis.

Sequence quality control and feature table construction

QIIME 2 plugins are available for several quality control methods, including DADA2, Deblur, and basic quality-score-based filtering. In this tutorial we present this step using DADA2. The result of this method will be a FeatureTable[Frequency] QIIME 2 artifact, which contains counts (frequencies) of each unique sequence in each sample in the dataset, and a FeatureData[Sequence] QIIME 2 artifact, which maps feature identifiers in the FeatureTable to the sequences they represent.

DADA2 is a pipeline for detecting and correcting (where possible) errors in Illumina amplicon sequence data. As implemented in the dada2 plugin, this quality control process will additionally filter any phiX reads (commonly present in marker gene Illumina sequence data) that are identified in the sequencing data, filter chimeric sequences, and merge paired-end reads.

The denoise-paired action, which we’ll use here, requires four parameters that are used in quality filtering:

  • trim-left-f a, which trims off the first a bases of each forward read
  • trunc-len-f b, which truncates each forward read at position b
  • trim-left-r c, which trims off the first c bases of each reverse read
  • trunc-len-r d, which truncates each reverse read at position d

These parameters allow the user to remove low quality regions of the sequences. To determine what values to pass for these parameters, you should review the Interactive Quality Plot tab in the demux.qzv file that was generated above.
Solution to Exercise 2

The quality of the initial bases seems to be high, so I choose not to trim any bases from the beginning of the sequences. The quality also seems good all the way out to the end, though maybe dropping off after 250 bases. I’ll therefore truncate at 250. I’ll keep these values the same for both the forward and reverse reads, though that is not a requirement.

Now run your DADA2 command. This step may take up to 10 minutes to complete - it’s the longest running step in this tutorial.

qiime dada2 denoise-paired \
  --i-demultiplexed-seqs demux.qza \
  --p-trim-left-f 0 \
  --p-trunc-len-f 250 \
  --p-trim-left-r 0 \
  --p-trunc-len-r 250 \
  --o-representative-sequences asv-seqs.qza \
  --o-table asv-table.qza \
  --o-denoising-stats stats.qza

One of the outputs created by DADA2 is a summary of the denoising run. That is generated as an Artifact, so it can’t be viewed directly. However, this is one of many QIIME 2 types that can be viewed as Metadata - a very powerful concept that we’ll use again later in this tutorial. Learning to view artifacts as Metadata creates nearly infinite possibilities for how you can explore your microbiome data with QIIME 2.

Here, we’ll again use the metadata plugin’s tabulate visualizer, but this time we’ll apply it to the DADA2 statistics.

qiime metadata tabulate \
  --m-input-file stats.qza \
  --o-visualization stats.qzv
Solution to Exercise 4
qiime metadata tabulate \
  --m-input-file sample-metadata.tsv stats.qza \
  --o-visualization sample-metadata-w-dada2-stats.qzv

Feature table and feature data summaries

After DADA2 completes, you’ll want to explore the resulting data. You can do this using the following two commands, which will create visual summaries of the data. The feature-table summarize-plus action will give you information on how many sequences are associated with each sample and with each feature, histograms of those distributions, and some related summary statistics.

qiime feature-table summarize-plus \
  --i-table asv-table.qza \
  --m-metadata-file sample-metadata.tsv \
  --o-summary asv-table.qzv \
  --o-sample-frequencies sample-frequencies.qza \
  --o-feature-frequencies asv-frequencies.qza

The feature-table tabulate-seqs action will provide a mapping of feature IDs to sequences, and provide links to easily BLAST each sequence against the NCBI nt database. We can also include the feature frequency information in this visualization by passing it as metadata, similar to how we merged metadata in the exercise above. In this case, however, we’re looking at feature metadata, as opposed to sample metadata. As far as QIIME 2 is concerned, there is no difference between these two - in our case, it’ll only be the identifiers that differ.

This visualization will be very useful later in the tutorial, when you want to learn more about specific features that are important in the data set.

qiime feature-table tabulate-seqs \
  --i-data asv-seqs.qza \
  --m-metadata-file asv-frequencies.qza \
  --o-visualization asv-seqs.qzv

Filtering features from a feature table

If you review the tabulated feature sequences, or the feature detail table of the feature table summary, you’ll notice that there are many sequences that are observed in only a single sample. Let’s filter those out to reduce the number of sequences we’re working with - this will speed up several slower steps that are coming up.

This is a two-step process. First we filter our feature table, and then we use the new feature table to filter our sequences to only the ones that are contained in the new table.

qiime feature-table filter-features \
  --i-table asv-table.qza \
  --p-min-samples 2 \
  --o-filtered-table asv-table-ms2.qza
qiime feature-table filter-seqs \
  --i-data asv-seqs.qza \
  --i-table asv-table-ms2.qza \
  --o-filtered-data asv-seqs-ms2.qza
Solution to Exercise 7

Here’s the command you would use:

qiime feature-table summarize-plus \
  --i-table asv-table-ms2.qza \
  --m-metadata-file sample-metadata.tsv \
  --o-summary asv-table-ms2.qzv \
  --o-sample-frequencies sample-frequencies-ms2.qza \
  --o-feature-frequencies asv-frequencies-ms2.qza

Be sure to run this as we’re going to use one of the results below.

Taxonomic annotation

Before we complete our upstream analysis steps, we’ll generate taxonomic annotations for our sequences using the feature-classifier plugin.

First, we’ll download a pre-trained classifier artifact.

wget -O 'suboptimal-16S-rRNA-classifier.qza' \
  'https://gut-to-soil-tutorial.readthedocs.io/en/latest/data/gut-to-soil/suboptimal-16S-rRNA-classifier.qza'

Then, we’ll apply it to our sequences using classify-sklearn.

qiime feature-classifier classify-sklearn \
  --i-classifier suboptimal-16S-rRNA-classifier.qza \
  --i-reads asv-seqs-ms2.qza \
  --o-classification taxonomy.qza

Then, to get an initial look at our taxonomic classifications, let’s integrate the taxonomy into a sequence summary like the one we generated above.

## constructing result collection ##
# Note: construct_result_collection is a helper used when building this
# documentation (it is not a QIIME 2 command). It stages taxonomy.qza into
# the taxonomy-collection/ directory under the key Greengenes-13-8, so that
# the directory can be passed as a result collection below.
rc_name=taxonomy-collection/
ext=.qza
keys=( Greengenes-13-8 )
names=( taxonomy.qza )
construct_result_collection
##
qiime feature-table tabulate-seqs \
  --i-data asv-seqs-ms2.qza \
  --i-taxonomy taxonomy-collection/ \
  --m-metadata-file asv-frequencies-ms2.qza \
  --o-visualization asv-seqs-ms2.qzv

Building a tree for phylogenetic diversity calculations

QIIME 2 supports several phylogenetic diversity metrics, including Faith’s Phylogenetic Diversity and weighted and unweighted UniFrac. In addition to counts of features per sample (i.e., the data in the FeatureTable[Frequency] QIIME 2 artifact), these metrics require a rooted phylogenetic tree relating the features to one another. The amplicon distribution offers a few ways to build these trees, including a reference-based approach in the fragment-insertion plugin and de novo (i.e., reference-free) approaches in the phylogeny plugin.

The reference-based approach, by default, is specific to 16S rRNA marker gene analysis. We could use that here, but the runtime is too long for our documentation.[1] If you’d like to see this demonstrated, you can refer to the Parkinson’s Mouse tutorial.

The de novo approach is known to generate low quality trees, but can be used with any marker gene (not just 16S). If you’d like to see this demonstrated, you can refer to the Moving Pictures tutorial đŸŽ„.

For those reasons, we’re going to skip building phylogenetic trees and instead use an analog of phylogenetic diversity metrics here.
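
For reference only (we won’t run or use this here), a de novo tree could be built with the phylogeny plugin’s align-to-tree-mafft-fasttree pipeline, which aligns the sequences, masks highly variable positions, and builds an unrooted and then a rooted tree. This sketch assumes the filtered sequences from above; the output filenames are illustrative:

qiime phylogeny align-to-tree-mafft-fasttree \
  --i-sequences asv-seqs-ms2.qza \
  --o-alignment aligned-asv-seqs-ms2.qza \
  --o-masked-alignment masked-aligned-asv-seqs-ms2.qza \
  --o-tree unrooted-tree.qza \
  --o-rooted-tree rooted-tree.qza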

Downstream data analysis

As mentioned above, we tend to think of “downstream” analysis as beginning with a feature table, taxonomic annotation of our features, and optionally a phylogenetic tree. Now that we have those (with the exception of the tree, which we won’t use here), let’s jump in. This is where it starts to get fun! ⛷

Kmerization of our features

We’ll start here by calculating alpha and beta diversity metrics. To do this, in lieu of a phylogenetic tree, we’re going to use a stand-alone QIIME 2 plugin, q2-kmerizer. This plugin splits ASV sequences into their constituent kmers, and creates a new feature table where those kmers (instead of ASVs) are the features. The developers of q2-kmerizer showed that this enables non-phylogenetic diversity metrics to achieve results highly correlated with those achieved by phylogenetic diversity metrics.

Let’s generate our kmer feature table.

qiime kmerizer seqs-to-kmers \
  --i-table asv-table-ms2.qza \
  --i-sequences asv-seqs-ms2.qza \
  --o-kmer-table kmer-table.qza

Let’s also generate a summary of the feature table.

qiime feature-table summarize-plus \
  --i-table kmer-table.qza \
  --m-metadata-file sample-metadata.tsv \
  --o-summary kmer-table.qzv \
  --o-sample-frequencies kmer-sample-frequencies.qza \
  --o-feature-frequencies kmer-feature-frequencies.qza

Computing bootstrapped alpha and beta diversity metrics

QIIME 2’s diversity analyses are available through the diversity plugin, which supports computing alpha and beta diversity metrics, applying related statistical tests, and generating interactive visualizations. A relatively new stand-alone plugin, q2-boots, mirrors the interface of the diversity metric calculation actions in the diversity plugin, but generates more robust results because it integrates rarefaction and/or bootstrapping. Let’s use that here, instead of the diversity plugin.

We’ll apply the core-metrics action, which bootstraps a FeatureTable[Frequency] (i.e., samples with replacement) to a user-specified sampling depth n times, computes several alpha and beta diversity metrics on each of the n bootstrapped feature tables, and then creates averaged alpha and beta diversity data artifacts as output. Those resulting artifacts can be used anywhere that the corresponding artifacts from the diversity plugin could be used. The core-metrics action also generates principal coordinates analysis (PCoA) plots using the Emperor plugin for each of the beta diversity metrics.

The metrics computed by default are:

  • Alpha diversity
    • Shannon’s diversity index (a quantitative measure of community richness)
    • Observed Features (a qualitative measure of community richness)
    • Evenness (or Pielou’s Evenness; a measure of community evenness)
  • Beta diversity
    • Jaccard distance (a qualitative measure of community dissimilarity)
    • Bray-Curtis distance (a quantitative measure of community dissimilarity)

An important parameter that needs to be provided to this action is sampling-depth, which is the even sampling (i.e., bootstrapping or rarefaction) depth. Because most diversity metrics are sensitive to different sampling depths across different samples, the tables are randomly subsampled such that the total frequency for each sample is the user-specified sampling depth. For example, if you set sampling-depth=500, this step will subsample the counts in each sample so that each sample in the resulting table has a total frequency of 500. If the total frequency for any sample is smaller than this value, that sample will be dropped from the diversity analysis. Choosing this value is tricky. We recommend making your choice by reviewing the information presented in the kmer-table.qzv file that was created above.

I’m going to choose a value that is around the first quartile of the per-sample total frequencies.

qiime boots core-metrics \
  --i-table kmer-table.qza \
  --m-metadata-file sample-metadata.tsv \
  --p-sampling-depth 22000 \
  --p-n 10 \
  --p-replacement \
  --p-alpha-average-method median \
  --p-beta-average-method medoid \
  --output-dir boots-core-metrics
## accessing result collection member ##
ln -s boots-core-metrics/distance_matrices/jaccard_distance_matrix.qza unweighted-dm.qza
##
## accessing result collection member ##
ln -s boots-core-metrics/pcoas/jaccard.qza unweighted-pcoa.qza
##
## accessing result collection member ##
ln -s boots-core-metrics/distance_matrices/braycurtis_distance_matrix.qza weighted-dm.qza
##
## accessing result collection member ##
ln -s boots-core-metrics/pcoas/braycurtis.qza weighted-pcoa.qza
##
## accessing result collection member ##
ln -s boots-core-metrics/alpha_diversities/observed_features.qza richness-vector.qza
##
## accessing result collection member ##
ln -s boots-core-metrics/alpha_diversities/evenness.qza evenness-vector.qza
##

After computing diversity metrics, we can begin to explore the microbial composition of the samples in the context of the sample metadata. You can review the sample metadata using one of the tabulated views of this file that we created above.

Integrating additional information into PCoA scatter plots

The PCoA results that were computed by core-metrics are viewable as metadata, which opens them up for use with the vizard plugin. Vizard is a general purpose plotting plugin, and works with any artifacts that can be viewed as metadata. This creates a world of possibilities for how you visualize your microbiome data with QIIME 2. For example, let’s integrate our Jaccard PCoA results with our Observed Features data and our sample metadata in a vizard scatterplot.

qiime vizard scatterplot-2d \
  --m-metadata-file sample-metadata.tsv unweighted-pcoa.qza richness-vector.qza \
  --o-visualization unweighted-diversity-scatterplot.qzv

Let’s generate another vizard plot, but this time using our Bray-Curtis PCoA results.

qiime vizard scatterplot-2d \
  --m-metadata-file sample-metadata.tsv weighted-pcoa.qza richness-vector.qza \
  --o-visualization weighted-diversity-scatterplot.qzv

Alpha rarefaction plotting

In this section we’ll explore alpha diversity as a function of sampling depth using the alpha-rarefaction action. This visualizer computes one or more alpha diversity metrics at multiple sampling depths, in steps between 1 (optionally controlled with min-depth) and the value provided as max-depth. At each sampling depth step, 10 rarefied tables will be generated, and the diversity metrics will be computed for all samples in the tables. The number of iterations (rarefied tables computed at each sampling depth) can be controlled with the iterations parameter. Average diversity values will be plotted for each sample at each even sampling depth, and samples can be grouped based on metadata in the resulting visualization if sample metadata is provided.

The value that you provide for max-depth should be determined by reviewing the “Frequency per sample” information presented in the kmer-table.qzv file. In general, choosing a value that is somewhere around the median frequency seems to work well, but you may want to increase that value if the lines in the resulting rarefaction plot don’t appear to be leveling out, or decrease that value if you seem to be losing many of your samples due to low total frequencies closer to the minimum sampling depth than the maximum sampling depth.

qiime diversity alpha-rarefaction \
  --i-table kmer-table.qza \
  --p-max-depth 62000 \
  --m-metadata-file sample-metadata.tsv \
  --o-visualization alpha-rarefaction.qzv

The visualization will have two plots. The top plot is an alpha rarefaction plot, and is primarily used to determine if the richness of the samples has been fully observed or sequenced. If the lines in the plot appear to “level out” (i.e., approach a slope of zero) at some sampling depth along the x-axis, that suggests that collecting additional sequences beyond that sampling depth would not be likely to result in the observation of additional features. If the lines in the plot don’t level out, this may be because the richness of the samples hasn’t been fully observed yet (because too few sequences were collected), or it could be an indicator that a lot of sequencing error remains in the data (which is being mistaken for novel diversity).

The bottom plot in this visualization is important when grouping samples by metadata. It illustrates the number of samples that remain in each group when the feature table is rarefied to each sampling depth. If a given sampling depth d is larger than the total frequency of a sample s (i.e., the number of sequences that were obtained for sample s), it is not possible to compute the diversity metric for sample s at sampling depth d. If many of the samples in a group have lower total frequencies than d, the average diversity presented for that group at d in the top plot will be unreliable because it will have been computed on relatively few samples. When grouping samples by metadata, it is therefore essential to look at the bottom plot to ensure that the data presented in the top plot is reliable.

Taxonomic analysis

Next, we can view the taxonomic composition of our samples with interactive bar plots. Generate those plots with the following command and then open the visualization.

qiime taxa barplot \
  --i-table asv-table-ms2.qza \
  --i-taxonomy taxonomy.qza \
  --m-metadata-file sample-metadata.tsv \
  --o-visualization taxa-bar-plots.qzv
Solution to Exercise 15

We use the ASV table here because the feature ids in that table are the same as the ones used in the FeatureData[Taxonomy] artifact. Additionally, kmerization of our data is a tool used for computing diversity metrics - not something we generally intend to use throughout our analyses.

Differential abundance testing with ANCOM-BC

ANCOM-BC is a compositionally-aware linear regression model that allows testing for differentially abundant features across sample groups while also implementing bias correction. This can be accessed using the ancombc action in the composition plugin.

We’ll perform this analysis in a few steps.

Filter samples from the feature table

First, we’ll filter the samples in our feature table so that only the three sample types with the most samples are retained.

qiime feature-table filter-samples \
  --i-table asv-table-ms2.qza \
  --m-metadata-file sample-metadata.tsv \
  --p-where '[SampleType] IN ("Human Excrement Compost", "Human Excrement", "Food Compost")' \
  --o-filtered-table asv-table-ms2-dominant-sample-types.qza

Collapse the ASVs into genera

Then, we’ll collapse our ASVs into genera (i.e., level 6 of the Greengenes taxonomy), to get more useful annotations of the features (and to learn how to perform this grouping).

qiime taxa collapse \
  --i-table asv-table-ms2-dominant-sample-types.qza \
  --i-taxonomy taxonomy.qza \
  --p-level 6 \
  --o-collapsed-table genus-table-ms2-dominant-sample-types.qza

Apply differential abundance testing

Then, we’ll apply ANCOM-BC to see which genera are differentially abundant across those sample types. I specify a reference level here, as this defines what each group is compared against. Since the focus of this study is Human Excrement Compost (HEC), I choose that as my reference level. That will let us see which genera are over- or under-represented in the HEC samples relative to the other two sample groups.

qiime composition ancombc \
  --i-table genus-table-ms2-dominant-sample-types.qza \
  --m-metadata-file sample-metadata.tsv \
  --p-formula SampleType \
  --p-reference-levels 'SampleType::Human Excrement Compost' \
  --o-differentials genus-ancombc.qza

Finally, we’ll visualize the results.

qiime composition da-barplot \
  --i-data genus-ancombc.qza \
  --p-significance-threshold 0.001 \
  --p-level-delimiter ';' \
  --o-visualization genus-ancombc.qzv

That’s it for now, but more is coming soon!

In the near future (as of 11 March 2025) we plan to integrate analyses using the sample-classifier plugin and longitudinal plugin. In the meantime, here are some suggestions to continue your learning:

  1. Build a machine learning classifier that classifies samples according to the three dominant sample types in the feature table that we used with ANCOM-BC. (Hint: see classify-samples; a sketch of such a command is shown after this list.)
  2. Perform a longitudinal analysis that tracks samples from different buckets over time. Which taxa change most over time? (Hint: see feature-volatility.)
  3. Remember that the full data set (five sequencing runs) is available in the gut-to-soil Artifact Archive. Grab one of the larger sequencing runs (we worked with a small sequencing run that was generated as a preliminary test), and adapt the commands in this tutorial to work on a bigger data set.
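
As a starting point for the first suggestion above, here’s a rough sketch of what a classify-samples command might look like; the random seed is illustrative, and you’ll likely want to explore additional parameters (see the classify-samples documentation):

qiime sample-classifier classify-samples \
  --i-table asv-table-ms2-dominant-sample-types.qza \
  --m-metadata-file sample-metadata.tsv \
  --m-metadata-column SampleType \
  --p-random-state 42 \
  --output-dir sample-classifier-results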

We’re also in the process of refactoring our statistical methods for assessing alpha and beta diversity across groups, using the new stats plugin. We’re therefore holding off on integrating statistical analysis until we have that ready. In the meantime, you can refer to the Moving Pictures tutorial đŸŽ„, as well as the sample-classifier and longitudinal tutorials.

Replay provenance (work in progress!)

You might next want to try to adapt the commands presented in this tutorial to your own data, adjusting parameter settings and metadata column headers as relevant. QIIME 2’s provenance replay functionality can help with this. Assuming that you ran all of the steps above in a directory called gut-to-soil/, run the following command to generate a template script that you can adapt for your workflow:

qiime tools replay-provenance \
  --in-fp gut-to-soil/taxa-bar-plots.qzv \
  --out-fp g2s-replayed.bash

If you need help, head over to the QIIME 2 Forum.

Footnotes
  1. The resource requirements exceed those provided by the Read the Docs (RTD) build system, which is used to build the documentation that you’re reading. RTD provides systems with 7GB of RAM for 30 minutes maximum to build documentation. That’s a very reasonable (and generous) allocation for building documentation, so we choose to work within those constraints rather than creating our own documentation build system like we’ve had in the past (e.g., for https://docs.qiime2.org).