Background¶
In this tutorial you'll learn an end-to-end microbiome marker gene data science workflow, building on data presented in Meilander et al. (2024): Upcycling Human Excrement: The Gut Microbiome to Soil Microbiome Axis. The data used here is a subset (a single sequencing run) of that generated for the paper, specifically selected so that this tutorial can be run quickly on a personal computer. The full data set for the paper can be found in the paper's Artifact Archive. In the final step (not yet written, as of 11 March 2025), you'll learn how to adapt the workflow for use in analyzing your own data using Provenance Replay.
The data used in this tutorial was generated using the Earth Microbiome Project protocol. Specifically, the hypervariable region 4 (V4) of the 16S rRNA gene was amplified using the F515-R806 primers - a broad-coverage primer pair for Bacteria that also amplifies some Archaea. Paired-end sequencing was performed on an Illumina MiSeq. Full details are presented in Meilander et al. (2024).
Sample metadata¶
Before starting the analysis, explore the sample metadata to familiarize yourself with the samples used in this study.
The following command will download the sample metadata as tab-separated text and save it in the file sample-metadata.tsv. This sample-metadata.tsv file is used throughout the rest of the tutorial.
wget -O 'sample-metadata.tsv' \
'https://gut-to-soil-tutorial.readthedocs.io/en/latest/data/gut-to-soil/sample-metadata.tsv'
from qiime2 import Metadata
from urllib import request
url = 'https://gut-to-soil-tutorial.readthedocs.io/en/latest/data/gut-to-soil/sample-metadata.tsv'
fn = 'sample-metadata.tsv'
request.urlretrieve(url, fn)
sample_metadata_md = Metadata.load(fn)
- Using the Upload Data tool:
  - On the first tab (Regular), press the Paste/Fetch data button at the bottom.
    - Set "Name" (first text-field) to: sample-metadata.tsv
    - In the larger text-area, copy-and-paste: https://gut-to-soil-tutorial.readthedocs.io/en/latest/data/gut-to-soil/sample-metadata.tsv
    - ("Type", "Genome", and "Settings" can be ignored)
  - Press the Start button at the bottom.
library(reticulate)
Metadata <- import("qiime2")$Metadata
request <- import("urllib")$request
url <- 'https://gut-to-soil-tutorial.readthedocs.io/en/latest/data/gut-to-soil/sample-metadata.tsv'
fn <- 'sample-metadata.tsv'
request$urlretrieve(url, fn)
sample_metadata_md <- Metadata$load(fn)
sample_metadata = use.init_metadata_from_url(
'sample-metadata',
'https://www.dropbox.com/scl/fi/irosimbb1aud1aa7frzxf/sample-metadata.tsv?rlkey=f45jpxzajjz9xx9vpvfnf1zjx&st=nahafuvy&dl=1')
QIIME 2's metadata plugin provides a Visualizer called tabulate that generates a convenient view of a sample metadata file. Let's run this, and then we'll look at the result. Here's the first QIIME 2 command that you should run in this tutorial:
qiime metadata tabulate \
--m-input-file sample-metadata.tsv \
--o-visualization sample-metadata.qzv
import qiime2.plugins.metadata.actions as metadata_actions
sample_metadata_viz, = metadata_actions.tabulate(
input=sample_metadata_md,
)
- Using the qiime2 metadata tabulate tool:
  - For "input", perform the following steps:
    - Leave as Metadata from TSV
    - Set "Metadata Source" to sample-metadata.tsv
  - Press the Execute button.
- Once completed, for the new entry in your history, use the Edit button to set the name as follows (renaming is optional, but it will make any subsequent steps easier to complete):

| History Name | "Name" to set (be sure to press [Save]) |
| --- | --- |
| #: qiime2 metadata tabulate [...] : visualization.qzv | sample-metadata.qzv |
metadata_actions <- import("qiime2.plugins.metadata.actions")
action_results <- metadata_actions$tabulate(
input=sample_metadata_md,
)
sample_metadata_viz <- action_results$visualization
use.action(
use.UsageAction(plugin_id='metadata',
action_id='tabulate'),
use.UsageInputs(input=sample_metadata),
use.UsageOutputNames(visualization='sample_metadata')
)
This will generate a QIIME 2 Visualization. Visualizations can be viewed by loading them with QIIME 2 View. Navigate to QIIME 2 View, and drag and drop the visualization that was created to view it.
You can learn more about viewing Visualizations, including alternatives to QIIME 2 View if you can't use that for any reason, in Using QIIME 2.
Access already-imported QIIME 2 data¶
This tutorial begins with paired-end read sequencing data that has already been demultiplexed and imported into a QIIME 2 Artifact. Because sequence data can be delivered to you in many different forms, it's not possible to cover the varieties here. Instead we refer you to How to import and export data to learn how to import your data. If you want to learn why importing is necessary, refer to Why importing is necessary.
You should ask your sequencing center to provide data already demultiplexed. In some cases, they may be able to provide a QIIME 2 demux artifact. If not, ask them to provide a fastq manifest file for your sequencing data. It should be easy for them to generate.
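A fastq manifest is just a small tab-separated file mapping sample IDs to the absolute paths of their fastq files. As a sketch, here's how one for paired-end reads could be generated with Python's csv module; the sample IDs and file paths below are hypothetical placeholders, and the column names shown are those expected by QIIME 2's PairedEndFastqManifestPhred33V2 input format.

```python
import csv

# Hypothetical samples and fastq paths -- substitute your own.
samples = {
    "sample-1": ("/data/sample-1_R1.fastq.gz", "/data/sample-1_R2.fastq.gz"),
    "sample-2": ("/data/sample-2_R1.fastq.gz", "/data/sample-2_R2.fastq.gz"),
}

with open("manifest.tsv", "w", newline="") as f:
    writer = csv.writer(f, delimiter="\t")
    # Header row expected by the PairedEndFastqManifestPhred33V2 format.
    writer.writerow(["sample-id",
                     "forward-absolute-filepath",
                     "reverse-absolute-filepath"])
    for sample_id, (fwd, rev) in sorted(samples.items()):
        writer.writerow([sample_id, fwd, rev])
```

With a manifest like this in hand, importing would look something like `qiime tools import --type 'SampleData[PairedEndSequencesWithQuality]' --input-path manifest.tsv --input-format PairedEndFastqManifestPhred33V2 --output-path demux.qza`.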
wget -O 'demux.qza' \
'https://gut-to-soil-tutorial.readthedocs.io/en/latest/data/gut-to-soil/demux.qza'
from qiime2 import Artifact
url = 'https://gut-to-soil-tutorial.readthedocs.io/en/latest/data/gut-to-soil/demux.qza'
fn = 'demux.qza'
request.urlretrieve(url, fn)
demux = Artifact.load(fn)
- Using the Upload Data tool:
  - On the first tab (Regular), press the Paste/Fetch data button at the bottom.
    - Set "Name" (first text-field) to: demux.qza
    - In the larger text-area, copy-and-paste: https://gut-to-soil-tutorial.readthedocs.io/en/latest/data/gut-to-soil/demux.qza
    - ("Type", "Genome", and "Settings" can be ignored)
  - Press the Start button at the bottom.
Artifact <- import("qiime2")$Artifact
url <- 'https://gut-to-soil-tutorial.readthedocs.io/en/latest/data/gut-to-soil/demux.qza'
fn <- 'demux.qza'
request$urlretrieve(url, fn)
demux <- Artifact$load(fn)
demux = use.init_artifact_from_url(
'demux',
'https://www.dropbox.com/scl/fi/73f6rmcq7lelug5qbuhl6/demux-10p.qza?rlkey=s0hoh326fes3z2usvgs62tv3c&st=caz1avkn&dl=1')
Summarize demultiplexed sequences¶
When you have demultiplexed sequence data, the next step is typically to generate a visual summary of it. This allows you to determine how many sequences were obtained per sample, and also to get a summary of the distribution of sequence qualities at each position in your sequence data.
qiime demux summarize \
--i-data demux.qza \
--o-visualization demux.qzv
import qiime2.plugins.demux.actions as demux_actions
demux_viz, = demux_actions.summarize(
data=demux,
)
- Using the qiime2 demux summarize tool:
  - Set "data" to #: demux.qza
  - Press the Execute button.
- Once completed, for the new entry in your history, use the Edit button to set the name as follows (renaming is optional, but it will make any subsequent steps easier to complete):

| History Name | "Name" to set (be sure to press [Save]) |
| --- | --- |
| #: qiime2 demux summarize [...] : visualization.qzv | demux.qzv |
demux_actions <- import("qiime2.plugins.demux.actions")
action_results <- demux_actions$summarize(
data=demux,
)
demux_viz <- action_results$visualization
use.action(
use.UsageAction(plugin_id='demux',
action_id='summarize'),
use.UsageInputs(data=demux),
use.UsageOutputNames(visualization='demux'))
How many samples are represented in this sequencing data? What is the median number of sequence reads obtained per sample? What is the median quality score at position 200 of the forward reads?
Upstream data analysis¶
Generally, the term "upstream" is used to refer to data analysis through construction of the feature table, and "downstream" is used to refer to analysis of the feature table. Let's jump into our upstream analysis.
Sequence quality control and feature table construction¶
QIIME 2 plugins are available for several quality control methods, including DADA2, Deblur, and basic quality-score-based filtering.
In this tutorial we present this step using DADA2.
The result of this method will be a FeatureTable[Frequency] QIIME 2 artifact, which contains counts (frequencies) of each unique sequence in each sample in the dataset, and a FeatureData[Sequence] QIIME 2 artifact, which maps feature identifiers in the FeatureTable to the sequences they represent.
DADA2 is a pipeline for detecting and correcting (where possible) errors in Illumina amplicon sequence data. As implemented in the dada2 plugin, this quality control process will additionally filter any phiX reads (commonly present in marker gene Illumina sequence data) that are identified in the sequencing data, filter chimeric sequences, and merge paired-end reads.
The denoise-paired action, which we'll use here, requires four parameters that are used in quality filtering:

- trim-left-f a, which trims off the first a bases of each forward read
- trunc-len-f b, which truncates each forward read at position b
- trim-left-r c, which trims off the first c bases of each reverse read
- trunc-len-r d, which truncates each reverse read at position d

This allows the user to remove low-quality regions of the sequences. To determine what values to pass for these parameters, you should review the Interactive Quality Plot tab in the demux.qzv file that was generated above.
Based on the plots you see in the demux.qzv file that was generated above, what values would you choose for trim-left-f and trunc-len-f in this case? What about trim-left-r and trunc-len-r?
Solution to Exercise 2
The quality of the initial bases seems to be high, so I choose not to trim any bases from the beginning of the sequences. The quality also seems good all the way out to the end, though maybe dropping off after 250 bases. I'll therefore truncate at 250. I'll keep these values the same for both the forward and reverse reads, though that is not a requirement.
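The reasoning above can be expressed as a simple rule of thumb: truncate at the last position before the per-position median quality drops below some threshold you're comfortable with. Here's a minimal sketch of that heuristic over made-up median quality scores (in practice you would read the real values off the Interactive Quality Plot); this is an illustration of the thought process, not a QIIME 2 function.

```python
def choose_trunc_len(median_qualities, min_quality=30):
    """Return the longest read prefix whose per-position median quality
    stays at or above min_quality (0 if even the first position fails)."""
    trunc_len = 0
    for position, q in enumerate(median_qualities, start=1):
        if q < min_quality:
            break
        trunc_len = position
    return trunc_len

# Made-up median qualities for a 251-cycle run: high until ~position 250.
median_qualities = [38] * 230 + [35] * 20 + [24]
print(choose_trunc_len(median_qualities))  # 250 for this synthetic profile
```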
Now run your DADA2 command. This step may take up to 10 minutes to complete - it's the longest-running step in this tutorial.
qiime dada2 denoise-paired \
--i-demultiplexed-seqs demux.qza \
--p-trim-left-f 0 \
--p-trunc-len-f 250 \
--p-trim-left-r 0 \
--p-trunc-len-r 250 \
--o-representative-sequences asv-seqs.qza \
--o-table asv-table.qza \
--o-denoising-stats stats.qza
import qiime2.plugins.dada2.actions as dada2_actions
asv_table, asv_seqs, stats = dada2_actions.denoise_paired(
demultiplexed_seqs=demux,
trim_left_f=0,
trunc_len_f=250,
trim_left_r=0,
trunc_len_r=250,
)
- Using the qiime2 dada2 denoise-paired tool:
  - Set "demultiplexed_seqs" to #: demux.qza
  - Set "trunc_len_f" to 250
  - Set "trunc_len_r" to 250
  - Expand the additional options section
    - Leave "trim_left_f" as its default value of 0
    - Leave "trim_left_r" as its default value of 0
  - Press the Execute button.
- Once completed, for each new entry in your history, use the Edit button to set the name as follows (renaming is optional, but it will make any subsequent steps easier to complete):

| History Name | "Name" to set (be sure to press [Save]) |
| --- | --- |
| #: qiime2 dada2 denoise-paired [...] : table.qza | asv-table.qza |
| #: qiime2 dada2 denoise-paired [...] : representative_sequences.qza | asv-seqs.qza |
| #: qiime2 dada2 denoise-paired [...] : denoising_stats.qza | stats.qza |
dada2_actions <- import("qiime2.plugins.dada2.actions")
action_results <- dada2_actions$denoise_paired(
demultiplexed_seqs=demux,
trim_left_f=0L,
trunc_len_f=250L,
trim_left_r=0L,
trunc_len_r=250L,
)
asv_seqs <- action_results$representative_sequences
asv_table <- action_results$table
stats <- action_results$denoising_stats
asv_seqs, asv_table, stats = use.action(
use.UsageAction(plugin_id='dada2',
action_id='denoise_paired'),
use.UsageInputs(demultiplexed_seqs=demux,
trim_left_f=0,
trunc_len_f=250,
trim_left_r=0,
trunc_len_r=250),
use.UsageOutputNames(representative_sequences='asv_seqs',
table='asv_table',
denoising_stats='stats'))
One of the outputs created by DADA2 is a summary of the denoising run. That is generated as an Artifact, so it can't be viewed directly. However, this is one of many QIIME 2 types that can be viewed as Metadata - a very powerful concept that we'll use again later in this tutorial. Learning to view artifacts as Metadata creates nearly infinite possibilities for how you can explore your microbiome data with QIIME 2.
Here, we'll again use the metadata plugin's tabulate visualizer, but this time we'll apply it to the DADA2 statistics.
qiime metadata tabulate \
--m-input-file stats.qza \
--o-visualization stats.qzv
stats_as_md_md = stats.view(Metadata)
stats_viz, = metadata_actions.tabulate(
input=stats_as_md_md,
)
- Using the qiime2 metadata tabulate tool:
  - For "input", perform the following steps:
    - Change to Metadata from Artifact
    - Set "Metadata Source" to stats.qza
  - Press the Execute button.
- Once completed, for the new entry in your history, use the Edit button to set the name as follows (renaming is optional, but it will make any subsequent steps easier to complete):

| History Name | "Name" to set (be sure to press [Save]) |
| --- | --- |
| #: qiime2 metadata tabulate [...] : visualization.qzv | stats.qzv |
stats_as_md_md <- stats$view(Metadata)
action_results <- metadata_actions$tabulate(
input=stats_as_md_md,
)
stats_viz <- action_results$visualization
stats_as_md = use.view_as_metadata('stats_as_md', stats)
use.action(
use.UsageAction(plugin_id='metadata',
action_id='tabulate'),
use.UsageInputs(input=stats_as_md),
use.UsageOutputNames(visualization='stats'))
Which three samples had the smallest percentage of input reads passing the quality filter?
Refer back to your tabulated metadata: what do you know about those samples (e.g., what values do they have in the SampleType column)?
(Hint: the sample-id is the key that connects data across these two visualizations.)
If it's annoying to switch back and forth between visualizations to answer the question in the last exercise, you can create a combined tabulation of the metadata. Try to do that by adapting the instructions in How to merge metadata.
This is also useful if you want to create a large tabular summary describing your samples following analysis, as you can include as many different metadata objects as you'd like in these summaries.
Solution to Exercise 4
qiime metadata tabulate \
--m-input-file sample-metadata.tsv stats.qza \
--o-visualization sample-metadata-w-dada2-stats.qzv
sample_metadata_and_dada2_stats_md_md = sample_metadata_md.merge(stats_as_md_md)
sample_metadata_w_dada2_stats_viz, = metadata_actions.tabulate(
input=sample_metadata_and_dada2_stats_md_md,
)
- Using the qiime2 metadata tabulate tool:
  - For "input", perform the following steps:
    - Leave as Metadata from TSV
    - Set "Metadata Source" to sample-metadata.tsv
    - Press the + Insert input button to set up the next steps.
      - Change to Metadata from Artifact
      - Set "Metadata Source" to stats.qza
  - Press the Execute button.
- Once completed, for the new entry in your history, use the Edit button to set the name as follows (renaming is optional, but it will make any subsequent steps easier to complete):

| History Name | "Name" to set (be sure to press [Save]) |
| --- | --- |
| #: qiime2 metadata tabulate [...] : visualization.qzv | sample-metadata-w-dada2-stats.qzv |
sample_metadata_and_dada2_stats_md_md <- sample_metadata_md$merge(stats_as_md_md)
action_results <- metadata_actions$tabulate(
input=sample_metadata_and_dada2_stats_md_md,
)
sample_metadata_w_dada2_stats_viz <- action_results$visualization
sample_metadata_and_dada2_stats_md = use.merge_metadata('sample_metadata_and_dada2_stats_md', sample_metadata, stats_as_md)
use.action(
use.UsageAction(plugin_id='metadata',
action_id='tabulate'),
use.UsageInputs(input=sample_metadata_and_dada2_stats_md),
use.UsageOutputNames(visualization='sample_metadata_w_dada2_stats'))
Feature table and feature data summaries¶
After DADA2 completes, you'll want to explore the resulting data.
You can do this using the following two commands, which will create visual summaries of the data.
The feature-table summarize-plus action will give you information on how many sequences are associated with each sample and with each feature, histograms of those distributions, and some related summary statistics.
qiime feature-table summarize-plus \
--i-table asv-table.qza \
--m-metadata-file sample-metadata.tsv \
--o-summary asv-table.qzv \
--o-sample-frequencies sample-frequencies.qza \
--o-feature-frequencies asv-frequencies.qza
import qiime2.plugins.feature_table.actions as feature_table_actions
asv_frequencies, sample_frequencies, asv_table_viz = feature_table_actions.summarize_plus(
table=asv_table,
metadata=sample_metadata_md,
)
- Using the qiime2 feature-table summarize-plus tool:
  - Set "table" to #: asv-table.qza
  - Expand the additional options section
    - For "metadata", press the + Insert metadata button and then:
      - Leave as Metadata from TSV
      - Set "Metadata Source" to sample-metadata.tsv
  - Press the Execute button.
- Once completed, for each new entry in your history, use the Edit button to set the name as follows (renaming is optional, but it will make any subsequent steps easier to complete):

| History Name | "Name" to set (be sure to press [Save]) |
| --- | --- |
| #: qiime2 feature-table summarize-plus [...] : feature_frequencies.qza | asv-frequencies.qza |
| #: qiime2 feature-table summarize-plus [...] : sample_frequencies.qza | sample-frequencies.qza |
| #: qiime2 feature-table summarize-plus [...] : summary.qzv | asv-table.qzv |
feature_table_actions <- import("qiime2.plugins.feature_table.actions")
action_results <- feature_table_actions$summarize_plus(
table=asv_table,
metadata=sample_metadata_md,
)
asv_table_viz <- action_results$summary
sample_frequencies <- action_results$sample_frequencies
asv_frequencies <- action_results$feature_frequencies
_, _, asv_frequencies = use.action(
use.UsageAction(plugin_id='feature_table',
action_id='summarize_plus'),
use.UsageInputs(table=asv_table,
metadata=sample_metadata),
use.UsageOutputNames(summary='asv_table',
sample_frequencies='sample_frequencies',
feature_frequencies='asv_frequencies'))
What is the total number of sequences represented in the feature table? What is the identifier of the feature that is observed most frequently (i.e., has the highest frequency)?
The feature-table tabulate-seqs action will provide a mapping of feature IDs to sequences, and provide links to easily BLAST each sequence against the NCBI nt database.
We can also include the feature frequency information in this visualization by passing it as metadata, similar to how we merged metadata in the exercise above.
In this case, however, we're looking at feature metadata, as opposed to sample metadata.
As far as QIIME 2 is concerned, there is no difference between these two - in our case, it'll only be the identifiers that differ.
This visualization will be very useful later in the tutorial, when you want to learn more about specific features that are important in the data set.
qiime feature-table tabulate-seqs \
--i-data asv-seqs.qza \
--m-metadata-file asv-frequencies.qza \
--o-visualization asv-seqs.qzv
asv_frequencies_md_md = asv_frequencies.view(Metadata)
asv_seqs_viz, = feature_table_actions.tabulate_seqs(
data=asv_seqs,
metadata=asv_frequencies_md_md,
)
- Using the qiime2 feature-table tabulate-seqs tool:
  - Set "data" to #: asv-seqs.qza
  - Expand the additional options section
    - For "metadata", press the + Insert metadata button and then:
      - Change to Metadata from Artifact
      - Set "Metadata Source" to asv-frequencies.qza
  - Press the Execute button.
- Once completed, for the new entry in your history, use the Edit button to set the name as follows (renaming is optional, but it will make any subsequent steps easier to complete):

| History Name | "Name" to set (be sure to press [Save]) |
| --- | --- |
| #: qiime2 feature-table tabulate-seqs [...] : visualization.qzv | asv-seqs.qzv |
asv_frequencies_md_md <- asv_frequencies$view(Metadata)
action_results <- feature_table_actions$tabulate_seqs(
data=asv_seqs,
metadata=asv_frequencies_md_md,
)
asv_seqs_viz <- action_results$visualization
asv_frequencies_as_md = use.view_as_metadata('asv_frequencies_md',
asv_frequencies)
use.action(
use.UsageAction(plugin_id='feature_table',
action_id='tabulate_seqs'),
use.UsageInputs(data=asv_seqs,
metadata=asv_frequencies_as_md),
use.UsageOutputNames(visualization='asv_seqs'),
)
What is the taxonomy associated with the most frequently observed feature, based on a BLAST search?
Filtering features from a feature table¶
If you review the tabulated feature sequences, or the feature detail table of the feature table summary, you'll notice that there are many sequences that are observed in only a single sample. Let's filter those out to reduce the number of sequences we're working with - this will speed up several slower steps that are coming up.
This is a two-step process. First we filter our feature table, and then we use the new feature table to filter our sequences to only the ones that are contained in the new table.
I include _ms2 in the new output names here to remind us that these are filtered to features observed in a minimum of 2 samples.
Generally speaking, file names are a convenient place to store information like this, but they're unreliable.
File names can easily be changed, and therefore could be modified to contain inaccurate information about the data they contain.
Luckily, QIIME 2's provenance tracking system records all of the information that we need about how results were generated. We're therefore free to include information like this in file names if it's helpful for us, but we shouldn't ever rely on the file names. If in doubt - and always be in doubt 🕵️‍♀️ - refer to the data provenance.
QIIME 2's data provenance is your source for the truth about how a result was created.
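One way to see why provenance is more trustworthy than a file name: QIIME 2 results are zip archives that carry a provenance/ directory inside them, so renaming the file changes nothing about the recorded history. The sketch below builds a mock archive with that layout purely for illustration (a real .qza names its top-level directory after the artifact's UUID and contains much more) and shows the provenance surviving a rename.

```python
import os
import zipfile

def provenance_entries(path):
    """List archive members under a provenance/ directory."""
    with zipfile.ZipFile(path) as zf:
        return [n for n in zf.namelist() if "/provenance/" in n]

# Mock artifact for illustration only -- a real .qza uses the artifact's
# UUID as the top-level directory name.
with zipfile.ZipFile("mock.qza", "w") as zf:
    zf.writestr("1234/provenance/action/action.yaml", "action: {}\n")
    zf.writestr("1234/data/feature-table.biom", "")

os.rename("mock.qza", "misleading-name.qza")  # the file name changes...
print(provenance_entries("misleading-name.qza"))  # ...the provenance doesn't
```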
qiime feature-table filter-features \
--i-table asv-table.qza \
--p-min-samples 2 \
--o-filtered-table asv-table-ms2.qza
asv_table_ms2, = feature_table_actions.filter_features(
table=asv_table,
min_samples=2,
)
- Using the qiime2 feature-table filter-features tool:
  - Set "table" to #: asv-table.qza
  - Expand the additional options section
    - Set "min_samples" to 2
  - Press the Execute button.
- Once completed, for the new entry in your history, use the Edit button to set the name as follows (renaming is optional, but it will make any subsequent steps easier to complete):

| History Name | "Name" to set (be sure to press [Save]) |
| --- | --- |
| #: qiime2 feature-table filter-features [...] : filtered_table.qza | asv-table-ms2.qza |
action_results <- feature_table_actions$filter_features(
table=asv_table,
min_samples=2L,
)
asv_table_ms2 <- action_results$filtered_table
asv_table_ms2, = use.action(
use.UsageAction(plugin_id='feature_table',
action_id='filter_features'),
use.UsageInputs(table=asv_table,
min_samples=2),
use.UsageOutputNames(filtered_table='asv_table_ms2'),
)
qiime feature-table filter-seqs \
--i-data asv-seqs.qza \
--i-table asv-table-ms2.qza \
--o-filtered-data asv-seqs-ms2.qza
asv_seqs_ms2, = feature_table_actions.filter_seqs(
data=asv_seqs,
table=asv_table_ms2,
)
- Using the qiime2 feature-table filter-seqs tool:
  - Set "data" to #: asv-seqs.qza
  - Expand the additional options section
    - Set "table" to #: asv-table-ms2.qza
  - Press the Execute button.
- Once completed, for the new entry in your history, use the Edit button to set the name as follows (renaming is optional, but it will make any subsequent steps easier to complete):

| History Name | "Name" to set (be sure to press [Save]) |
| --- | --- |
| #: qiime2 feature-table filter-seqs [...] : filtered_data.qza | asv-seqs-ms2.qza |
action_results <- feature_table_actions$filter_seqs(
data=asv_seqs,
table=asv_table_ms2,
)
asv_seqs_ms2 <- action_results$filtered_data
asv_seqs_ms2, = use.action(
use.UsageAction(plugin_id='feature_table',
action_id='filter_seqs'),
use.UsageInputs(data=asv_seqs,
table=asv_table_ms2),
use.UsageOutputNames(filtered_data='asv_seqs_ms2'),
)
Now that you have a second (filtered) feature table, create your own command to summarize it, like we did for the original feature table. How many features did we lose as a result of this filter? How many total sequences did we lose?
Solution to Exercise 7
Here's the command you would use:
qiime feature-table summarize-plus \
--i-table asv-table-ms2.qza \
--m-metadata-file sample-metadata.tsv \
--o-summary asv-table-ms2.qzv \
--o-sample-frequencies sample-frequencies-ms2.qza \
--o-feature-frequencies asv-frequencies-ms2.qza
asv_frequencies_ms2, sample_frequencies_ms2, asv_table_ms2_viz = feature_table_actions.summarize_plus(
table=asv_table_ms2,
metadata=sample_metadata_md,
)
- Using the qiime2 feature-table summarize-plus tool:
  - Set "table" to #: asv-table-ms2.qza
  - Expand the additional options section
    - For "metadata", press the + Insert metadata button and then:
      - Leave as Metadata from TSV
      - Set "Metadata Source" to sample-metadata.tsv
  - Press the Execute button.
- Once completed, for each new entry in your history, use the Edit button to set the name as follows (renaming is optional, but it will make any subsequent steps easier to complete):

| History Name | "Name" to set (be sure to press [Save]) |
| --- | --- |
| #: qiime2 feature-table summarize-plus [...] : feature_frequencies.qza | asv-frequencies-ms2.qza |
| #: qiime2 feature-table summarize-plus [...] : sample_frequencies.qza | sample-frequencies-ms2.qza |
| #: qiime2 feature-table summarize-plus [...] : summary.qzv | asv-table-ms2.qzv |
action_results <- feature_table_actions$summarize_plus(
table=asv_table_ms2,
metadata=sample_metadata_md,
)
asv_table_ms2_viz <- action_results$summary
sample_frequencies_ms2 <- action_results$sample_frequencies
asv_frequencies_ms2 <- action_results$feature_frequencies
_, _, asv_frequencies_ms2 = use.action(
use.UsageAction(plugin_id='feature_table',
action_id='summarize_plus'),
use.UsageInputs(table=asv_table_ms2,
metadata=sample_metadata),
use.UsageOutputNames(summary='asv_table_ms2',
sample_frequencies='sample_frequencies_ms2',
feature_frequencies='asv_frequencies_ms2'))
Be sure to run this, as we're going to use one of the results below.
Taxonomic annotation¶
Before we complete our upstream analysis steps, we'll generate taxonomic annotations for our sequences using the feature-classifier plugin.
The taxonomic classifier used here is very specific to the sample preparation and sequencing protocols used for this study, and Greengenes 13_8, which it is trained on, is an outdated reference database. The reason we use it here is that the reference data is relatively small, so classification can be run on most modern computers with this classifier[1].
When you're ready to work on your own data, one of the choices you'll need to make is which classifier to use for your data. You can find pre-trained classifiers on the QIIME 2 Library Resources page, and lots of discussion of this topic on the Forum. We strongly recommend the use of environment-weighted classifiers, as described in Kaehler et al. (2019), and you can find builds of these on the Resources page. If you don't find a classifier that will work for you, you can learn how to train your own.
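For some intuition about what these classifiers do: the standard pre-trained classifiers are naive Bayes models whose features are counts of overlapping k-mers (k=7 for the standard builds) in each sequence. The sketch below shows just that feature extraction step on an arbitrary example sequence; it's a conceptual illustration, not part of the QIIME 2 API.

```python
from collections import Counter

def kmer_counts(seq, k=7):
    """Count overlapping k-mers in a sequence -- the kind of features a
    naive Bayes taxonomy classifier is trained on."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

seq = "GTGCCAGCAGCCGCGGTAA"  # arbitrary 19-base example
counts = kmer_counts(seq)
print(sum(counts.values()))  # 13 overlapping 7-mers in a 19-base sequence
```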
First, we'll download a pre-trained classifier artifact.
wget -O 'suboptimal-16S-rRNA-classifier.qza' \
'https://gut-to-soil-tutorial.readthedocs.io/en/latest/data/gut-to-soil/suboptimal-16S-rRNA-classifier.qza'
url = 'https://gut-to-soil-tutorial.readthedocs.io/en/latest/data/gut-to-soil/suboptimal-16S-rRNA-classifier.qza'
fn = 'suboptimal-16S-rRNA-classifier.qza'
request.urlretrieve(url, fn)
suboptimal_16S_rRNA_classifier = Artifact.load(fn)
- Using the Upload Data tool:
  - On the first tab (Regular), press the Paste/Fetch data button at the bottom.
    - Set "Name" (first text-field) to: suboptimal-16S-rRNA-classifier.qza
    - In the larger text-area, copy-and-paste: https://gut-to-soil-tutorial.readthedocs.io/en/latest/data/gut-to-soil/suboptimal-16S-rRNA-classifier.qza
    - ("Type", "Genome", and "Settings" can be ignored)
  - Press the Start button at the bottom.
url <- 'https://gut-to-soil-tutorial.readthedocs.io/en/latest/data/gut-to-soil/suboptimal-16S-rRNA-classifier.qza'
fn <- 'suboptimal-16S-rRNA-classifier.qza'
request$urlretrieve(url, fn)
suboptimal_16S_rRNA_classifier <- Artifact$load(fn)
def classifier_factory():
from urllib import request
from qiime2 import Artifact
fp, _ = request.urlretrieve(
'https://data.qiime2.org/classifiers/sklearn-1.4.2/greengenes/gg-13-8-99-515-806-nb-classifier.qza')
return Artifact.load(fp)
classifier = use.init_artifact('suboptimal-16S-rRNA-classifier', classifier_factory)
Then, we'll apply it to our sequences using classify-sklearn.
qiime feature-classifier classify-sklearn \
--i-classifier suboptimal-16S-rRNA-classifier.qza \
--i-reads asv-seqs-ms2.qza \
--o-classification taxonomy.qza
import qiime2.plugins.feature_classifier.actions as feature_classifier_actions
taxonomy, = feature_classifier_actions.classify_sklearn(
classifier=suboptimal_16S_rRNA_classifier,
reads=asv_seqs_ms2,
)
- Using the qiime2 feature-classifier classify-sklearn tool:
  - Set "reads" to #: asv-seqs-ms2.qza
  - Set "classifier" to #: suboptimal-16S-rRNA-classifier.qza
  - Press the Execute button.
- Once completed, for the new entry in your history, use the Edit button to set the name as follows (renaming is optional, but it will make any subsequent steps easier to complete):

| History Name | "Name" to set (be sure to press [Save]) |
| --- | --- |
| #: qiime2 feature-classifier classify-sklearn [...] : classification.qza | taxonomy.qza |
feature_classifier_actions <- import("qiime2.plugins.feature_classifier.actions")
action_results <- feature_classifier_actions$classify_sklearn(
classifier=suboptimal_16S_rRNA_classifier,
reads=asv_seqs_ms2,
)
taxonomy <- action_results$classification
taxonomy, = use.action(
use.UsageAction(plugin_id='feature_classifier',
action_id='classify_sklearn'),
use.UsageInputs(classifier=classifier,
reads=asv_seqs_ms2),
use.UsageOutputNames(classification='taxonomy'))
Then, to get an initial look at our taxonomic classifications, let's integrate taxonomy into the sequence summary, like the one we generated above.
If you want to compare taxonomic annotations achieved with different classifiers, you can do that with the feature-table tabulate-seqs action by passing in multiple FeatureData[Taxonomy] artifacts. See an example of what that result might look like here.
While you have that visualization loaded, take a look at the data provenance. The complexity of that data provenance should give you an idea of why it's helpful to have the computer record all of this information, rather than trying to embed it all in file names or keep track of it in your written notes.
What was used as the DADA2 trim and trunc parameters for the data leading to this visualization? (Hint: use the provenance search feature).
## constructing result collection ##
rc_name=taxonomy-collection/
ext=.qza
keys=( Greengenes-13-8 )
names=( taxonomy.qza )
construct_result_collection
##
qiime feature-table tabulate-seqs \
--i-data asv-seqs-ms2.qza \
--i-taxonomy taxonomy-collection/ \
--m-metadata-file asv-frequencies-ms2.qza \
--o-visualization asv-seqs-ms2.qzv
from qiime2 import ResultCollection
asv_frequencies_md = asv_frequencies_ms2.view(Metadata)
taxonomy_collection_artifact_collection = ResultCollection({
'Greengenes-13-8': taxonomy,
})
asv_seqs_ms2_viz, = feature_table_actions.tabulate_seqs(
data=asv_seqs_ms2,
taxonomy=taxonomy_collection_artifact_collection,
metadata=asv_frequencies_md,
)
- Using the `qiime2 feature-table tabulate-seqs` tool:
  - Set "data" to `#: asv-seqs-ms2.qza`
  - Expand the `additional options` section
    - Set "taxonomy" to `#: taxonomy-collection/`
    - For "metadata":
      - Press the `+ Insert metadata` button to set up the next steps.
      - Change to `Metadata from Artifact`
      - Set "Metadata Source" to `asv-frequencies-ms2.qza`
  - Press the `Execute` button.
- Once completed, for the new entry in your history, use the `Edit` button to set the name as follows:
  (Renaming is optional, but it will make any subsequent steps easier to complete.)

| History Name | "Name" to set (be sure to press [Save]) |
| --- | --- |
| `#: qiime2 feature-table tabulate-seqs [...] : visualization.qzv` | `asv-seqs-ms2.qzv` |
ResultCollection <- import("qiime2")$ResultCollection
asv_frequencies_md <- asv_frequencies_ms2$view(Metadata)
taxonomy_collection_artifact_collection <- ResultCollection(dict(
  'Greengenes-13-8' = taxonomy
))
action_results <- feature_table_actions$tabulate_seqs(
data=asv_seqs_ms2,
taxonomy=taxonomy_collection_artifact_collection,
metadata=asv_frequencies_md,
)
asv_seqs_ms2_viz <- action_results$visualization
asv_frequencies_ms2_as_md = use.view_as_metadata('asv_frequencies',
asv_frequencies_ms2)
taxonomy_collection = use.construct_artifact_collection(
'taxonomy_collection', {'Greengenes-13-8': taxonomy}
)
use.action(
use.UsageAction(plugin_id='feature_table',
action_id='tabulate_seqs'),
use.UsageInputs(data=asv_seqs_ms2,
taxonomy=taxonomy_collection,
metadata=asv_frequencies_ms2_as_md),
use.UsageOutputNames(visualization='asv_seqs_ms2'),
)
What is the taxonomy associated with the most frequently observed feature, based on a BLAST search? How does that compare to the taxonomy assigned by our feature classifier?
In general, we tend to trust the results of our feature classifier over those generated by BLAST, though the BLAST results are good for a quick look or a sanity check. The reason for this is that the reference databases used with our feature classifiers tend to be specifically designed for this purpose and in some cases all of the sequences included are vetted. The BLAST databases can contain mis-annotations that may negatively impact the classifications.
Recall that our asv-seqs.qzv
visualization allows you to easily BLAST the sequence associated with each feature against the NCBI nt database.
Using that visualization and the taxonomy.qzv
visualization created here, compare the taxonomic assignments with the taxonomy of the best BLAST hit for a few features.
How similar are the assignments?
If they're dissimilar, at what taxonomic level do they begin to differ (e.g., species, genus, family, ...)?
Building a tree for phylogenetic diversity calculations¶
QIIME 2 supports several phylogenetic diversity metrics, including Faith's Phylogenetic Diversity and weighted and unweighted UniFrac.
In addition to counts of features per sample (i.e., the data in the FeatureTable[Frequency]
QIIME 2 artifact), these metrics require a rooted phylogenetic tree relating the features to one another.
The amplicon distribution offers a few ways to build these trees, including a reference-based approach in the fragment-insertion plugin and de novo (i.e., reference-free) approaches in the phylogeny plugin.
The reference-based approach, by default, is specific to 16S rRNA marker gene analysis. We could use that here, but the runtime is too long for our documentation.[1] If you'd like to see this demonstrated, you can refer to the Parkinson's Mouse tutorial.
The de novo approach is known to generate low quality trees, but can be used with any marker gene (not just 16S). If you'd like to see this demonstrated, you can refer to the Moving Pictures tutorial.
For those reasons, we're going to skip building phylogenetic trees and instead use an analog of phylogenetic diversity metrics here.
Downstream data analysis¶
As mentioned above, we tend to think of "downstream" analysis as beginning with a feature table, taxonomic annotation of our features, and optionally a phylogenetic tree. Now that we have those (with the exception of the tree, which we won't use here), let's jump in. This is where it starts to get fun!
Kmerization of our features¶
We'll start here by calculating alpha and beta diversity metrics. To do this, in lieu of a phylogenetic tree, we're going to use a stand-alone QIIME 2 plugin, q2-kmerizer. This plugin splits ASV sequences into their constituent kmers, and creates a new feature table where those kmers (instead of ASVs) are the features. The q2-kmerizer paper showed that this enables non-phylogenetic diversity metrics to achieve results highly correlated with those achieved by phylogenetic diversity metrics.
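To build intuition for what q2-kmerizer is doing, here's a minimal pure-Python sketch of kmerization (an illustration of the idea only, not the plugin's implementation; the sequences, counts, and kmer size below are made up):

```python
# Sketch: split each ASV sequence into overlapping kmers, then sum the
# ASV's per-sample counts into the counts of each kmer it contains.
from collections import Counter

def kmerize(sequence, k=16):
    """Return the kmers (substrings of length k) in a sequence."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

def kmer_counts(asv_counts, asv_seqs, k=4):
    """asv_counts: {asv_id: count in one sample}; asv_seqs: {asv_id: sequence}."""
    counts = Counter()
    for asv_id, count in asv_counts.items():
        for kmer in kmerize(asv_seqs[asv_id], k=k):
            counts[kmer] += count
    return counts

seqs = {'asv1': 'ACGTACGT', 'asv2': 'ACGTTTTT'}
table_row = {'asv1': 10, 'asv2': 5}
print(kmer_counts(table_row, seqs, k=4))
```

Two ASVs that share kmers contribute counts to the same kmer features, which is why kmer tables behave somewhat like phylogenetically-aware tables: similar sequences end up with overlapping feature sets.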
Let's generate our kmer feature table.
qiime kmerizer seqs-to-kmers \
--i-table asv-table-ms2.qza \
--i-sequences asv-seqs-ms2.qza \
--o-kmer-table kmer-table.qza
import qiime2.plugins.kmerizer.actions as kmerizer_actions
kmer_table, = kmerizer_actions.seqs_to_kmers(
table=asv_table_ms2,
sequences=asv_seqs_ms2,
)
- Using the `qiime2 kmerizer seqs-to-kmers` tool:
  - Set "sequences" to `#: asv-seqs-ms2.qza`
  - Set "table" to `#: asv-table-ms2.qza`
  - Press the `Execute` button.
kmerizer_actions <- import("qiime2.plugins.kmerizer.actions")
action_results <- kmerizer_actions$seqs_to_kmers(
table=asv_table_ms2,
sequences=asv_seqs_ms2,
)
kmer_table <- action_results$kmer_table
kmer_table, = use.action(
use.UsageAction(plugin_id='kmerizer',
action_id='seqs_to_kmers'),
use.UsageInputs(table=asv_table_ms2,
sequences=asv_seqs_ms2),
use.UsageOutputNames(kmer_table='kmer_table')
)
Let's also generate a summary of the feature table.
qiime feature-table summarize-plus \
--i-table kmer-table.qza \
--m-metadata-file sample-metadata.tsv \
--o-summary kmer-table.qzv \
--o-sample-frequencies kmer-sample-frequencies.qza \
--o-feature-frequencies kmer-feature-frequencies.qza
kmer_table_viz, kmer_sample_frequencies, kmer_feature_frequencies = feature_table_actions.summarize_plus(
    table=kmer_table,
    metadata=sample_metadata_md,
)
- Using the `qiime2 feature-table summarize-plus` tool:
  - Set "table" to `#: kmer-table.qza`
  - Expand the `additional options` section
    - For "metadata":
      - Press the `+ Insert metadata` button to set up the next steps.
      - Leave as `Metadata from TSV`
      - Set "Metadata Source" to `sample-metadata.tsv`
  - Press the `Execute` button.
- Once completed, for each new entry in your history, use the `Edit` button to set the name as follows:
  (Renaming is optional, but it will make any subsequent steps easier to complete.)

| History Name | "Name" to set (be sure to press [Save]) |
| --- | --- |
| `#: qiime2 feature-table summarize-plus [...] : feature_frequencies.qza` | `kmer-feature-frequencies.qza` |
| `#: qiime2 feature-table summarize-plus [...] : sample_frequencies.qza` | `kmer-sample-frequencies.qza` |
| `#: qiime2 feature-table summarize-plus [...] : summary.qzv` | `kmer-table.qzv` |
action_results <- feature_table_actions$summarize_plus(
table=kmer_table,
metadata=sample_metadata_md,
)
kmer_table_viz <- action_results$summary
kmer_sample_frequencies <- action_results$sample_frequencies
kmer_feature_frequencies <- action_results$feature_frequencies
use.action(
use.UsageAction(plugin_id='feature_table',
action_id='summarize_plus'),
use.UsageInputs(table=kmer_table,
metadata=sample_metadata),
use.UsageOutputNames(summary='kmer_table',
sample_frequencies='kmer_sample_frequencies',
feature_frequencies='kmer_feature_frequencies')
)
What is the median number of kmers associated with each sample? What is the most frequently occurring kmer in this table? How long are the kmers, and how could you change that if you wanted to?
Computing bootstrapped alpha and beta diversity metrics¶
QIIME 2's diversity analyses are available through the diversity plugin, which supports computing alpha and beta diversity metrics, applying related statistical tests, and generating interactive visualizations. A relatively new stand-alone plugin, q2-boots, mirrors the interface of the diversity metric calculation actions in the diversity plugin, but generates more robust results because it integrates rarefaction and/or bootstrapping. Let's use that here, instead of the diversity plugin.
Weâll apply the core-metrics
action, which bootstraps a FeatureTable[Frequency]
(i.e., samples with replacement) to a user-specified sampling depth n
times, computes several alpha and beta diversity metrics on each of the n
bootstrapped feature tables, and then creates averaged alpha and beta diversity data artifacts as output.
Those resulting artifacts can be used anywhere that the corresponding artifacts from the diversity plugin could be used.
The core-metrics
action also generates principal coordinates analysis (PCoA) plots using the Emperor plugin for each of the beta diversity metrics.
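To make the bootstrapping procedure concrete, here's a minimal sketch of the idea (an illustration only, not q2-boots code; the feature counts, depth, and seed below are made up):

```python
# Sketch: resample each sample's counts to an even depth with replacement
# n times, compute a metric on each resampled profile, and report an
# average (here, the median) across iterations.
import random
import statistics

def resample_with_replacement(counts, depth, rng):
    """counts: {feature: count}. Draw `depth` observations with replacement."""
    features = list(counts)
    weights = [counts[f] for f in features]
    draws = rng.choices(features, weights=weights, k=depth)
    resampled = {}
    for f in draws:
        resampled[f] = resampled.get(f, 0) + 1
    return resampled

def bootstrapped_observed_features(counts, depth, n=10, seed=42):
    """Median 'observed features' across n bootstrap resamplings."""
    rng = random.Random(seed)
    values = [len(resample_with_replacement(counts, depth, rng))
              for _ in range(n)]
    return statistics.median(values)

sample = {'f1': 500, 'f2': 300, 'f3': 5, 'f4': 1}
print(bootstrapped_observed_features(sample, depth=100, n=10))
```

Averaging over repeated resamplings is what makes the result more robust than a single rarefaction: rare features that happen to survive (or vanish in) one random draw contribute proportionally rather than all-or-nothing.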
The metrics computed by default are:
- Alpha diversity
- Shannon's diversity index (a quantitative measure of community richness)
- Observed Features (a qualitative measure of community richness)
- Evenness (or Pielou's Evenness; a measure of community evenness)
- Beta diversity
- Jaccard distance (a qualitative measure of community dissimilarity)
- Bray-Curtis distance (a quantitative measure of community dissimilarity)
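If you'd like to see what these metrics actually compute, here are hand-rolled versions of them (a sketch for intuition only; QIIME 2's implementations come from scikit-bio):

```python
# Qualitative metrics use only presence/absence; quantitative metrics
# also use the counts.
import math

def shannon(counts):                      # quantitative alpha diversity
    total = sum(counts)
    ps = [c / total for c in counts if c > 0]
    return -sum(p * math.log2(p) for p in ps)

def observed_features(counts):            # qualitative alpha diversity
    return sum(1 for c in counts if c > 0)

def pielou_evenness(counts):              # Shannon scaled by its maximum
    s = observed_features(counts)
    return shannon(counts) / math.log2(s) if s > 1 else None

def jaccard(a, b):                        # qualitative beta diversity
    sa = {i for i, c in enumerate(a) if c > 0}
    sb = {i for i, c in enumerate(b) if c > 0}
    return 1 - len(sa & sb) / len(sa | sb)

def bray_curtis(a, b):                    # quantitative beta diversity
    num = sum(abs(x - y) for x, y in zip(a, b))
    den = sum(x + y for x, y in zip(a, b))
    return num / den

a, b = [10, 10, 0, 5], [0, 10, 10, 5]
print(jaccard(a, b), bray_curtis(a, b))
```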
You can find additional information on these and other metrics available in QIIME 2 in this excellent forum post by a QIIME 2 community member. When you're ready, we'd love to have your contributions on the Forum as well!
An important parameter that needs to be provided to this script is sampling-depth
, which is the even sampling (i.e., bootstrapping or rarefaction) depth.
Because most diversity metrics are sensitive to different sampling depths across different samples, the tables are randomly subsampled such that the total frequency for each sample is the user-specified sampling depth.
For example, if you set sampling-depth=500
, this step will subsample the counts in each sample so that each sample in the resulting table has a total frequency of 500.
If the total frequency for any sample is smaller than this value, that sample will be dropped from the diversity analysis.
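Mechanically, that subsampling step looks something like the following (a pure-Python sketch of rarefaction without replacement, not the actual implementation; the counts and depth are made up):

```python
# Sketch: subsample a sample's counts without replacement so it totals
# exactly `depth`, or drop the sample if it is too shallow.
import random

def rarefy(counts, depth, seed=0):
    """counts: {feature: count}. Returns None if the sample is too shallow."""
    total = sum(counts.values())
    if total < depth:
        return None  # sample is dropped from the diversity analysis
    # Expand to one entry per observation, then draw `depth` of them.
    pool = [f for f, c in counts.items() for _ in range(c)]
    rng = random.Random(seed)
    kept = rng.sample(pool, depth)
    rarefied = {}
    for f in kept:
        rarefied[f] = rarefied.get(f, 0) + 1
    return rarefied

deep = {'f1': 400, 'f2': 200}
shallow = {'f1': 300}
print(sum(rarefy(deep, 500).values()))  # every retained sample totals 500
print(rarefy(shallow, 500))             # None: dropped at this depth
```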
Choosing this value is tricky.
We recommend making your choice by reviewing the information presented in the kmer-table.qzv
file that was created above.
View the kmer-table.qzv
QIIME 2 visualization, and in particular the Interactive Sample Detail tab in that visualization.
What value would you choose to pass for sampling-depth
?
How many samples will be excluded from your analysis based on this choice?
How many total sequences will you be analyzing in the core-metrics
command?
I'm going to choose a value that is around the first quartile of the sample total frequencies.
qiime boots core-metrics \
--i-table kmer-table.qza \
--m-metadata-file sample-metadata.tsv \
--p-sampling-depth 22000 \
--p-n 10 \
--p-replacement \
--p-alpha-average-method median \
--p-beta-average-method medoid \
--output-dir boots-core-metrics
## accessing result collection member ##
ln -s boots-core-metrics/distance_matrices/jaccard_distance_matrix.qza unweighted-dm.qza
##
## accessing result collection member ##
ln -s boots-core-metrics/pcoas/jaccard.qza unweighted-pcoa.qza
##
## accessing result collection member ##
ln -s boots-core-metrics/distance_matrices/braycurtis_distance_matrix.qza weighted-dm.qza
##
## accessing result collection member ##
ln -s boots-core-metrics/pcoas/braycurtis.qza weighted-pcoa.qza
##
## accessing result collection member ##
ln -s boots-core-metrics/alpha_diversities/observed_features.qza richness-vector.qza
##
## accessing result collection member ##
ln -s boots-core-metrics/alpha_diversities/evenness.qza evenness-vector.qza
##
import qiime2.plugins.boots.actions as boots_actions
action_results = boots_actions.core_metrics(
table=kmer_table,
metadata=sample_metadata_md,
sampling_depth=22000,
n=10,
replacement=True,
alpha_average_method='median',
beta_average_method='medoid',
)
bootstrap_tables_artifact_collection = action_results.resampled_tables
bootstrap_alpha_diversities_artifact_collection = action_results.alpha_diversities
bootstrap_distance_matrices_artifact_collection = action_results.distance_matrices
bootstrap_pcoas_artifact_collection = action_results.pcoas
bootstrap_emperor_plots_viz_collection = action_results.emperor_plots
unweighted_dm = bootstrap_distance_matrices_artifact_collection['jaccard_distance_matrix']
unweighted_pcoa = bootstrap_pcoas_artifact_collection['jaccard']
weighted_dm = bootstrap_distance_matrices_artifact_collection['braycurtis_distance_matrix']
weighted_pcoa = bootstrap_pcoas_artifact_collection['braycurtis']
richness_vector = bootstrap_alpha_diversities_artifact_collection['observed_features']
evenness_vector = bootstrap_alpha_diversities_artifact_collection['evenness']
- Using the `qiime2 boots core-metrics` tool:
  - Set "table" to `#: kmer-table.qza`
  - Set "sampling_depth" to `22000`
  - For "metadata":
    - Perform the following steps:
      - Leave as `Metadata from TSV`
      - Set "Metadata Source" to `sample-metadata.tsv`
  - Set "n" to `10`
  - Set "replacement" to `Yes`
  - Expand the `additional options` section
    - Leave "alpha_average_method" as its default value of `median`
    - Set "beta_average_method" to `medoid`
  - Press the `Execute` button.
- Once completed, for each new entry in your history, use the `Edit` button to set the name as follows:
  (Renaming is optional, but it will make any subsequent steps easier to complete.)

| History Name | "Name" to set (be sure to press [Save]) |
| --- | --- |
| `#: qiime2 boots core-metrics [...] : resampled_tables.qza` | `bootstrap-tables/` |
| `#: qiime2 boots core-metrics [...] : alpha_diversities.qza` | `bootstrap-alpha-diversities/` |
| `#: qiime2 boots core-metrics [...] : distance_matrices.qza` | `bootstrap-distance-matrices/` |
| `#: qiime2 boots core-metrics [...] : pcoas.qza` | `bootstrap-pcoas/` |
| `#: qiime2 boots core-metrics [...] : emperor_plots.qza` | `bootstrap-emperor-plots/` |
boots_actions <- import("qiime2.plugins.boots.actions")
action_results <- boots_actions$core_metrics(
table=kmer_table,
metadata=sample_metadata_md,
sampling_depth=22000L,
n=10L,
replacement=TRUE,
alpha_average_method='median',
beta_average_method='medoid',
)
bootstrap_tables_artifact_collection <- action_results$resampled_tables
bootstrap_alpha_diversities_artifact_collection <- action_results$alpha_diversities
bootstrap_distance_matrices_artifact_collection <- action_results$distance_matrices
bootstrap_pcoas_artifact_collection <- action_results$pcoas
bootstrap_emperor_plots_viz_collection <- action_results$emperor_plots
unweighted_dm <- bootstrap_distance_matrices_artifact_collection['jaccard_distance_matrix']
unweighted_pcoa <- bootstrap_pcoas_artifact_collection['jaccard']
weighted_dm <- bootstrap_distance_matrices_artifact_collection['braycurtis_distance_matrix']
weighted_pcoa <- bootstrap_pcoas_artifact_collection['braycurtis']
richness_vector <- bootstrap_alpha_diversities_artifact_collection['observed_features']
evenness_vector <- bootstrap_alpha_diversities_artifact_collection['evenness']
core_metrics = use.action(
use.UsageAction(plugin_id='boots',
action_id='core_metrics'),
use.UsageInputs(table=kmer_table,
metadata=sample_metadata,
sampling_depth=22000,
n=10,
replacement=True,
alpha_average_method='median',
beta_average_method='medoid'),
use.UsageOutputNames(
resampled_tables='bootstrap_tables',
alpha_diversities='bootstrap_alpha_diversities',
distance_matrices='bootstrap_distance_matrices',
pcoas='bootstrap_pcoas',
emperor_plots='bootstrap_emperor_plots')
)
unweighted_dm = use.get_artifact_collection_member(
'unweighted_dm', core_metrics.distance_matrices, 'jaccard_distance_matrix')
unweighted_pcoa = use.get_artifact_collection_member(
'unweighted_pcoa', core_metrics.pcoas, 'jaccard')
weighted_dm = use.get_artifact_collection_member(
'weighted_dm', core_metrics.distance_matrices, 'braycurtis_distance_matrix')
weighted_pcoa = use.get_artifact_collection_member(
'weighted_pcoa', core_metrics.pcoas, 'braycurtis')
richness_vector = use.get_artifact_collection_member(
'richness_vector', core_metrics.alpha_diversities, 'observed_features')
evenness_vector = use.get_artifact_collection_member(
'evenness_vector', core_metrics.alpha_diversities, 'evenness')
In many Illumina runs youâll observe a few samples that have very low sequence counts. You will typically want to exclude those from the analysis by choosing a larger value for the sampling depth at this stage.
After computing diversity metrics, we can begin to explore the microbial composition of the samples in the context of the sample metadata. You can review the sample metadata using one of the tabulated views of this file that we created above.
Open one of the Emperor plots that was generated by the previous command, and experiment with the options for coloring by metadata. Which of the metadata categories results in samples grouping most by color?
Integrating additional information into PCoA scatter plots¶
The PCoA results that were computed by core-metrics
are viewable as metadata, which opens them up to use with the vizard plugin.
Vizard is a general purpose plotting plugin, and works with any artifacts that can be viewed as metadata.
This opens up a world of possibility in how you visualize your microbiome data with QIIME 2.
For example, let's integrate our Jaccard PCoA results with our Observed Features data and our sample metadata in a vizard scatterplot.
When looking at the PCoA as metadata, the column labels "Axis 1", "Axis 2", ... are the first, second, ... PCoA axes.
qiime vizard scatterplot-2d \
--m-metadata-file sample-metadata.tsv unweighted-pcoa.qza richness-vector.qza \
--o-visualization unweighted-diversity-scatterplot.qzv
import qiime2.plugins.vizard.actions as vizard_actions
unweighted_pcoa_as_md_md = unweighted_pcoa.view(Metadata)
richness_as_md_md = richness_vector.view(Metadata)
unweighted_vizard_md_md = sample_metadata_md.merge(unweighted_pcoa_as_md_md, richness_as_md_md)
unweighted_diversity_scatterplot_viz, = vizard_actions.scatterplot_2d(
metadata=unweighted_vizard_md_md,
)
- Using the `qiime2 vizard scatterplot-2d` tool:
  - For "metadata":
    - Perform the following steps:
      - Leave as `Metadata from TSV`
      - Set "Metadata Source" to `sample-metadata.tsv`
    - Press the `+ Insert metadata` button to set up the next steps.
      - Change to `Metadata from Artifact`
      - Set "Metadata Source" to `unweighted-pcoa.qza`
    - Press the `+ Insert metadata` button to set up the next steps.
      - Change to `Metadata from Artifact`
      - Set "Metadata Source" to `richness-vector.qza`
  - Press the `Execute` button.
- Once completed, for the new entry in your history, use the `Edit` button to set the name as follows:
  (Renaming is optional, but it will make any subsequent steps easier to complete.)

| History Name | "Name" to set (be sure to press [Save]) |
| --- | --- |
| `#: qiime2 vizard scatterplot-2d [...] : visualization.qzv` | `unweighted-diversity-scatterplot.qzv` |
vizard_actions <- import("qiime2.plugins.vizard.actions")
unweighted_pcoa_as_md_md <- unweighted_pcoa$view(Metadata)
richness_as_md_md <- richness_vector$view(Metadata)
unweighted_vizard_md_md <- sample_metadata_md$merge(unweighted_pcoa_as_md_md, richness_as_md_md)
action_results <- vizard_actions$scatterplot_2d(
metadata=unweighted_vizard_md_md,
)
unweighted_diversity_scatterplot_viz <- action_results$visualization
unweighted_pcoa_as_md = use.view_as_metadata('unweighted_pcoa_as_md', unweighted_pcoa)
richness_as_md = use.view_as_metadata('richness_as_md', richness_vector)
unweighted_vizard_md = use.merge_metadata('unweighted_vizard_md', sample_metadata, unweighted_pcoa_as_md, richness_as_md)
use.action(
use.UsageAction(plugin_id='vizard',
action_id='scatterplot_2d'),
use.UsageInputs(metadata=unweighted_vizard_md),
use.UsageOutputNames(visualization='unweighted_diversity_scatterplot'))
Which sample types have the lowest richness? Does richness appear to be correlated with any of the PCoA axes?
Let's generate another vizard plot, but this time using our Bray-Curtis PCoA results.
qiime vizard scatterplot-2d \
--m-metadata-file sample-metadata.tsv weighted-pcoa.qza richness-vector.qza \
--o-visualization weighted-diversity-scatterplot.qzv
weighted_pcoa_as_md_md = weighted_pcoa.view(Metadata)
weighted_vizard_md_md = sample_metadata_md.merge(weighted_pcoa_as_md_md, richness_as_md_md)
weighted_diversity_scatterplot_viz, = vizard_actions.scatterplot_2d(
metadata=weighted_vizard_md_md,
)
- Using the `qiime2 vizard scatterplot-2d` tool:
  - For "metadata":
    - Perform the following steps:
      - Leave as `Metadata from TSV`
      - Set "Metadata Source" to `sample-metadata.tsv`
    - Press the `+ Insert metadata` button to set up the next steps.
      - Change to `Metadata from Artifact`
      - Set "Metadata Source" to `weighted-pcoa.qza`
    - Press the `+ Insert metadata` button to set up the next steps.
      - Change to `Metadata from Artifact`
      - Set "Metadata Source" to `richness-vector.qza`
  - Press the `Execute` button.
- Once completed, for the new entry in your history, use the `Edit` button to set the name as follows:
  (Renaming is optional, but it will make any subsequent steps easier to complete.)

| History Name | "Name" to set (be sure to press [Save]) |
| --- | --- |
| `#: qiime2 vizard scatterplot-2d [...] : visualization.qzv` | `weighted-diversity-scatterplot.qzv` |
weighted_pcoa_as_md_md <- weighted_pcoa$view(Metadata)
weighted_vizard_md_md <- sample_metadata_md$merge(weighted_pcoa_as_md_md, richness_as_md_md)
action_results <- vizard_actions$scatterplot_2d(
metadata=weighted_vizard_md_md,
)
weighted_diversity_scatterplot_viz <- action_results$visualization
weighted_pcoa_as_md = use.view_as_metadata('weighted_pcoa_as_md', weighted_pcoa)
weighted_vizard_md = use.merge_metadata('weighted_vizard_md', sample_metadata, weighted_pcoa_as_md, richness_as_md)
use.action(
use.UsageAction(plugin_id='vizard',
action_id='scatterplot_2d'),
use.UsageInputs(metadata=weighted_vizard_md),
use.UsageOutputNames(visualization='weighted_diversity_scatterplot'))
When plotting PCoA axes 1 and 2 and coloring by SampleType, is the human excrement compost (HEC) more similar to the food compost or the human excrement (HE) samples?
What sample type is the Microbe Mix most similar to? The inside of the toilet pre-use? The bulking material?
What other interesting relationships do you see when changing the x- and y-axes and sample coloring?
Which sample has the lowest microbiome richness?
Alpha rarefaction plotting¶
In this section we'll explore alpha diversity as a function of sampling depth using the alpha-rarefaction
action.
This visualizer computes one or more alpha diversity metrics at multiple sampling depths, in steps between 1 (optionally controlled with min-depth
) and the value provided as max-depth
.
At each sampling depth step, 10 rarefied tables will be generated, and the diversity metrics will be computed for all samples in the tables.
The number of iterations (rarefied tables computed at each sampling depth) can be controlled with the iterations
parameter.
Average diversity values will be plotted for each sample at each even sampling depth, and samples can be grouped based on metadata in the resulting visualization if sample metadata is provided.
The value that you provide for max-depth
should be determined by reviewing the "Frequency per sample" information presented in the kmer-table.qzv
file.
In general, choosing a value that is somewhere around the median frequency seems to work well. You may want to increase that value if the lines in the resulting rarefaction plot don't appear to be leveling out, or decrease it if you seem to be losing many of your samples at depths closer to the minimum sampling depth than the maximum.
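If you'd like to compute these summary statistics yourself, a sketch like the following works on a list of per-sample total frequencies (the totals below are hypothetical; read your real values from the kmer-table.qzv summary):

```python
# Sketch: derive candidate sampling depths from per-sample totals.
import statistics

sample_totals = [9_500, 21_000, 24_000, 30_000, 55_000, 62_000, 71_000, 88_000]

# quantiles(n=4) returns the three quartile cut points.
q1, median, q3 = statistics.quantiles(sample_totals, n=4)
print(f"first quartile (candidate sampling-depth): {q1:.0f}")
print(f"median (candidate alpha-rarefaction max-depth): {median:.0f}")

# How many samples survive the first-quartile depth?
retained = sum(1 for t in sample_totals if t >= q1)
print(f"samples retained at depth {q1:.0f}: {retained}/{len(sample_totals)}")
```

Note that `statistics.quantiles` defaults to the "exclusive" interpolation method, so the cut points may differ slightly from what a spreadsheet or the QIIME 2 summary reports.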
qiime diversity alpha-rarefaction \
--i-table kmer-table.qza \
--p-max-depth 62000 \
--m-metadata-file sample-metadata.tsv \
--o-visualization alpha-rarefaction.qzv
import qiime2.plugins.diversity.actions as diversity_actions
alpha_rarefaction_viz, = diversity_actions.alpha_rarefaction(
table=kmer_table,
max_depth=62000,
metadata=sample_metadata_md,
)
- Using the `qiime2 diversity alpha-rarefaction` tool:
  - Set "table" to `#: kmer-table.qza`
  - Set "max_depth" to `62000`
  - Expand the `additional options` section
    - For "metadata":
      - Press the `+ Insert metadata` button to set up the next steps.
      - Leave as `Metadata from TSV`
      - Set "Metadata Source" to `sample-metadata.tsv`
  - Press the `Execute` button.
- Once completed, for the new entry in your history, use the `Edit` button to set the name as follows:
  (Renaming is optional, but it will make any subsequent steps easier to complete.)

| History Name | "Name" to set (be sure to press [Save]) |
| --- | --- |
| `#: qiime2 diversity alpha-rarefaction [...] : visualization.qzv` | `alpha-rarefaction.qzv` |
diversity_actions <- import("qiime2.plugins.diversity.actions")
action_results <- diversity_actions$alpha_rarefaction(
table=kmer_table,
max_depth=62000L,
metadata=sample_metadata_md,
)
alpha_rarefaction_viz <- action_results$visualization
use.action(
use.UsageAction(plugin_id='diversity',
action_id='alpha_rarefaction'),
use.UsageInputs(table=kmer_table,
max_depth=62000,
metadata=sample_metadata),
use.UsageOutputNames(visualization='alpha_rarefaction'))
The visualization will have two plots. The top plot is an alpha rarefaction plot, and is primarily used to determine if the richness of the samples has been fully observed or sequenced. If the lines in the plot appear to "level out" (i.e., approach a slope of zero) at some sampling depth along the x-axis, that suggests that collecting additional sequences beyond that sampling depth would not be likely to result in the observation of additional features. If the lines in the plot don't level out, this may be because the richness of the samples hasn't been fully observed yet (because too few sequences were collected), or it could be an indicator that a lot of sequencing error remains in the data (which is being mistaken for novel diversity).
The bottom plot in this visualization is important when grouping samples by metadata.
It illustrates the number of samples that remain in each group when the feature table is rarefied to each sampling depth.
If a given sampling depth d
is larger than the total frequency of a sample s
(i.e., the number of sequences that were obtained for sample s
), it is not possible to compute the diversity metric for sample s
at sampling depth d
.
If many of the samples in a group have lower total frequencies than d
, the average diversity presented for that group at d
in the top plot will be unreliable because it will have been computed on relatively few samples.
When grouping samples by metadata, it is therefore essential to look at the bottom plot to ensure that the data presented in the top plot is reliable.
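The tally shown in the bottom plot can be sketched as follows (the group names and totals here are hypothetical, for illustration only):

```python
# Sketch: for each sampling depth, count how many samples in each metadata
# group still have enough sequences to be retained at that depth.
totals_by_group = {
    'Human Excrement': [9_000, 30_000, 45_000],
    'Food Compost': [25_000, 60_000],
}

def samples_retained(totals_by_group, depth):
    """Number of samples per group with total frequency >= depth."""
    return {group: sum(1 for t in totals if t >= depth)
            for group, totals in totals_by_group.items()}

for depth in (10_000, 40_000, 70_000):
    print(depth, samples_retained(totals_by_group, depth))
```

When a group's count drops toward one or zero samples at some depth, the averaged diversity for that group in the top plot becomes unreliable at and beyond that depth.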
When grouping samples by "SampleType" and viewing the alpha rarefaction plot for the "observed_features" metric, which sample types (if any) appear to exhibit sufficient diversity coverage (i.e., their rarefaction curves level out)?
Taxonomic analysis¶
Next, we can view the taxonomic composition of our samples with interactive bar plots. Generate those plots with the following command and then open the visualization.
qiime taxa barplot \
--i-table asv-table-ms2.qza \
--i-taxonomy taxonomy.qza \
--m-metadata-file sample-metadata.tsv \
--o-visualization taxa-bar-plots.qzv
import qiime2.plugins.taxa.actions as taxa_actions
taxa_bar_plots_viz, = taxa_actions.barplot(
table=asv_table_ms2,
taxonomy=taxonomy,
metadata=sample_metadata_md,
)
- Using the `qiime2 taxa barplot` tool:
  - Set "table" to `#: asv-table-ms2.qza`
  - Expand the `additional options` section
    - Set "taxonomy" to `#: taxonomy.qza`
    - For "metadata":
      - Press the `+ Insert metadata` button to set up the next steps.
      - Leave as `Metadata from TSV`
      - Set "Metadata Source" to `sample-metadata.tsv`
  - Press the `Execute` button.
- Once completed, for the new entry in your history, use the `Edit` button to set the name as follows:
  (Renaming is optional, but it will make any subsequent steps easier to complete.)

| History Name | "Name" to set (be sure to press [Save]) |
| --- | --- |
| `#: qiime2 taxa barplot [...] : visualization.qzv` | `taxa-bar-plots.qzv` |
taxa_actions <- import("qiime2.plugins.taxa.actions")
action_results <- taxa_actions$barplot(
table=asv_table_ms2,
taxonomy=taxonomy,
metadata=sample_metadata_md,
)
taxa_bar_plots_viz <- action_results$visualization
use.action(
use.UsageAction(plugin_id='taxa',
action_id='barplot'),
use.UsageInputs(table=asv_table_ms2,
taxonomy=taxonomy,
metadata=sample_metadata),
use.UsageOutputNames(visualization='taxa_bar_plots'))
Why are we using the ASV table when generating our taxonomic barplot, rather than the kmer table that weâve been using in the last few steps?
Solution to Exercise 15
We use the ASV table here because the feature ids in that table are the same as the ones used in the FeatureData[Taxonomy]
artifact.
Additionally, kmerization of our data is a tool used for computing diversity metrics - not something we generally intend to use throughout our analyses.
Visualize the samples at Level 2 (which corresponds to the phylum level in this analysis), and then sort the samples by SampleType
.
What are the dominant phyla in each SampleType
?
Differential abundance testing with ANCOM-BC¶
ANCOM-BC is a compositionally-aware linear regression model that allows testing for differentially abundant features across sample groups while also implementing bias correction.
This can be accessed using the ancombc
action in the composition plugin.
Accurately identifying individual features that are differentially abundant across sample types in microbiome data is a challenging problem and an open area of research, particularly if you don't have an a priori hypothesis about which feature(s) are differentially abundant. A q-value that suggests you've identified a feature that is differentially abundant across sample groups should be considered a hypothesis, not a conclusion, and you need new samples to test that new hypothesis.
In addition to the methods contained in the composition plugin, new approaches for differential abundance testing are regularly introduced. It's worth assessing the current state of the field when performing differential abundance testing to see if there are new methods that might be useful for your data. If in doubt, consult a statistician.
We'll perform this analysis in a few steps.
Filter samples from the feature table¶
First, we'll filter the samples in our feature table so that we retain only the three sample types for which we have the most samples.
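The `--p-where` expression below is SQLite-style syntax evaluated against the sample metadata. Conceptually, the filter reduces to a set-membership test like this pure-Python sketch (the sample ids and types below are illustrative, not from the real metadata file):

```python
# Sketch of what the where clause does: keep only samples whose SampleType
# is one of the three dominant groups.
keep = {"Human Excrement Compost", "Human Excrement", "Food Compost"}

metadata = {
    'sample-1': 'Human Excrement',
    'sample-2': 'Bulking Material',
    'sample-3': 'Food Compost',
}

retained_ids = {sid for sid, stype in metadata.items() if stype in keep}
print(sorted(retained_ids))  # the samples kept in the filtered table
```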
qiime feature-table filter-samples \
--i-table asv-table-ms2.qza \
--m-metadata-file sample-metadata.tsv \
--p-where '[SampleType] IN ("Human Excrement Compost", "Human Excrement", "Food Compost")' \
--o-filtered-table asv-table-ms2-dominant-sample-types.qza
asv_table_ms2_dominant_sample_types, = feature_table_actions.filter_samples(
table=asv_table_ms2,
metadata=sample_metadata_md,
where='[SampleType] IN ("Human Excrement Compost", "Human Excrement", "Food Compost")',
)
- Using the `qiime2 feature-table filter-samples` tool:
  - Set "table" to `#: asv-table-ms2.qza`
  - Expand the `additional options` section
    - For "metadata":
      - Press the `+ Insert metadata` button to set up the next steps.
      - Leave as `Metadata from TSV`
      - Set "Metadata Source" to `sample-metadata.tsv`
    - Set "where" to `[SampleType] IN ("Human Excrement Compost", "Human Excrement", "Food Compost")`
  - Press the `Execute` button.
- Once completed, for the new entry in your history, use the `Edit` button to set the name as follows:
  (Renaming is optional, but it will make any subsequent steps easier to complete.)

| History Name | "Name" to set (be sure to press [Save]) |
| --- | --- |
| `#: qiime2 feature-table filter-samples [...] : filtered_table.qza` | `asv-table-ms2-dominant-sample-types.qza` |
action_results <- feature_table_actions$filter_samples(
table=asv_table_ms2,
metadata=sample_metadata_md,
where='[SampleType] IN ("Human Excrement Compost", "Human Excrement", "Food Compost")',
)
asv_table_ms2_dominant_sample_types <- action_results$filtered_table
asv_table_ms2_dominant_sample_types, = use.action(
use.UsageAction(plugin_id='feature_table',
action_id='filter_samples'),
use.UsageInputs(table=asv_table_ms2,
metadata=sample_metadata,
where='[SampleType] IN ("Human Excrement Compost", "Human Excrement", "Food Compost")'),
use.UsageOutputNames(filtered_table='asv_table_ms2_dominant_sample_types'))
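As a sanity check outside of QIIME 2, you can tally samples per type directly from the metadata with pandas and confirm which sample types dominate. The `SampleType` values below are drawn from the tutorial, but the small table itself is invented for illustration; with real data you would load `sample-metadata.tsv` instead.

```python
import pandas as pd

# Hypothetical stand-in for sample-metadata.tsv (illustrative rows only);
# with real data: metadata = pd.read_csv('sample-metadata.tsv', sep='\t')
metadata = pd.DataFrame({
    'SampleType': ['Human Excrement Compost', 'Human Excrement',
                   'Food Compost', 'Human Excrement Compost',
                   'Bulking Material', 'Human Excrement Compost'],
})

# Count samples per type to see which groups are best sampled.
counts = metadata['SampleType'].value_counts()

# Mimic the filter-samples `where` clause: keep only the dominant types.
keep = ['Human Excrement Compost', 'Human Excrement', 'Food Compost']
kept = metadata[metadata['SampleType'].isin(keep)]
print(counts.to_dict())
print(len(kept))
```

This mirrors what the SQLite `where` expression does server-side: rows whose `SampleType` is outside the listed groups are dropped from the table.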
Collapse the ASVs into genera¶
Then, we'll collapse our ASVs into genera (i.e., level 6 of the Greengenes taxonomy) to get more useful annotations of the features (and to learn how to perform this grouping).
qiime taxa collapse \
--i-table asv-table-ms2-dominant-sample-types.qza \
--i-taxonomy taxonomy.qza \
--p-level 6 \
--o-collapsed-table genus-table-ms2-dominant-sample-types.qza
genus_table_ms2_dominant_sample_types, = taxa_actions.collapse(
table=asv_table_ms2_dominant_sample_types,
taxonomy=taxonomy,
level=6,
)
- Using the qiime2 taxa collapse tool:
  - Set "table" to #: asv-table-ms2-dominant-sample-types.qza
  - Set "taxonomy" to #: taxonomy.qza
  - Set "level" to 6
  - Press the Execute button.
- Once completed, for the new entry in your history, use the Edit button to set the name as follows (renaming is optional, but it will make any subsequent steps easier to complete):
  - History Name: #: qiime2 taxa collapse [...] : collapsed_table.qza
  - "Name" to set (be sure to press [Save]): genus-table-ms2-dominant-sample-types.qza
action_results <- taxa_actions$collapse(
table=asv_table_ms2_dominant_sample_types,
taxonomy=taxonomy,
level=6L,
)
genus_table_ms2_dominant_sample_types <- action_results$collapsed_table
genus_table_ms2_dominant_sample_types, = use.action(
use.UsageAction(plugin_id='taxa',
action_id='collapse'),
use.UsageInputs(table=asv_table_ms2_dominant_sample_types,
taxonomy=taxonomy,
level=6),
use.UsageOutputNames(collapsed_table='genus_table_ms2_dominant_sample_types'))
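To make "level 6" concrete, here is a plain-Python sketch (independent of QIIME 2) of how a Greengenes-style taxonomy string is truncated to its first six ranks; collapsed feature IDs are labeled this way, so ASVs sharing the first six ranks are summed into one genus-level feature. The taxonomy string is a made-up example.

```python
# Hypothetical Greengenes-style taxonomy string: seven semicolon-delimited
# ranks (kingdom, phylum, class, order, family, genus, species).
taxonomy = ('k__Bacteria; p__Firmicutes; c__Bacilli; o__Lactobacillales; '
            'f__Lactobacillaceae; g__Lactobacillus; s__iners')

def collapse_to_level(taxon, level):
    """Keep only the first `level` ranks of a taxonomy string."""
    ranks = [r.strip() for r in taxon.split(';')]
    return ';'.join(ranks[:level])

genus_id = collapse_to_level(taxonomy, 6)
print(genus_id)
```

At level 6 the species rank is discarded, so two ASVs assigned to different Lactobacillus species would share this genus-level ID.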
Apply differential abundance testing¶
Then, we'll apply ANCOM-BC to see which genera are differentially abundant across those sample types. I specify a reference level here because it defines what each group is compared against. Since the focus of this study is HEC, I choose that as my reference level. That will let us see which genera are over- or under-represented in the Human Excrement Compost samples relative to the other two sample groups.
qiime composition ancombc \
--i-table genus-table-ms2-dominant-sample-types.qza \
--m-metadata-file sample-metadata.tsv \
--p-formula SampleType \
--p-reference-levels 'SampleType::Human Excrement Compost' \
--o-differentials genus-ancombc.qza
import qiime2.plugins.composition.actions as composition_actions
genus_ancombc, = composition_actions.ancombc(
table=genus_table_ms2_dominant_sample_types,
metadata=sample_metadata_md,
formula='SampleType',
reference_levels=['SampleType::Human Excrement Compost'],
)
- Using the qiime2 composition ancombc tool:
  - Set "table" to #: genus-table-ms2-dominant-sample-types.qza
  - For "metadata":
    - Leave as Metadata from TSV
    - Set "Metadata Source" to sample-metadata.tsv
  - Set "formula" to SampleType
  - Expand the additional options section:
    - For "reference_levels":
      - Set "element" to SampleType::Human Excrement Compost
      - (Do not insert additional values.)
  - Press the Execute button.
- Once completed, for the new entry in your history, use the Edit button to set the name as follows (renaming is optional, but it will make any subsequent steps easier to complete):
  - History Name: #: qiime2 composition ancombc [...] : differentials.qza
  - "Name" to set (be sure to press [Save]): genus-ancombc.qza
composition_actions <- import("qiime2.plugins.composition.actions")
action_results <- composition_actions$ancombc(
table=genus_table_ms2_dominant_sample_types,
metadata=sample_metadata_md,
formula='SampleType',
reference_levels=list('SampleType::Human Excrement Compost'),
)
genus_ancombc <- action_results$differentials
genus_ancombc, = use.action(
use.UsageAction(plugin_id='composition',
action_id='ancombc'),
use.UsageInputs(table=genus_table_ms2_dominant_sample_types,
metadata=sample_metadata,
formula='SampleType',
reference_levels=['SampleType::Human Excrement Compost']),
use.UsageOutputNames(differentials='genus_ancombc'))
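If the role of the reference level is unclear, it helps to think of the `SampleType` formula term as dummy coding of the categorical column: the reference category is the one that gets no indicator column, so every coefficient describes a contrast against it. A small pandas sketch (the sample list is invented for illustration; this is an analogy to the model's design matrix, not ANCOM-BC's actual internals):

```python
import pandas as pd

# Hypothetical SampleType column from a few samples.
sample_type = pd.Series(['Human Excrement Compost', 'Human Excrement',
                         'Food Compost', 'Human Excrement Compost'])

# Ordering the categories with the reference level first, then dropping the
# first indicator column, mirrors choosing HEC as the reference level.
categories = ['Human Excrement Compost', 'Human Excrement', 'Food Compost']
coded = pd.get_dummies(
    pd.Categorical(sample_type, categories=categories),
    drop_first=True)
print(list(coded.columns))
```

Only the non-reference groups get columns, which is why the ANCOM-BC output reports one set of statistics per non-reference sample type.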
Finally, we'll visualize the results.
qiime composition da-barplot \
--i-data genus-ancombc.qza \
--p-significance-threshold 0.001 \
--p-level-delimiter ';' \
--o-visualization genus-ancombc.qzv
genus_ancombc_viz, = composition_actions.da_barplot(
data=genus_ancombc,
significance_threshold=0.001,
level_delimiter=';',
)
- Using the qiime2 composition da-barplot tool:
  - Set "data" to #: genus-ancombc.qza
  - Expand the additional options section:
    - Set "significance_threshold" to 0.001
    - Set "level_delimiter" to ;
  - Press the Execute button.
- Once completed, for the new entry in your history, use the Edit button to set the name as follows (renaming is optional, but it will make any subsequent steps easier to complete):
  - History Name: #: qiime2 composition da-barplot [...] : visualization.qzv
  - "Name" to set (be sure to press [Save]): genus-ancombc.qzv
action_results <- composition_actions$da_barplot(
data=genus_ancombc,
significance_threshold=0.001,
level_delimiter=';',
)
genus_ancombc_viz <- action_results$visualization
use.action(
use.UsageAction(plugin_id='composition',
action_id='da_barplot'),
use.UsageInputs(data=genus_ancombc,
significance_threshold=0.001,
level_delimiter=';'),
use.UsageOutputNames(visualization='genus_ancombc'))
Which genus is most enriched in HEC relative to Food Compost? Which genus is most enriched in HEC relative to Human Excrement?
Which genus is most depleted in HEC relative to Food Compost? Which genus is most depleted in HEC relative to Human Excrement?
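To reason about these questions, recall the usual regression sign convention: with Human Excrement Compost as the reference, a negative log-fold change (lfc) for a genus in the Food Compost contrast means that genus is lower in Food Compost than in HEC (i.e., enriched in HEC relative to Food Compost), and a positive lfc means the opposite. The ranking the barplot shows can be sketched with pandas; the genus names and lfc values below are invented for illustration:

```python
import pandas as pd

# Hypothetical lfc values for the Food Compost contrast (reference: HEC).
lfc_vs_food = pd.Series({'g__GenusA': -2.3, 'g__GenusB': 1.1,
                         'g__GenusC': 0.4})

# Most negative lfc: most enriched in HEC relative to Food Compost.
# Most positive lfc: most depleted in HEC relative to Food Compost.
most_enriched_in_hec = lfc_vs_food.idxmin()
most_depleted_in_hec = lfc_vs_food.idxmax()
print(most_enriched_in_hec, most_depleted_in_hec)
```

Reading the HEC-versus-Human-Excrement panel of the barplot works the same way, just on the Human Excrement contrast instead.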
That's it for now, but more is coming soon!¶
In the near future (as of 11 March 2025) we plan to integrate analyses using the sample-classifier plugin and longitudinal plugin. In the meantime, here are some suggestions to continue your learning:
- Build a machine learning classifier that classifies samples according to the three dominant sample types in the feature table that we used with ANCOM-BC. (Hint: see classify-samples.)
- Perform a longitudinal analysis that tracks samples from different buckets over time. Which taxa change most over time? (Hint: see feature-volatility.)
- Remember that the full data set (five sequencing runs) is available in the gut-to-soil Artifact Archive. Grab one of the larger sequencing runs (we worked with a small sequencing run that was generated as a preliminary test), and adapt the commands in this tutorial to work on a bigger data set.
We're also in the process of refactoring our statistical methods for assessing alpha and beta diversity across groups, using the new stats plugin. We're therefore holding off on integrating statistical analysis until we have that ready. In the meantime, you can refer to the Moving Pictures tutorial, as well as the sample-classifier and longitudinal tutorials.
Replay provenance (work in progress!)¶
This section is a work in progress as of 11 March 2025. Right now it only demonstrates replaying provenance from a single visualization, but it will be expanded to reflect the full analysis. Also, note that all commands for training the feature classifier are included in the resulting replay script (in case there are some commands in there that you don't remember running). More on this coming soon!
You might next want to try to adapt the commands presented in this tutorial to your own data, adjusting parameter settings and metadata column headers as is relevant.
QIIME 2âs provenance replay functionality can help with this.
Assuming that you ran all of the steps above in a directory called gut-to-soil/, run the following command to generate a template script that you can adapt for your workflow:
qiime tools replay-provenance \
--in-fp gut-to-soil/taxa-bar-plots.qzv \
--out-fp g2s-replayed.bash
If you need help, head over to the QIIME 2 Forum.
The resource requirements exceed those provided by the Read the Docs (RTD) build system, which is used to build the documentation that you're reading. RTD provides systems with 7 GB of RAM for a maximum of 30 minutes to build documentation. That's a very reasonable (and generous) allocation for building documentation, so we choose to work within those constraints rather than creating our own documentation build system as we've done in the past (e.g., for https://docs.qiime2.org).