Chapter 3 Pre-processing Sequence Reads
3.1 Import data
First up gotta import reads into a qiime artifact (called here input-data-casava
)
3.2 Denoise with dada2
This step takes a while. so may want to setup within a tmux session if running on nectar cloud. Note I use a trim-left parameter of 15 on both the forward and reverse reads to deal with the amplicon primers (as seems to work better than using cut-adapt).
$ qiime dada2 denoise-paired \
--p-n-threads 8 \
--i-demultiplexed-seqs input-data-casava.qza \
--p-trim-left-f 15 \
--p-trim-left-r 15 \
--p-trunc-len-f 260 \
--p-trunc-len-r 215 \
--output-dir DADA2_denoising_output \
--verbose \
&> DADA2_denoising.log
So now we have a folder, DADA2_denoisng_output
with following items in there:
- denoising_stats.qza
- representative_sequences.qza
- table.qza
3.3 Check denoising stats
3.4 Get FeatureTable [frequency] and FeatureData [sequences]
$ qiime feature-table summarize \
--i-table table.qza \
--o-visualization table.qzv \
--m-sample-metadata-file ../metadata.txt
$ qiime feature-table tabulate-seqs \
--i-data representative_sequences.qza \
--o-visualization representative_sequences.qzv
Note: min freq: 20,718.0 reads
3.5 Train Classifier (for doing taxonomy)
First up, we gotta import a the greengenes database stuff into qiime as qiime artifacts.
$ mkdir classifier
$ cd classifier
# download green genes 13_8
$ wget ftp://greengenes.microbio.me/greengenes_release/gg_13_5/gg_13_8_otus.tar.gz
$ gunzip gg_13_8_otus.tar.gz
$ tar -xvf gg_13_8_otus.tar
$ rm gg_13_8_otus.tar
# the sequences from greengenes
$ qiime tools import \
--type 'FeatureData[Sequence]' \
--input-path gg_13_8_otus/rep_set/99_otus.fasta \
--output-path 99_otus.qza
# The taxonomy strings from greengenes
$ qiime tools import \
--type 'FeatureData[Taxonomy]' \
--input-format HeaderlessTSVTaxonomyFormat \
--input-path gg_13_8_otus/taxonomy/99_otu_taxonomy.txt \
--output-path 99_otus_16S_taxonomy.qza
Extract reads from the relevant region (V3-V5) out of the ref database
$ qiime feature-classifier extract-reads \
--i-sequences 99_otus.qza \
--p-f-primer CCTACGGGNGGCWGCAG \
--p-r-primer GACTACHVGGGTATCTAATCC \
--p-min-length 300 \
--p-max-length 600 \
--o-reads ref_seqs.qza \
--verbose \
&> 16S_training.log
Now, train the classifier on this region. I.e use this ref_seqs.qza
qiime artifact as the input for feature-classifier.
$ qiime feature-classifier fit-classifier-naive-bayes \
--i-reference-reads ref_seqs.qza \
--i-reference-taxonomy 99_otus_16S_taxonomy.qza \
--o-classifier classifier_16S.qza \
--verbose \
&> 16S_classifier.log
The output of this is our classifier, which I’ve called classifier_16S.qza
, is what will be used in the next step.
3.6 Classify rep seqs
So now we’ve got the classifier, we can use it to classify our representative seqs (rep-seqs.qza), which I then tabulated and visualised:
$ cd ..
$ mkdir taxonomy
$ qiime feature-classifier classify-sklearn \
--i-classifier classifier/classifier_16S.qza \
--i-reads DADA2_denoising_output/representative_sequences.qza \
--o-classification taxonomy/classified_rep_seqs.qza
# Tabulate the features, their taxonomy and the confidence of taxonomy assignment
$ cd taxonomy
$ qiime metadata tabulate \
--m-input-file classified_rep_seqs.qza \
--o-visualization classified_rep_seqs.qzv
3.7 Create a phylogenetic tree
Note that this only needs the rep seqs (not the classified rep seqs)
$ cd ..
$ qiime phylogeny align-to-tree-mafft-fasttree \
--i-sequences DADA2_denoising_output/representative_sequences.qza \
--output-dir phylogenetic_tree \
--p-n-threads 8 \
--verbose \
&> phylogenetic_tree_generation.log
Note: it’s the rooted_tree.qza file which will be used in downstream analyses.
Now on to the fun stuff - taxonomy and diversity analyses!