Chapter 3 Pre-processing Sequence Reads

3.1 Import data

First up gotta import reads into a qiime artifact (called here input-data-casava)

$ qiime tools import \
--type 'SampleData[PairedEndSequencesWithQuality]' \
--input-path data-raw/ \
--input-format CasavaOneEightSingleLanePerSampleDirFmt \
--output-path input-data-casava.qza

3.2 Denoise with dada2

This step takes a while. so may want to setup within a tmux session if running on nectar cloud. Note I use a trim-left parameter of 15 on both the forward and reverse reads to deal with the amplicon primers (as seems to work better than using cut-adapt).

$ qiime dada2 denoise-paired \
--p-n-threads 8 \
--i-demultiplexed-seqs input-data-casava.qza \
--p-trim-left-f 15 \
--p-trim-left-r 15 \
--p-trunc-len-f 260 \
--p-trunc-len-r 215 \
--output-dir DADA2_denoising_output \
--verbose \
&> DADA2_denoising.log

So now we have a folder, DADA2_denoisng_output with following items in there:
- denoising_stats.qza
- representative_sequences.qza
- table.qza

3.3 Check denoising stats

$ cd DADA2_denoising_output

$ qiime metadata tabulate \
--m-input-file denoising_stats.qza \
--o-visualization denoising_stats.qzv \

3.4 Get FeatureTable [frequency] and FeatureData [sequences]

$ qiime feature-table summarize \
  --i-table table.qza \
  --o-visualization table.qzv \
  --m-sample-metadata-file ../metadata.txt 

$ qiime feature-table tabulate-seqs \
  --i-data representative_sequences.qza \
  --o-visualization representative_sequences.qzv

Note: min freq: 20,718.0 reads

3.5 Train Classifier (for doing taxonomy)

First up, we gotta import a the greengenes database stuff into qiime as qiime artifacts.

$ mkdir classifier
$ cd classifier

# download green genes 13_8
$ wget ftp://greengenes.microbio.me/greengenes_release/gg_13_5/gg_13_8_otus.tar.gz
$ gunzip gg_13_8_otus.tar.gz
$ tar -xvf gg_13_8_otus.tar
$ rm gg_13_8_otus.tar

 # the sequences from greengenes
$ qiime tools import \
--type 'FeatureData[Sequence]' \
--input-path gg_13_8_otus/rep_set/99_otus.fasta \
--output-path 99_otus.qza

# The taxonomy strings from greengenes
$ qiime tools import \
--type 'FeatureData[Taxonomy]' \
--input-format HeaderlessTSVTaxonomyFormat \
--input-path gg_13_8_otus/taxonomy/99_otu_taxonomy.txt \
--output-path 99_otus_16S_taxonomy.qza

Extract reads from the relevant region (V3-V5) out of the ref database

$ qiime feature-classifier extract-reads \
--i-sequences 99_otus.qza \
--p-f-primer CCTACGGGNGGCWGCAG  \
--p-r-primer GACTACHVGGGTATCTAATCC   \
--p-min-length 300 \
--p-max-length 600 \
--o-reads ref_seqs.qza \
--verbose \
&> 16S_training.log

Now, train the classifier on this region. I.e use this ref_seqs.qza qiime artifact as the input for feature-classifier.

$ qiime feature-classifier fit-classifier-naive-bayes \
--i-reference-reads ref_seqs.qza \
--i-reference-taxonomy 99_otus_16S_taxonomy.qza \
--o-classifier classifier_16S.qza \
--verbose \
&> 16S_classifier.log

The output of this is our classifier, which I’ve called classifier_16S.qza, is what will be used in the next step.

3.6 Classify rep seqs

So now we’ve got the classifier, we can use it to classify our representative seqs (rep-seqs.qza), which I then tabulated and visualised:

$ cd ..
$ mkdir taxonomy

$ qiime feature-classifier classify-sklearn \
--i-classifier classifier/classifier_16S.qza \
--i-reads DADA2_denoising_output/representative_sequences.qza \
--o-classification taxonomy/classified_rep_seqs.qza

# Tabulate the features, their taxonomy and the confidence of taxonomy assignment
$ cd taxonomy

$ qiime metadata tabulate \
--m-input-file classified_rep_seqs.qza \
--o-visualization classified_rep_seqs.qzv

3.7 Create a phylogenetic tree

Note that this only needs the rep seqs (not the classified rep seqs)

$ cd ..

$ qiime phylogeny align-to-tree-mafft-fasttree \
--i-sequences DADA2_denoising_output/representative_sequences.qza \
--output-dir phylogenetic_tree \
--p-n-threads 8 \
--verbose \
&> phylogenetic_tree_generation.log

Note: it’s the rooted_tree.qza file which will be used in downstream analyses.

Now on to the fun stuff - taxonomy and diversity analyses!