Introduction
metaNanoPype is a modular, Python-based pipeline for the reproducible analysis of nanopore metabarcoding data. It consists of the following modules:
- (I) demultiplexing (not implemented yet);
- (II) quality-assessment (implemented);
- (III) quality-filtering and trimming (implemented);
- (IV) polishing/read correction (not implemented yet);
- (V) taxonomic classification (implemented);
- (VI) diversity analyses (alpha- and beta-diversity) (implemented).
Depending on the step, a module can offer more than one option (i.e., tool or algorithm) to give the user flexibility. Each step generates a log file with the information needed to build an optional report in html format describing the results, the versions of the software used, and the scripts and references for that software.
Installation
metaNanoPype is a modular pipeline built on python scripts that wrap other algorithms/tools, which need to be installed on your system.
In general, it requires:
- python >= v.3.6.9 (installation)
- metaNanoPype python scripts:
  git clone https://github.com/antonioggsousa/metaNanoPype.git
In addition, each module requires several third-party, stand-alone tools installed on your system:
- (I) demultiplexing (not implemented yet);
- (II) quality-assessment (implemented);
  - fastqc v.0.11.5 (installation)
  - multiqc v.1.9 (installation)
  - NanoPlot v.1.29.1 (installation)
- (III) quality-filtering and trimming (implemented);
  - porechop v.0.2.4 (installation)
  - NanoFilt v.2.7.1 (installation)
- (IV) polishing/read correction (not implemented yet);
- (V) taxonomic classification (implemented);
  - kraken2 v.2.1.1 (installation)
- (VI) diversity analyses (alpha- and beta-diversity) (implemented).
  - R v.4.0.3 (installation)
  - phyloseq v.4.0.3 (installation)
An easier way to install all the software and dependencies is to use the environment.yaml file provided in the metaNanoPype repository to create a conda environment that contains all of them. To do so, run the following (you need to have git and miniconda3 installed):
git clone https://github.com/antonioggsousa/metaNanoPype.git # download metaNanoPype repo
export PATH="$PWD/metaNanoPype/bin:$PATH" # make it accessible in your PATH
conda activate # activate conda
conda env create -f ./metaNanoPype/environment.yaml # create conda env with all soft/dependencies
conda activate metaNanoPype # activate new conda env
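Once the environment is active, it can help to confirm that the third-party tools are actually visible on your PATH. The following is a minimal sketch and not part of metaNanoPype: the function name check_tools is hypothetical, and the executable names in the example call (fastqc, multiqc, NanoPlot, porechop, NanoFilt, kraken2, Rscript) are assumed to match how each tool is installed on your system.

```shell
# Sketch: report which tools are visible on the PATH.
# check_tools is a hypothetical helper, not a metaNanoPype command.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found: $tool"
        else
            echo "missing: $tool"
            missing=1
        fi
    done
    # Non-zero exit status if at least one tool was not found
    return "$missing"
}

# Example call (run it inside the activated conda environment):
# check_tools fastqc multiqc NanoPlot porechop NanoFilt kraken2 Rscript
```

The function returns a non-zero status when something is missing, so it can also be used as a guard at the top of your own scripts.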
Tutorial
This quick-start tutorial uses the publicly available nanopore full-length 16S rRNA amplicon sequences published in Low et al. 2021: Evaluation of full-length nanopore 16S sequencing for detection of pathogens in microbial keratitis.
This tutorial was performed on a server running Ubuntu 18.04.5 LTS with 20 cores and ~120 GB of RAM. Most of the commands used can be reproduced on any Linux distribution or other UNIX-like OS. You will start by downloading the metaNanoPype GitHub repository and using the provided conda recipe to create, with miniconda3, the conda environment with all the dependencies.
0) Create the directory structure to reproduce the tutorial and scripts:
Change into the directory where you want to reproduce the tutorial and create the directory structure:
mkdir bin scripts data results report
Download metaNanoPype scripts:
cd bin
git clone https://github.com/antonioggsousa/metaNanoPype.git
Next, add the metaNanoPype bin folder to your PATH:
export PATH="$PWD/metaNanoPype/bin:$PATH"
Finally, install all the software and dependencies needed to work with metaNanoPype and run this tutorial using miniconda3 (please install miniconda3 first if you don’t already have it; see instructions):
conda activate # activate conda
conda env create -f ./metaNanoPype/environment.yaml # create conda env with all soft/dependencies
conda activate metaNanoPype # activate new conda env
1) Download the full-length 16S rRNA amplicon fastq files:
The fastq files were deposited in ENA (European Nucleotide Archive) under the project accession number: PRJEB37709.
The files can be downloaded in several ways. A convenient one is the ENA toolkit enaBrowserTools, for example its enaGroupGet command. Download the toolkit from GitHub and then download the data (follow the steps below).
git clone https://github.com/enasequence/enaBrowserTools.git
Next, add the enaBrowserTools python3 folder to your PATH:
export PATH="$PWD/enaBrowserTools/python3:$PATH"
Download the fastq sequencing data of the PRJEB37709 project (note: this will take a while):
cd ../data #change into data folder first
enaGroupGet -g read -f fastq PRJEB37709
To work with the files more conveniently, move all the individual fastq files into a single new folder and delete the original folder (which holds each fastq file in a sample-specific subfolder):
mkdir fastq
mv ./PRJEB37709/*/*.fastq.gz ./fastq
rm -rf ./PRJEB37709
Remove the fastq file ERR4836977.fastq.gz because fastqc raises an exception error message when processing it. We’ll take a deeper look into it later. For now just remove it:
rm ./fastq/ERR4836977.fastq.gz
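One quick diagnostic for a file like this, assuming the problem might be a truncated or corrupted download, is to test whether each compressed file is a valid gzip archive. This is only a sketch under that assumption and it may not explain the fastqc error; check_fastq_gz is a hypothetical helper, not a metaNanoPype command, and it would have to be run before the file is deleted.

```shell
# Sketch: list the .fastq.gz files in a folder that fail a gzip integrity test.
# check_fastq_gz is a hypothetical helper, not a metaNanoPype command.
check_fastq_gz() {
    for f in "$1"/*.fastq.gz; do
        [ -e "$f" ] || continue        # the folder has no matching files
        if ! gzip -t "$f" 2>/dev/null; then
            echo "corrupted: $f"
        fi
    done
}

# Example call:
# check_fastq_gz ./fastq
```

Files that pass the test print nothing, so an empty output means every archive could be read to the end.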
2) Assess the quality of the nanopore full-length 16S rRNA amplicon sequences with fastqc, NanoPlot and multiqc in a single command line with the fastqc-py metaNanoPype script:
Change directory to scripts so that the log is saved under the scripts folder:
cd ../scripts
fastqc-py --help # display options
fastqc-py -f ../data/fastq -n -t 10 -o ../results/QC
The command above takes as input all the fastq files in the folder ../data/fastq (-f option), runs NanoPlot (-n), uses 10 threads (-t 10), and saves the output (-o option) in the folder ../results/QC.
You can inspect the quality of the nanopore 16S sequences by looking at the individual html files produced by fastqc, the aggregated html report produced by multiqc, and the report produced by NanoPlot at ../results/QC. Together, these give you a good picture of the quality of your data.
3) The next step consists of filtering and trimming low-quality nanopore sequences using porechop and NanoFilt in a single command line with the filter_fastq-py metaNanoPype script:
filter_fastq-py --help # display options
filter_fastq-py -f ../data/fastq -t 10 -min_len 1000 -max_len 1800 -qs 10 -o ../data/trim
The command above filters/trims the fastq files in the folder ../data/fastq (-f option), using 10 threads (-t 10, passed to Porechop) and discarding reads shorter than 1000 bp (-min_len 1000) or longer than 1800 bp (-max_len 1800), as well as reads with a quality score lower than 10 (-qs 10). The resulting good-quality full-length nanopore reads are saved at ../data/trim (with the option -o ../data/trim). Find more options with the filter_fastq-py --help command.
The previous command first runs Porechop, creating fastq files named <sample_name>_adapter_clipped.fastq, which are then processed by NanoFilt to produce the files <sample_name>_trimmed.fastq. Since we are only interested in the latter, let’s discard the adapter-clipped files:
rm ../data/trim/*_adapter_clipped.fastq
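To make the length criterion concrete, the sketch below keeps only the reads whose length falls inside a given range, using awk. It only illustrates what -min_len and -max_len express and is not how NanoFilt or filter_fastq-py implement it; it assumes plain 4-line FASTQ records, and filter_by_length is a hypothetical helper name.

```shell
# Sketch: keep only FASTQ records whose sequence length lies within [min, max].
# Assumes plain 4-line FASTQ records (no wrapped sequences).
# filter_by_length is a hypothetical helper, not a metaNanoPype command.
filter_by_length() {
    awk -v min="$2" -v max="$3" '
        NR % 4 == 1 { header = $0 }
        NR % 4 == 2 { seq = $0 }
        NR % 4 == 3 { plus = $0 }
        NR % 4 == 0 {
            n = length(seq)
            if (n >= min && n <= max)
                print header "\n" seq "\n" plus "\n" $0
        }
    ' "$1"
}

# Example call (file names are placeholders):
# filter_by_length sample.fastq 1000 1800 > sample_len_filtered.fastq
```

Both bounds are inclusive here, matching the description above: reads shorter than the minimum or longer than the maximum are dropped.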
4) Re-assess the quality of the reads after filtering/trimming them with the fastqc-py metaNanoPype script:
fastqc-py -f ../data/trim -n -t 10 -o ../results/QC_trim
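Besides the html reports, a simple way to see how much data the filtering removed is to compare read counts before and after. The sketch below counts reads in a plain or gzip-compressed FASTQ file; it assumes 4-line records, and count_reads is a hypothetical helper rather than a metaNanoPype command.

```shell
# Sketch: count the reads in a FASTQ file (plain or gzip-compressed).
# Assumes 4-line FASTQ records; count_reads is a hypothetical helper.
count_reads() {
    case "$1" in
        *.gz) gzip -dc "$1" ;;
        *)    cat "$1" ;;
    esac | awk 'END { print int(NR / 4) }'
}

# Example: compare raw and trimmed read counts for one sample
# (<sample_name> is a placeholder):
# count_reads ../data/fastq/<sample_name>.fastq.gz
# count_reads ../data/trim/<sample_name>_trimmed.fastq
```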
5) Perform taxonomic assignment with kraken2, including the download of indexed 16S rRNA reference databases, in a single command line with the tax_assign-py metaNanoPype script:
tax_assign-py --help # display options
# create variables to simplify
FASTQ=$(ls ../data/trim/*_trimmed.fastq | xargs echo | sed 's/ /,/g')
SAMPLES=$(echo $FASTQ | sed 's/..\/data\/trim\/\|_trimmed.fastq//g')
DB=../data/SILVA_DB
tax_assign-py -i $FASTQ -t 10 -db $DB -db_down silva -s $SAMPLES -r -o ../results/tax
The command above downloads the indexed 16S rRNA database silva (-db_down silva option) into the directory ../data/SILVA_DB (-db $DB option), which is the reference against which the nanopore 16S reads provided at ../data/trim/*_trimmed.fastq are classified (-i $FASTQ option, a comma-separated list of fastq file paths), using 10 threads (-t 10). The output folder with the results is ../results/tax (-o ../results/tax option), with files named after the sample names given with -s $SAMPLES (a comma-separated list of sample/file names in the same order as the input fastq files). The output file name is <sample_name>.out. Since the option -r was provided, a kraken2 report file is also created (named <sample_name>.report).
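The FASTQ and SAMPLES variables used above can also be built with a plain shell loop, which avoids parsing the output of ls and the \| alternation in sed, which may not be supported by every sed implementation. This is a sketch of an equivalent approach, not the method metaNanoPype prescribes; build_lists is a hypothetical helper that sets both variables from a folder of *_trimmed.fastq files.

```shell
# Sketch: build the comma-separated file and sample lists with a loop.
# build_lists is a hypothetical helper; it sets FASTQ and SAMPLES.
build_lists() {
    FASTQ=""
    SAMPLES=""
    for f in "$1"/*_trimmed.fastq; do
        [ -e "$f" ] || continue                 # no matching files
        name=$(basename "$f" _trimmed.fastq)    # strip folder and suffix
        # Append with a leading comma only when the list is not empty
        FASTQ="${FASTQ:+$FASTQ,}$f"
        SAMPLES="${SAMPLES:+$SAMPLES,}$name"
    done
}

# Example call:
# build_lists ../data/trim
# echo "$FASTQ"
# echo "$SAMPLES"
```

Because both lists are filled in the same loop, the sample names are guaranteed to be in the same order as the fastq files.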
Inspect the kraken2 taxonomic assignments at: ../results/tax
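A quick way to get a first impression of a sample is to list its most abundant genera straight from a report file. The sketch below assumes the standard tab-separated kraken2 report layout (percentage, reads in clade, reads assigned directly, rank code, taxonomy ID, indented name) and that the files written with -r follow it; top_genera is a hypothetical helper, not a metaNanoPype command.

```shell
# Sketch: list the genera in a kraken2 report, most abundant first.
# Assumes the standard six tab-separated report columns:
# percentage, clade reads, direct reads, rank code, taxonomy ID, name.
# top_genera is a hypothetical helper; $2 is the number of rows (default 10).
top_genera() {
    awk -F '\t' '$4 == "G" {
        name = $6
        sub(/^ +/, "", name)        # drop the indentation of the name
        print $2 "\t" name          # clade read count, then genus name
    }' "$1" | sort -t "$(printf '\t')" -k1,1nr | head -n "${2:-10}"
}

# Example call (<sample_name> is a placeholder):
# top_genera ../results/tax/<sample_name>.report 5
```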
6) Finally, run the taxonomic and diversity analyses with phyloseq in a single command line with the tax_div-py metaNanoPype script:
tax_div-py --help # display options
tax_div-py -d ../results/tax/ -o ../results/div -r ../report/tax_div_report.html
The command above searches for all the files with the extension *.report under the directory ../results/tax/ (-d ../results/tax/ option) and saves the results of the taxonomic and diversity analyses into the output folder ../results/div (-o ../results/div). In addition, it builds an html report named ../report/tax_div_report.html (-r ../report/tax_div_report.html; it can be opened in a browser) containing the whole analysis and the code run in the R programming language with the phyloseq R package. Please open the tax_div_report.html report to check the taxonomic and diversity results.
7) As a last step, the metaNanoPype logs can be gathered into a single md/html report describing the whole pipeline, including the main software, versions and references used:
report-py --help # display options
report-py -d ./ -f ../report/metaNanoPype_report -n "metaNanoPype reproducible report" -a "António Sousa et al."
The command above collects all the log files with the extension *.log in the current directory (-d ./) and builds a report named metaNanoPype_report in the folder ../report/ (-f ../report/metaNanoPype_report). In addition, the title of the report is set to "metaNanoPype reproducible report" (-n "metaNanoPype reproducible report") and the author name to "António Sousa et al." (-a "António Sousa et al.").
Support or Contact
Please open an issue for support or contact.
Acknowledgement
This project is being developed under the scope of the Open Life Science 3 program.
Project lead: António Sousa
Mentor: Hans-Rudolf Hotz
Advice & support: Ricardo Ramiro