Introduction

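metaNanoPype is a modular, Python-based pipeline for reproducible analysis of nanopore metabarcoding data. Its modules, each demonstrated in the Tutorial below, cover quality assessment, read filtering and trimming, taxonomic assignment, taxonomic and diversity analysis, and report generation.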

Depending on the step, a module can offer more than one option (i.e., tool or algorithm) to give the user flexibility. Each step generates a log file with the information needed to build an optional report in HTML format that describes the results, the versions of the software used, and the scripts and references for that software.




Installation

metaNanoPype is a modular pipeline built on Python scripts that wrap other tools and algorithms, which need to be installed on your OS.

In general, it requires:


In addition, each module step requires several third-party, stand-alone tools installed on your system:


An easier way to install all the software and dependencies is to use the environment.yaml file provided in the metaNanoPype repository to create a conda environment with everything included. To do so, run the following (you need to have git and miniconda3 installed):

    git clone https://github.com/antonioggsousa/metaNanoPype.git # download metaNanoPype repo
    
    export PATH="$PWD/metaNanoPype/bin:$PATH" # make it accessible in your PATH

    conda activate # activate conda

    conda env create -f ./metaNanoPype/environment.yaml # create conda env with all soft/dependencies

    conda activate metaNanoPype # activate new conda env
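
After activating the environment, it can be worth confirming that the metaNanoPype scripts are reachable from your PATH. The snippet below is a minimal sketch rather than part of metaNanoPype: `check_tools` is a hypothetical helper name, and the script names in the example are the ones used in the Tutorial.

```shell
# Hypothetical helper (not shipped with metaNanoPype): print every command
# from the argument list that cannot be found on PATH.
# Returns non-zero if at least one command is missing.
check_tools() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            printf '%s\n' "$tool"
            missing=1
        fi
    done
    return "$missing"
}

# Example (script names taken from the Tutorial):
# check_tools fastqc-py filter_fastq-py tax_assign-py tax_div-py report-py
```

If the function prints nothing, all listed commands were found. Any name it prints usually means the `export PATH=...` line was run from a different directory or the conda environment is not active.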




Tutorial

This quick-start tutorial uses the publicly available nanopore full-length 16S rRNA amplicon sequences published in Low et al. 2021: Evaluation of full-length nanopore 16S sequencing for detection of pathogens in microbial keratitis.


This tutorial was performed on a server running Ubuntu 18.04.5 LTS with 20 cores and ~120 GB of RAM. Most of the commands used can be reproduced on any Linux distribution or other UNIX-like OS. You will start this tutorial by downloading the metaNanoPype GitHub repository and creating the conda environment with all the dependencies using miniconda3.


0) Create the directory structure to reproduce the tutorial and scripts:

Change into the directory where you want to reproduce the tutorial and create the directory structure:

    mkdir bin scripts data results report

Download metaNanoPype scripts:

    cd bin 

    git clone https://github.com/antonioggsousa/metaNanoPype.git 

Next, add the metaNanoPype bin folder to your PATH:

    export PATH="$PWD/metaNanoPype/bin:$PATH"

Finally, install all the software and dependencies needed to work with metaNanoPype and run this tutorial using miniconda3 (please install miniconda3 first if you don’t have it already installed - see instructions):

    conda activate # activate conda

    conda env create -f ./metaNanoPype/environment.yaml # create conda env with all soft/dependencies

    conda activate metaNanoPype # activate new conda env



1) Download the full-length 16S rRNA amplicon fastq files:

The fastq files were deposited in ENA (European Nucleotide Archive) under the project accession number: PRJEB37709.

The files can be downloaded in several ways. A convenient one is the ENA toolkit enaBrowserTools, which provides the enaGroupGet command. Clone the toolkit from GitHub and then download the data (follow the steps below).

    git clone https://github.com/enasequence/enaBrowserTools.git

Next, add the enaBrowserTools python3 folder to your PATH:

    export PATH="$PWD/enaBrowserTools/python3:$PATH"

Download the PRJEB37709 project fastq sequencing data (note: this will take a while):

    cd ../data #change into data folder first

    enaGroupGet -g read -f fastq PRJEB37709 

Move all the individual fastq files into a single new folder and delete the original folder (which holds each fastq file inside a sample-specific folder) to work more conveniently with the files:

    mkdir fastq

    mv ./PRJEB37709/*/*.fastq.gz ./fastq

    rm -rf ./PRJEB37709

Remove the fastq file ERR4836977.fastq.gz because fastqc raises an exception when processing it. We’ll take a deeper look into it later. For now, just remove it:

    rm ./fastq/ERR4836977.fastq.gz
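
When a downloaded file makes a tool fail, one quick check is whether the gzip archive itself is intact. The sketch below lists every `.fastq.gz` file in a folder that fails `gzip -t`. It is only an illustration: `find_corrupt_gz` is a hypothetical helper name, and a damaged archive is just one possible cause of such errors, not a confirmed explanation for ERR4836977.

```shell
# Hypothetical helper: print the path of every .fastq.gz file in a directory
# that fails the gzip integrity test. Prints nothing if all archives pass.
find_corrupt_gz() {
    dir=$1
    for f in "$dir"/*.fastq.gz; do
        [ -e "$f" ] || continue   # skip the literal pattern when nothing matches
        if ! gzip -t "$f" 2>/dev/null; then
            printf '%s\n' "$f"
        fi
    done
}

# Example: find_corrupt_gz ./fastq
```

An archive that passes this test can still contain a malformed FASTQ record, so an empty result does not rule out problems in the file content.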



2) Assess the quality of the nanopore full-length 16S rRNA amplicon sequences with fastqc, NanoPlot and multiqc in one command-line with the fastqc-py metaNanoPype script:

Change into the scripts directory so that the logs are saved there:

    cd ../scripts

    fastqc-py --help # display options

    fastqc-py -f ../data/fastq -n -t 10 -o ../results/QC

The command above takes as input all the fastq files in the folder ../data/fastq (-f option), also runs NanoPlot (-n), uses 10 threads (-t 10), and saves the output in the folder ../results/QC (-o option).

You can inspect the quality of the nanopore 16S sequences by looking at the individual html files produced by fastqc, the aggregated html report produced by multiqc, and the report produced by NanoPlot, all under ../results/QC. Together they give a good picture of the quality of your data.



3) The next step consists of filtering and trimming low-quality nanopore sequences using porechop and NanoFilt in one command-line with the filter_fastq-py metaNanoPype script:

    filter_fastq-py --help # display options
    
    filter_fastq-py -f ../data/fastq -t 10 -min_len 1000 -max_len 1800 -qs 10 -o ../data/trim    

The command above filters and trims the fastq files in the folder ../data/fastq (-f option) using 10 threads (-t 10, passed to Porechop), discarding reads shorter than 1000 bp (-min_len 1000) or longer than 1800 bp (-max_len 1800) as well as reads with a quality score lower than 10 (-qs 10). The resulting good-quality, full-length nanopore reads are saved at ../data/trim (-o ../data/trim). Find more options with the filter_fastq-py --help command.

The previous command first runs Porechop, creating fastq files named <sample_name>_adapter_clipped.fastq, which are then processed by NanoFilt to produce the files <sample_name>_trimmed.fastq. Since we are only interested in the latter, let’s discard the adapter-clipped files:

    rm ../data/trim/*_adapter_clipped.fastq
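
To see how many reads survived the length and quality filters, you can count the records in each file. The sketch below assumes the usual four-lines-per-record FASTQ layout. `count_reads` is a hypothetical helper name, and the commented loop assumes that each `<sample_name>` matches the name of the corresponding raw file in ../data/fastq.

```shell
# Hypothetical helper: print the number of reads in a plain or gzip-compressed
# FASTQ file, assuming four lines per record.
count_reads() {
    case $1 in
        *.gz) lines=$(gzip -dc "$1" | wc -l) ;;
        *)    lines=$(wc -l < "$1") ;;
    esac
    echo $((lines / 4))
}

# Example: sample name, raw read count and filtered read count per sample.
# for f in ../data/trim/*_trimmed.fastq; do
#     s=$(basename "$f" _trimmed.fastq)
#     printf '%s\t%s\t%s\n' "$s" \
#         "$(count_reads "../data/fastq/$s.fastq.gz")" "$(count_reads "$f")"
# done
```

A sample that loses nearly all of its reads may need more lenient -min_len, -max_len or -qs values.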



4) Re-assess the quality of the reads after filtering/trimming them with the fastqc-py metaNanoPype script:

    fastqc-py -f ../data/trim -n -t 10 -o ../results/QC_trim



5) Perform taxonomic assignment with kraken2, including the download of an indexed 16S rRNA reference database, in one command-line with the tax_assign-py metaNanoPype script:

    tax_assign-py --help # display options

    # create variables to simplify
    FASTQ=$(ls ../data/trim/*_trimmed.fastq | xargs echo | sed 's/ /,/g')
    SAMPLES=$(echo $FASTQ | sed 's/..\/data\/trim\/\|_trimmed.fastq//g')
    DB=../data/SILVA_DB

    tax_assign-py -i $FASTQ -t 10 -db $DB -db_down silva -s $SAMPLES -r -o ../results/tax

The command above downloads the indexed SILVA 16S rRNA database (-db_down silva option) into the directory ../data/SILVA_DB (-db $DB option), which serves as the reference against which the nanopore 16S reads in ../data/trim/*_trimmed.fastq are classified (-i $FASTQ, a comma-separated list of fastq file paths) using 10 threads (-t 10). The results are written to the output folder ../results/tax (-o ../results/tax option), with files named after the sample names given with -s $SAMPLES (a comma-separated list of sample names in the same order as the input fastq files). Each output file is named <sample_name>.out. Since the option -r was provided, a kraken2 report file named <sample_name>.report is also created.
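
The `sed` expression used above to derive SAMPLES relies on the `\|` alternation, which may not be supported by non-GNU versions of `sed`. As a hedged alternative, the sketch below builds both comma-separated lists with a plain loop and `basename`. `build_lists` is a hypothetical helper name, not part of metaNanoPype.

```shell
# Hypothetical helper: set FASTQ to a comma-separated list of the
# *_trimmed.fastq files in a directory and SAMPLES to the matching
# sample names, in the same order.
build_lists() {
    dir=$1
    FASTQ=""
    SAMPLES=""
    for f in "$dir"/*_trimmed.fastq; do
        [ -e "$f" ] || continue   # skip the literal pattern when nothing matches
        FASTQ="${FASTQ:+$FASTQ,}$f"
        SAMPLES="${SAMPLES:+$SAMPLES,}$(basename "$f" _trimmed.fastq)"
    done
}

# Example: build_lists ../data/trim
```

Both lists come from the same loop, so the sample names stay in the same order as the input files, which is what the -s option expects.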

Inspect the kraken2 taxonomic assignments at: ../results/tax
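
For a quick look at a single sample, the kraken2 report can be summarised from the command line. The sketch below assumes the standard six-column, tab-separated kraken2 report layout (percentage, clade reads, directly assigned reads, rank code, taxonomy ID, indented name) and prints the most abundant genera. `top_genera` is a hypothetical helper name.

```shell
# Hypothetical helper: print the N most abundant genera (rank code "G") from
# a kraken2 report as "clade_reads<TAB>name", sorted by decreasing read count.
# N defaults to 10.
top_genera() {
    report=$1
    n=${2:-10}
    awk -F '\t' '$4 == "G" { name = $6; sub(/^ +/, "", name); print $2 "\t" name }' "$report" |
        sort -t "$(printf '\t')" -k1,1nr |
        head -n "$n"
}

# Example: top_genera ../results/tax/<sample_name>.report 5
```

Replace `<sample_name>` with one of the names passed to -s. Changing `"G"` to another rank code, such as `"F"` for family, summarises the report at that level instead.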



6) Next, run the taxonomic and diversity analyses with phyloseq, in one command-line with the tax_div-py metaNanoPype script:

    tax_div-py --help # display options

    tax_div-py -d ../results/tax/ -o ../results/div -r ../report/tax_div_report.html

The command above searches for all the files with the extension .report under the directory ../results/tax/ (-d ../results/tax/ option) and saves the taxonomic and diversity analysis results into the output folder ../results/div (-o ../results/div). In addition, it builds an html report named ../report/tax_div_report.html (-r ../report/tax_div_report.html), which can be opened in a browser and contains the whole analysis and the code run in the R programming language with the phyloseq R package. Please open the tax_div_report.html report to check the taxonomic and diversity results.



7) As a last step, the metaNanoPype logs can be gathered into a single md/html report describing the whole pipeline, including the main software, versions and references used:

    report-py --help # display options

    report-py -d ./ -f ../report/metaNanoPype_report -n "metaNanoPype reproducible report" -a "António Sousa et al."

The command above gathers all the log files with the extension .log in the current directory (-d ./) and builds a report named metaNanoPype_report in the folder ../report/ (-f ../report/metaNanoPype_report). In addition, the title of the report will be "metaNanoPype reproducible report" (-n "metaNanoPype reproducible report") and the author name "António Sousa et al." (-a "António Sousa et al.").




Support or Contact

Please open an issue for support or contact.




Acknowledgement

This project is being developed under the scope of the Open Life Science 3 program.


Project lead: António Sousa

Mentor: Hans-Rudolf Hotz

Advice & support: Ricardo Ramiro