NGSpeciesID

Name	NGSpeciesID
Version	0.3.0
home_page	https://github.com/ksahlin/NGSpeciesID
Summary	Reconstructs viral consensus sequences from a set of ONT reads.
upload_time	2023-06-26 09:10:09
maintainer
docs_url	None
author	Kristoffer Sahlin
requires_python	!=3.0.*, !=3.1.*, !=3.2.*, !=3.3.*, <4
license
keywords	viral sequeces ont oxford nanopore technologies long reads

NGSpeciesID
===========

NGSpeciesID is a tool for clustering and consensus forming of long-read amplicon sequencing data (it has been used with both PacBio and Oxford Nanopore data). The repository is a modified version of [isONclust](https://github.com/ksahlin/isONclust), to which consensus, primer-removal, and polishing features have been added.

NGSpeciesID is distributed as a Python package and is supported on Linux / OSX with Python v3.6. [![Build Status](https://travis-ci.org/ksahlin/NGSpeciesID.svg?branch=master)](https://travis-ci.org/ksahlin/NGSpeciesID)

Table of Contents
=================

  * [INSTALLATION](#INSTALLATION)
    * [Using conda](#Using-conda)
    * [Testing installation](#testing-installation)
  * [USAGE](#USAGE)
    * [Filtering and subsampling](#filtering-and-subsampling)
    * [Removing primers](#removing-primers)
    * [Output](#Output)
  * [EXAMPLE WORKFLOW](#EXAMPLE-WORKFLOW)
  * [CREDITS](#CREDITS)
  * [LICENCE](#LICENCE)



INSTALLATION
----------------

**NOTE**: If you are experiencing issues (e.g. [this one](https://github.com/rvaser/spoa/issues/26)) with the third party tools  [spoa](https://github.com/rvaser/spoa) or [medaka](https://github.com/nanoporetech/medaka) in the all-in-one installation instructions below, please install the tools manually with their respective installation instructions [here](https://github.com/rvaser/spoa#installation) and [here](https://github.com/nanoporetech/medaka#installation).  

### Using conda
Conda is the preferred way to install NGSpeciesID.

1. Create and activate a new environment called NGSpeciesID

```
conda create -n NGSpeciesID python=3.6 pip 
conda activate NGSpeciesID
```

2. Install NGSpeciesID 

```
conda install --yes -c conda-forge -c bioconda medaka==0.11.5 openblas==0.3.3 spoa racon minimap2
pip install NGSpeciesID
```
3. You should now have 'NGSpeciesID' installed; try it:
```
NGSpeciesID --help
```

Each time you start a new session on your server/computer, activate the conda environment "NGSpeciesID" before running NGSpeciesID:
```
conda activate NGSpeciesID
```
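
Before running the test in the next section, it can help to confirm that the installed executables are visible on the `PATH`. The following is a minimal sketch: the helper name `missing_tools` is hypothetical, and the executable names in the example call are assumed to match the conda package names used above.

```shell
# Print every argument that is not found as a command on the PATH.
# `missing_tools` is an illustrative helper, not part of NGSpeciesID.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Example call (executable names are assumed, see the note above):
# missing_tools NGSpeciesID spoa racon minimap2 medaka
```

An empty result means every listed command was found in the active environment.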



### Testing installation

0. Activate conda environment
```
conda activate NGSpeciesID
```

1. Make a new directory and navigate to it
```
mkdir test_ngspeciesID
cd test_ngspeciesID
```

2. Download the test fastq file called "sample_h1.fastq" (filesize 390kb)

```
curl -LO https://raw.githubusercontent.com/ksahlin/NGSpeciesID/master/test/sample_h1.fastq
```

3. Run NGSpeciesID on the test file. Outputs will be saved in "/test_ngspeciesID/sample_h1/", and the final polished consensus file ("consensus.fasta") is located in the "/test_ngspeciesID/sample_h1/medaka_cl_id_<cluster number>" directory.

```
NGSpeciesID --ont --fastq sample_h1.fastq --outfolder ./sample_h1 --consensus --medaka
```
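
Since the cluster number is not known in advance, one way to locate the polished consensus after the run is to search the output folder. This is a sketch only; `list_consensus` is a hypothetical helper and it relies solely on the `medaka_cl_id_<cluster number>` folder naming described in step 3.

```shell
# List every polished consensus file below an NGSpeciesID output folder.
# Only files inside a medaka_cl_id_<n> folder are reported.
list_consensus() {
    find "$1" -type f -name consensus.fasta -path '*medaka_cl_id_*' | sort
}

# Example call after the test run:
# list_consensus ./sample_h1
```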


USAGE
-------

NGSpeciesID needs a fastq file generated by an Oxford Nanopore basecaller.

```
NGSpeciesID --ont --consensus --medaka --fastq [reads.fastq] --outfolder [/path/to/output] 
```
The argument `--ont` simply means `--k 13 --w 20`. These arguments can be set manually without the `--ont` flag. Specify number of cores with `--t`. 


NGSpeciesID can also run with racon as polisher. For example

```
NGSpeciesID --ont --consensus --racon --racon_iter 3 --fastq [reads.fastq] --outfolder [/path/to/output] 
```
will polish the consensus sequences with racon three times.

### Filtering and subsampling

NGSpeciesID filters reads by quality based on read Phred scores. However, we also recommend removing reads much shorter or longer than the intended target, which often represent chimeras or contamination. This can be done by specifying `--m` (intended target length) and `--s` (maximum deviation from target length). NGSpeciesID can also subsample reads using the parameter `--sample_size`. Altogether, if we want to filter out reads outside the length interval [700,800] and use a subset of 300 reads (if the dataset contains more reads), we could run

```
NGSpeciesID --ont --sample_size 300 --m 750 --s 50 --consensus --medaka --fastq [reads.fastq] --outfolder [/path/to/output]
```

By default, length filtering and subsampling are not invoked if parameters are not specified.
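
To judge how many reads a given `--m`/`--s` pair would keep, the length window can be checked on the raw file beforehand. The sketch below is not how NGSpeciesID filters internally; it assumes 4-line FASTQ records and inclusive bounds `[m - s, m + s]`, and `count_reads_in_window` is a hypothetical helper name.

```shell
# Count reads whose length lies in [m - s, m + s] (bounds inclusive).
# Assumes 4-line FASTQ records, so the sequence is line 2 of each record.
# Usage: count_reads_in_window <reads.fastq> <m> <s>
count_reads_in_window() {
    awk -v m="$2" -v s="$3" '
        NR % 4 == 2 {
            len = length($0)
            if (len >= m - s && len <= m + s) kept++
        }
        END { print kept + 0 }
    ' "$1"
}

# Example call matching the command above:
# count_reads_in_window reads.fastq 750 50
```

If the count is well below the chosen `--sample_size`, the window may be too narrow for the dataset.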

### Removing primers

If customized primers are expected in the reads, they can be detected and removed. The primer file is expected to be in fasta format. Here is an example of a primer file:

```
>MCB869_ONT_R
CGATCAATCCCCTAACAAACTAGG
>MCB398_ONT_F
TACCATGAGGACAAATATCATTCTG
```
NGSpeciesID searches for primers in a window of X bp (a parameter; default 150 bp) at the beginning and end of each consensus.


Trimming of primers is performed after consensus forming and can be invoked as
```
NGSpeciesID --ont --consensus --medaka --fastq [reads.fastq] --outfolder [/path/to/output] --primer_file [primers.fa]
```
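
A malformed primer file is easy to catch before a run. The check below is a sketch under a stricter assumption than the fasta format itself requires: each primer occupies exactly one header line followed by one sequence line, as in the example primer file above. `check_primer_fasta` is a hypothetical helper, not an NGSpeciesID feature.

```shell
# Succeed only if the file alternates '>' header lines and single sequence
# lines made of IUPAC nucleotide letters. Blank lines are rejected.
check_primer_fasta() {
    awk '
        NR % 2 == 1 && !/^>./ { bad = 1 }
        NR % 2 == 0 && !/^[ACGTURYSWKMBDHVNacgturyswkmbdhvn]+$/ { bad = 1 }
        END { exit (bad || NR == 0 || NR % 2 == 1) }
    ' "$1"
}

# Example call:
# check_primer_fasta primers.fa && echo "primer file looks fine"
```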

`NGSpeciesID` can also remove universal tails. Trimming of tails is performed after consensus forming and can be invoked as

```
NGSpeciesID --ont --consensus --medaka --fastq [reads.fastq] --outfolder [/path/to/output] --remove_universal_tails
```

The two options are mutually exclusive, i.e., only one of them can be run.

### Output

The output consists of the polished consensus sequences along with some information about clustering.

* Polished consensus sequence(s). A folder named `medaka_cl_id_X` (or `racon_cl_id_X` when racon is used) is created for each predicted consensus. Each such folder contains a file `consensus.fasta`, which is the final output of NGSpeciesID.
* Draft spoa consensus sequences of each of the clusters are given as `consensus_reference_X.fasta` (where X is a number).
* The final cluster information is given in a tsv file `final_clusters.tsv` present in the specified output folder.


In the cluster TSV-file, the first column is the cluster ID and the second column is the read accession. For example:

```
0 read_X_acc
0 read_Y_acc
...
n read_Z_acc
```
There is one row per read, so n reads give n rows. Some reads might be singletons, i.e. the only member of their cluster. The rows are ordered with respect to the size of the cluster (largest first).
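
Because the cluster ID is the first column, cluster sizes can be tallied directly from `final_clusters.tsv`. A minimal sketch, assuming whitespace-separated columns as described above (`cluster_sizes` is a hypothetical helper name):

```shell
# Print "<cluster ID> <number of reads>" for each cluster, largest first.
# Ties are broken by ascending cluster ID.
cluster_sizes() {
    awk '{ n[$1]++ } END { for (id in n) print id, n[id] }' "$1" |
        sort -k2,2nr -k1,1n
}

# Example call:
# cluster_sizes /path/to/output/final_clusters.tsv
```

Clusters reported with a size of 1 are the singletons.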


EXAMPLE WORKFLOW
-----------------

The bioinformatics workflow below was developed as part of a step-by-step protocol for field-deployable DNA amplicon sequencing with the Oxford Nanopore Technologies MinION. The full protocol manuscript is in submission; a link will be posted here when available. The steps below correspond to step numbers in the protocol.

#### P2 | Generate custom indexes for uniquely identifying samples using [`barcode_generator`](https://github.com/lcomai/barcode_generator). This software uses Python3.

```
python3 barcode_generator_3.4.py none 24 40 8
```

Here, the parameters are set as:
- table_excluded_barcodes = 'none'
- index length = 24 base pairs
- number of barcodes to generate = 40
- hamming distance = 8
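
The Hamming distance between two equal-length indexes is the number of positions at which they differ. The sketch below computes it for one pair, which can be used to spot-check generated indexes; `hamming` is a hypothetical helper and is independent of `barcode_generator`.

```shell
# Print the number of mismatching positions between two equal-length strings.
# Returns a non-zero status if the lengths differ.
hamming() {
    if [ "${#1}" -ne "${#2}" ]; then
        echo "hamming: sequences differ in length" >&2
        return 1
    fi
    awk -v a="$1" -v b="$2" 'BEGIN {
        d = 0
        for (i = 1; i <= length(a); i++)
            if (substr(a, i, 1) != substr(b, i, 1)) d++
        print d
    }'
}

# Example call with two made-up 8 bp indexes:
# hamming ACGTACGT ACGTTCGA
```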

After lab steps are complete:

#### B1 | Basecalling and quality check (optional) with [Guppy](https://community.nanoporetech.com/downloads)

These commands use the fast basecalling model from Guppy.

Basecalling for R9.4 flow cell:

```
guppy_basecaller --input_path minKNOW_input/ --save_path basecalled_fastqs/ -c dna_r9.4.1_450bps_fast.cfg --recursive --disable_pings
```

Basecalling with read filtering by quality score (here, set to 7):

```
guppy_basecaller --input_path minKNOW_input/ --save_path basecalled_fastqs/ -c dna_r9.4.1_450bps_fast.cfg --recursive --disable_pings --min_qscore 7
```

Basecalling for R10.3 flow cell:

```
guppy_basecaller --input_path minKNOW_input/ --save_path basecalled_fastqs/ -c dna_r10.3_450bps_fast.cfg --recursive --disable_pings
```
 
#### B2 | Go to folder with the fastq files generated by Guppy

#### B3 | Concatenate all the read files into one large file

```
cat *.fastq > sequencing_reads.fastq
```
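
A quick sanity check on the concatenated file is to count its reads. The sketch below assumes 4-line FASTQ records (header, sequence, `+`, qualities); `fastq_read_count` is a hypothetical helper name.

```shell
# Print the number of reads in a FASTQ file with 4-line records.
# Fails if the file is unreadable or the line count is not a multiple of 4,
# which usually points to a truncated or corrupted file.
fastq_read_count() {
    [ -r "$1" ] || return 1
    lines=$(wc -l < "$1")
    lines=$((lines))
    if [ $((lines % 4)) -ne 0 ]; then
        echo "$1: line count $lines is not a multiple of 4" >&2
        return 1
    fi
    echo $((lines / 4))
}

# Example call:
# fastq_read_count sequencing_reads.fastq
```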

#### B4 | Check raw read quality/stats with [NanoPlot](https://github.com/wdecoster/NanoPlot)

```
NanoPlot --fastq_rich sequencing_reads.fastq -o sequencing_run -p sequencing_run
```

#### B5 | Demultiplexing of the sequencing data with [minibar](https://github.com/calacademy-research/minibar) or Guppy

Example files can be found in:
- [Supplementary Data 1](./test/Supplementary_File1_reads.fastq): 3,000 reads in fastq format from three fish species - Atlantic cod (*Gadus morhua*), Haddock (*Melanogrammus aeglefinus*), and Whiting (*Merlangius merlangus*) - sequenced on a Flongle flow cell.
- [Supplementary Data 2](./test/Supplementary_File2_minibar.txt): index file used for demultiplexing with minibar

Supplementary Data 1 can be used as `sequencing_reads.fastq` and Supplementary Data 2 as `indexes.txt`.

#### B5a | minibar (using example files):

```
python minibar.py indexes.txt sequencing_reads.fastq -T -F -e 3 -E 11
```

Here, the edit distance allowed between indexes (`-e`) is set to 3 base pairs and the edit distance allowed between primer sequences (`-E`) is set to 11 base pairs.

#### B5b | Guppy:

```
guppy_barcoder -i sequencing_reads.fastq -s demultiplex_folder --trim_barcodes --disable_pings
```

#### B6 | Read filtering, clustering, consensus generation and polishing with NGSpeciesID

For a single sample (using example primer file):

```
NGSpeciesID --ont --consensus --sample_size 500 --m 800 --s 100 --medaka --primer_file primers.txt --fastq barcode0.fastq --outfolder barcode0_consensus
```

Here, the parameters are set as:
- the data is from ONT MinION (`--ont`)
- we want to generate consensus sequences (`--consensus`)
- subsample of reads (`--sample_size`) = 500 reads subsampled per sample to analyze 
- intended target length (`--m`) = 800 base pairs
- maximum deviation from target length (`--s`) = 100 base pairs
- use [Medaka](https://github.com/nanoporetech/medaka) to polish the final consensus sequences (`--medaka`)
- if a `--primer_file` is given, NGSpeciesID will check to remove any remaining primer sequence. The example primer file is available in [Supplementary Data 3](./test/Supplementary_File3_primer.txt). The primers were developed in Mikkelsen, P.M., Bieler, R., Kappner, I., & Rawlings, T.A. (2006). Phylogeny of Veneroidea (Mollusca: Bivalvia) based on morphology and molecules. *Zoological Journal of the Linnean Society*, 148(3), 439-521.
- the input file of demultiplexed reads is specified by `--fastq` (output from step B5)
- the output consensus files will be saved to `--outfolder`

To run this step on **more than one sample**, use a bash script with a for loop:

```
for file in *.fastq; do
    # Strip the directory and the .fastq extension to name the output folder
    bn=$(basename "$file" .fastq)
    NGSpeciesID --ont --consensus --sample_size 500 --m 800 --s 100 --medaka --primer_file primers.txt --fastq "$file" --outfolder "${bn}"
done
```

This loop uses the wildcard `*` to indicate you want to analyze all files with the `.fastq` extension and assumes the command is run from the directory that contains the read files (if not, be sure to change the file path: `path/to/*.fastq`). 

This loop code can be entered at a UNIX/Mac terminal (be sure the spacing/indentation is correct) or saved as a script (see [`consensus.sh`](./test/consensus.sh)). The script should be run from the terminal and in the directory that contains the read files as:

```
./consensus.sh
```
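
Step B7 queries a single FASTA file per sample, whereas NGSpeciesID writes one `consensus.fasta` per cluster folder. One possible way to bridge the two is sketched below. It assumes the `medaka_cl_id_X` / `racon_cl_id_X` folder naming from the Output section; the helper name `collect_consensus` and the choice to prefix each header with the folder name are illustrative, not part of the protocol.

```shell
# Concatenate all cluster consensus sequences of one sample to stdout,
# prefixing each FASTA header with its cluster folder name so that
# records from different clusters stay distinguishable.
collect_consensus() {
    for f in "$1"/*_cl_id_*/consensus.fasta; do
        [ -f "$f" ] || continue
        cluster=$(basename "$(dirname "$f")")
        sed "s/^>/>${cluster}|/" "$f"
    done
}

# Example call for the single-sample run above:
# collect_consensus barcode0_consensus > barcode0_consensus.fasta
```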

#### B7 | Compare consensus sequences to reference database with [BLAST](https://blast.ncbi.nlm.nih.gov/Blast.cgi?PAGE_TYPE=BlastDocs&DOC_TYPE=Download)

- Create/format database for BLAST search:

```
makeblastdb -in database.fasta -dbtype nucl -out database
```

- Conduct BLAST search:

```
blastn -db database -query barcode0_consensus.fasta -outfmt 6 -out barcode0_consensus_blast.out
```
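
With `-outfmt 6`, BLAST writes one tab-separated row per hit, and in the default column layout (qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore) the bit score is column 12. A minimal sketch for keeping only the best-scoring hit per consensus follows; `best_hits` is a hypothetical helper and it assumes the default columns were not changed.

```shell
# Keep the row with the highest bit score (column 12) for each query
# in BLAST tabular output, sorted by query ID.
best_hits() {
    awk -F '\t' '
        !($1 in best) || $12 + 0 > best[$1] { best[$1] = $12 + 0; row[$1] = $0 }
        END { for (q in row) print row[q] }
    ' "$1" | sort
}

# Example call:
# best_hits barcode0_consensus_blast.out
```

When two hits for the same query share the top bit score, the first one in the file is kept.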

Check the results and refine the search or database as needed to better determine the identity of your samples!



CREDITS
----------------

Please cite [1] when using NGSpeciesID.

1. Sahlin, K, Lim, MCW, Prost, S. NGSpeciesID: DNA barcode and amplicon consensus generation from long‐read sequencing data. Ecol Evol. 2021; 00: 1– 7. https://doi.org/10.1002/ece3.7146



LICENCE
----------------

GPL v3.0, see [LICENSE.txt](https://github.com/ksahlin/NGSpeciesID/blob/master/LICENCE.txt).

Raw data

{
    "_id": null,
    "home_page": "https://github.com/ksahlin/NGSpeciesID",
    "name": "NGSpeciesID",
    "maintainer": "",
    "docs_url": null,
    "requires_python": "!=3.0.*, !=3.1.*, !=3.2.*, !=3.3.*, <4",
    "maintainer_email": "",
    "keywords": "viral sequeces ONT Oxford Nanopore Technologies long reads",
    "author": "Kristoffer Sahlin",
    "author_email": "ksahlin@math.su.se",
    "download_url": "https://files.pythonhosted.org/packages/1b/96/2d8f488d490521cbd5f8ecd7ae566d2a0aacb9b059beb5e2f2c3387211ff/NGSpeciesID-0.3.0.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "",
    "summary": "Reconstructs viral consensus sequences from a set of ONT reads.",
    "version": "0.3.0",
    "project_urls": {
        "Homepage": "https://github.com/ksahlin/NGSpeciesID"
    },
    "split_keywords": [
        "viral",
        "sequeces",
        "ont",
        "oxford",
        "nanopore",
        "technologies",
        "long",
        "reads"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "1b962d8f488d490521cbd5f8ecd7ae566d2a0aacb9b059beb5e2f2c3387211ff",
                "md5": "c01af9a246c9bcca01378b194c2c21a0",
                "sha256": "71663ce280220d4e692cc6c3aea44f91e40b3143ed09a9462bdf2feb1d94aa9f"
            },
            "downloads": -1,
            "filename": "NGSpeciesID-0.3.0.tar.gz",
            "has_sig": false,
            "md5_digest": "c01af9a246c9bcca01378b194c2c21a0",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": "!=3.0.*, !=3.1.*, !=3.2.*, !=3.3.*, <4",
            "size": 561299,
            "upload_time": "2023-06-26T09:10:09",
            "upload_time_iso_8601": "2023-06-26T09:10:09.994774Z",
            "url": "https://files.pythonhosted.org/packages/1b/96/2d8f488d490521cbd5f8ecd7ae566d2a0aacb9b059beb5e2f2c3387211ff/NGSpeciesID-0.3.0.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2023-06-26 09:10:09",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "ksahlin",
    "github_project": "NGSpeciesID",
    "travis_ci": true,
    "coveralls": false,
    "github_actions": false,
    "requirements": [],
    "lcname": "ngspeciesid"
}
