* Name: isONcorrect
* Version: 0.1.3.5
* Home page: https://github.com/ksahlin/isONcorrect
* Summary: De novo error-correction of long-read transcriptome reads
* Upload time: 2023-09-06 23:07:54
* Author: Kristoffer Sahlin
* Requires Python: `!=3.0.*, !=3.1.*, !=3.2.*, !=3.3.*, <4`

isONcorrect
===========

isONcorrect is a tool for error-correcting Oxford Nanopore cDNA reads. It is designed to handle highly variable coverage and exon variation within reads, and achieves a median error rate of about 0.5-1% after correction. It leverages regions shared between reads from different isoforms to achieve low error rates even for low-abundance transcripts. See the [paper](https://www.nature.com/articles/s41467-020-20340-8) for details.

**Update:** Since v0.0.8, isONcorrect uses different default parameters compared to what was used in the [paper](https://www.nature.com/articles/s41467-020-20340-8). The new parameters make isONcorrect 2-3 times faster and use 3-8 times less memory with only a small cost of increased median post-correction error rate. With the new parameter setting the correction accuracy is 98.5-99.3% instead of 98.9–99.6% on the data used in the paper. Current default uses `--k 9 --w 20 --max_seqs 2000`. To invoke settings used in paper, set parameters `--k 9 --w 10 --max_seqs 1000`.
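
As a rough illustration of switching between the two parameter sets mentioned above, the sketch below maps a preset name to its flags and prints the resulting command instead of running it. The helper name `params_for`, the preset names, and the file paths are hypothetical; only the flag values are taken from the note above.

```shell
#!/bin/bash
# Sketch: map a preset name to the isONcorrect parameters quoted in the update note.
# "params_for" and the preset names are illustrative, not part of isONcorrect.
params_for() {
  case "$1" in
    paper)   echo "--k 9 --w 10 --max_seqs 1000" ;;
    default) echo "--k 9 --w 20 --max_seqs 2000" ;;
    *)       echo "unknown preset: $1" >&2; return 1 ;;
  esac
}

# Print the command so it can be inspected before running it.
# reads.fastq and out_paper are placeholder paths.
echo isONcorrect --fastq reads.fastq --outfolder out_paper $(params_for paper)
```

Remove the leading `echo` to actually run the command once the paths point at real data.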

Full-length ONT cDNA reads are processed and error-corrected by running the pipeline [pychopper](https://github.com/nanoporetech/pychopper) --> [isONclust](https://github.com/ksahlin/isONclust) --> [isONcorrect](https://github.com/ksahlin/isONcorrect). All these steps can be run in one go with [this script](https://github.com/ksahlin/isONcorrect/blob/master/scripts/correction_pipeline.sh). See below for installation and usage.


isONcorrect is distributed as a python package supported on Linux / OSX with python v>=3.4. [![Build Status](https://travis-ci.org/ksahlin/isONcorrect.svg?branch=master)](https://travis-ci.com/ksahlin/isONcorrect).

Table of Contents
=================

  * [INSTALLATION](#INSTALLATION)
    * [Using conda](#Using-conda)
    * [Using pip](#Using-pip)
    * [Downloading source from GitHub](#Downloading-source-from-github)
    * [Dependencies](#Dependencies)
    * [Testing installation](#testing-installation)
  * [USAGE](#USAGE)
    * [Running](#Running)
    * [Output](#Output)
    * [Parallelization across nodes](#Parallelization-across-nodes)
  * [CREDITS](#CREDITS)
  * [LICENCE](#LICENCE)



INSTALLATION
=================

Installation with conda typically takes about 5 minutes on a desktop computer.

## Using conda
Conda is the preferred way to install isONcorrect.

1. Create and activate a new environment called isoncorrect

```
conda create -n isoncorrect python=3.9 pip 
conda activate isoncorrect
```

2. Install isONcorrect and its dependency `spoa`.

```
pip install isONcorrect
conda install -c bioconda spoa
```
3. You should now have 'isONcorrect' installed; try it:
```
isONcorrect --help
```

Each time you start a new session on your server/computer, activate the conda environment `isoncorrect` before running isONcorrect:
```
conda activate isoncorrect
```

4. To run the complete correction pipeline, you probably also want `pychopper` and `isONclust` in the isoncorrect environment. If you have not installed them already, this can be done with:

```
pip install isONclust
conda install -c bioconda "hmmer>=3.0"
conda install -c bioconda "pychopper>=2.0"
```

You are now set to run the [correction_pipeline](https://github.com/ksahlin/isONcorrect/blob/master/scripts/correction_pipeline.sh). See [USAGE](https://github.com/ksahlin/isONcorrect#usage).
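
Before running the pipeline, it can help to confirm that the commands it calls are on your `PATH`. The following is a minimal sketch, not part of isONcorrect: `check_tools` is a hypothetical helper, and the command names are simply the ones used elsewhere in this README.

```shell
#!/bin/bash
# Report which of the given commands are missing from PATH.
# Prints one line per missing command and returns non-zero if any is missing.
check_tools() {
  missing=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing: $tool"
      missing=1
    fi
  done
  return "$missing"
}

# Command names as they appear in this README.
check_tools isONcorrect run_isoncorrect isONclust cdna_classifier.py \
  || echo "Activate the isoncorrect environment or install the missing tools."
```

If nothing is printed, all four commands were found.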


### Dependencies

isONcorrect has the following dependencies (the first three are installed automatically by `pip`):
* [edlib](https://github.com/Martinsos/edlib/tree/master/bindings/python)
* [NumPy](https://numpy.org/)
* [parasail](https://github.com/jeffdaily/parasail-python)
* [spoa](https://github.com/rvaser/spoa)


## Testing installation

You can verify successful installation by running isONcorrect on [these](https://github.com/ksahlin/isONcorrect/tree/master/test_data/isoncorrect/) two small datasets of 100 reads each. Download the two datasets, put them in a folder `test_data`, and run, e.g.,

```
isONcorrect --fastq test_data/0.fastq \
            --outfolder [output path]
```
Expected runtime for this test data is about 15 seconds. The output will be found in `[output path]/corrected_reads.fastq`, where the 100 reads have the same headers as in the original file, but with corrected sequence. The parallelized version of isONcorrect (which processes separate clusters in parallel) can be tested by running

```
./run_isoncorrect --t 3 --fastq_folder test_data/ \
                  --outfolder [output path]
```
 
This will perform correction on `0.fastq` and `1.fastq` in parallel. Expected runtime for this test data is about 15 seconds. The output will be found in `[output path]/0/corrected_reads.fastq` and `[output path]/1/corrected_reads.fastq` where the 100 reads in each separate cluster have the same headers as in the respective original files, but with corrected sequence. 
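
One way to check the claim that headers are preserved is to compare read IDs between an input file and its corrected output. This is only a sketch under two assumptions: each FASTQ record occupies exactly four lines, and the read ID is the first whitespace-delimited field of the header line. The function names are hypothetical.

```shell
#!/bin/bash
# Print the read ID (first field of the header line) of every FASTQ record.
# Assumes strict 4-line records: header, sequence, '+', qualities.
fastq_headers() {
  awk 'NR % 4 == 1 { print $1 }' "$1"
}

# Succeed when both files contain the same read IDs, regardless of order.
same_headers() {
  headers_a=$(fastq_headers "$1" | sort)
  headers_b=$(fastq_headers "$2" | sort)
  [ "$headers_a" = "$headers_b" ]
}
```

For the single-file test above, this could be used as `same_headers test_data/0.fastq "[output path]/corrected_reads.fastq" && echo "headers match"`.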


USAGE
=================
 
## Running

### Using correction_pipeline.sh

You can simply run `./correction_pipeline.sh <raw_reads.fq>  <outfolder>  <num_cores>`, which performs steps 1-5 below for you. The `correction_pipeline.sh` script is available in this repository [here](https://github.com/ksahlin/isONcorrect/blob/master/scripts/correction_pipeline.sh). Simply download the repository or the individual [correction_pipeline.sh file](https://github.com/ksahlin/isONcorrect/blob/master/scripts/correction_pipeline.sh).

For a fastq file with raw ONT cDNA reads, the following pipeline is recommended:
1.  Produce full-length reads with [pychopper](https://github.com/nanoporetech/pychopper) (a.k.a. `cdna_classifier`)
2.  Cluster the full-length reads into genes/gene-families ([isONclust](https://github.com/ksahlin/isONclust))
3.  Make fastq files of each cluster (`isONclust write_fastq` command)
4.  Correct individual clusters ([isONcorrect](https://github.com/ksahlin/isONcorrect))
5.  Join reads back to a single fastq file (optional)


### Manually

The contents of `correction_pipeline.sh` are (roughly) reproduced below. If you want more control over the individual steps than `correction_pipeline.sh` gives you (such as different parameters in each step), you can modify or remove arguments as needed in `correction_pipeline.sh` or in the script below.

```
#!/bin/bash

# Pipeline to get high-quality full-length reads from ONT cDNA sequencing

# Set path to output and number of cores
root_out="outfolder"
cores=20

mkdir -p $root_out

cdna_classifier.py  raw_reads.fq $root_out/full_length.fq -t $cores 

isONclust  --t $cores  --ont --fastq $root_out/full_length.fq \
             --outfolder $root_out/clustering

isONclust write_fastq --N 1 --clusters $root_out/clustering/final_clusters.tsv \
                      --fastq $root_out/full_length.fq --outfolder  $root_out/clustering/fastq_files 

run_isoncorrect --t $cores  --fastq_folder $root_out/clustering/fastq_files  --outfolder $root_out/correction/ 

# OPTIONAL BELOW TO MERGE ALL CORRECTED READS INTO ONE FILE
touch $root_out/all_corrected_reads.fq
OUTFILES=$root_out"/correction/"*"/corrected_reads.fastq"
for f in $OUTFILES
do 
  # Append each cluster's corrected reads to the merged file created above
  cat $f >> $root_out/all_corrected_reads.fq
done
```

isONcorrect does not need ONT reads to be full-length (i.e., produced by `pychopper`), but unless you have specific other goals, it is advised to run pychopper for any kind of downstream analysis to guarantee full-length reads. 

## Output

`run_isoncorrect` writes one file per cluster, `<outfolder>/<cluster ID>/corrected_reads.fastq`, in which the reads keep the headers of the original reads.

### Few large clusters

For some datasets, e.g. targeted data, `isONclust` can produce highly uneven clusters, i.e., a few very large clusters and some or many small ones. In such cases, runtime can be reduced by passing the argument `--split_wrt_batches` to `run_isoncorrect`.
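
To judge whether your clusters are uneven, you can list the number of reads in each cluster file. The sketch below is a hypothetical helper, assuming that the cluster folder contains one `<cluster ID>.fastq` file per cluster (as in the test data) and that each record occupies four lines.

```shell
#!/bin/bash
# Print "<read count> <file>" for every .fastq file in a folder, largest first.
# Assumes 4-line FASTQ records, so reads = lines / 4.
cluster_sizes() {
  for f in "$1"/*.fastq; do
    [ -e "$f" ] || continue   # skip when the folder has no .fastq files
    n=$(awk 'END { print int(NR / 4) }' "$f")
    echo "$n $f"
  done | sort -k1,1nr
}
```

For example, `cluster_sizes outfolder/clustering/fastq_files | head` shows the largest clusters. What counts as "very large" depends on your data; no threshold is implied here.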


## Parallelization across nodes

isONcorrect currently supports parallelization across cores on a single node (parameter `--t`), but not across several nodes. If you have access to multiple nodes, you can work around this limitation: the `run_isoncorrect` step can be parallelized across n nodes by running the following commands in parallel, one per node (in bash or another environment, e.g., snakemake):

```
run_isoncorrect --fastq_folder outfolder/clustering/fastq_files  --outfolder outfolder/correction/ --split_mod n --residual 0
run_isoncorrect --fastq_folder outfolder/clustering/fastq_files  --outfolder outfolder/correction/ --split_mod n --residual 1
run_isoncorrect --fastq_folder outfolder/clustering/fastq_files  --outfolder outfolder/correction/ --split_mod n --residual 2
...
run_isoncorrect --fastq_folder outfolder/clustering/fastq_files  --outfolder outfolder/correction/ --split_mod n --residual n-1
```
This tells each run to process only its own distinct subset of cluster IDs.
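
The commands above can be generated rather than typed by hand. The sketch below prints one command per node for a given n; `node_commands` is a hypothetical helper and the paths are placeholders. Judging by the flag names, a run presumably handles the clusters whose ID modulo `--split_mod` equals `--residual`, but that reading is an assumption, not something stated here.

```shell
#!/bin/bash
# Print one run_isoncorrect command per node: node i gets --residual i of --split_mod n.
# Arguments: <number of nodes> <fastq folder> <output folder>
node_commands() {
  n=$1
  fastq_folder=$2
  outfolder=$3
  i=0
  while [ "$i" -lt "$n" ]; do
    echo "run_isoncorrect --fastq_folder $fastq_folder --outfolder $outfolder --split_mod $n --residual $i"
    i=$((i + 1))
  done
}

# Example: commands for 4 nodes (placeholder paths).
node_commands 4 outfolder/clustering/fastq_files outfolder/correction/
```

Each printed line can then be submitted as a separate job on its own node using whatever scheduler you have available.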

CREDITS
=================

Please cite [1] when using isONcorrect.

1. Sahlin, K., Medvedev, P. Error correction enables use of Oxford Nanopore technology for reference-free transcriptome analysis. Nat Commun 12, 2 (2021). https://doi.org/10.1038/s41467-020-20340-8  [Link](https://www.nature.com/articles/s41467-020-20340-8).

LICENCE
=================

GPL v3.0, see [LICENSE.txt](https://github.com/ksahlin/isONcorrect/blob/master/LICENCE.txt).


            

        