# NanoPreP: a fully-equipped, fast, and memory-efficient pre-processor for ONT transcriptomic data
## Requirements
* Python (>= 3.7)
* edlib (>=1.3.8)
## Getting started
**Option 1:** use `git clone` and run NanoPreP as a Python module without installation
```
git clone https://github.com/Woodformation1136/NanoPreP.git
cd NanoPreP
python -m NanoPreP --help
```
**Option 2:** use pip install
```
pip install nanoprep-ffm
nanoprep --help
```
## General usage
NanoPreP optimizes adapter/primer identification parameters for each input file. During optimization, NanoPreP searches for the combination of (1) the adapter/primer substring used for alignment, (2) the sequence similarity cutoff, and (3) the aligned location cutoff that achieves the highest $F_{\beta}$ score in distinguishing real adapter/primer alignments from random alignments.
The $\beta$ value in the $F_{\beta}$ score formula greatly affects NanoPreP's behavior. The recommended range of $\beta$ is 0.1 to 0.3; a smaller $\beta$ weights precision more heavily than recall and therefore lowers the chance of accepting random alignments.
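To make the effect of $\beta$ concrete, here is a minimal sketch of the standard $F_{\beta}$ formula, $F_{\beta} = (1+\beta^2) \cdot P \cdot R / (\beta^2 \cdot P + R)$, applied to two candidate cutoff settings. The precision/recall numbers are invented for illustration, and treating real adapter/primer alignments as the positive class is an assumption; this is not NanoPreP's internal code.

```python
def f_beta(precision: float, recall: float, beta: float) -> float:
    """Standard F-beta score: (1 + b^2) * P * R / (b^2 * P + R)."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1.0 + b2) * precision * recall / (b2 * precision + recall)


# Two hypothetical cutoff settings (numbers invented for illustration).
strict = {"precision": 0.99, "recall": 0.60}   # few random alignments accepted
lenient = {"precision": 0.80, "recall": 0.95}  # more adapters found, more false hits

for beta in (0.1, 1.0):
    print(
        f"beta={beta}: "
        f"strict={f_beta(beta=beta, **strict):.3f}, "
        f"lenient={f_beta(beta=beta, **lenient):.3f}"
    )
```

With $\beta = 0.1$ the strict setting scores higher, while with $\beta = 1$ the lenient one does, which is why a small $\beta$ pushes the optimization toward cutoffs that reject random alignments.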
A typical NanoPreP command to obtain ***high-quality, non-chimeric, full-length, strand-reoriented, adapter/primer-removed, polyA-removed*** reads:
```
nanoprep \
--input_fq input.fq \
--beta 0.1 \
--p5_sense 5_PRIMER_SEQUENCE \
--p3_sense A{100}3_PRIMER_SEQUENCE \
--trim_adapter \
--trim_poly \
--output_full_length output.fq \
--report report.json
```
- `--input_fq input.fq` ← input FASTQ file containing the raw reads
- `--beta 0.1` ← optimize adapter/primer identification parameters using $F_{0.1}$ score
- `--p5_sense 5_PRIMER_SEQUENCE` ← 5' primer sequence in sense strand direction
- `--p3_sense A{100}3_PRIMER_SEQUENCE` ← expected length of polyA + 3' primer sequence in sense strand direction (see section [How to specify adapter/primer and polyA/T sequences](#HOWTO))
- `--output_full_length output.fq` ← write full-length reads to `output.fq`
- `--report report.json` ← write details of the run to `report.json`
<!-- TODO: why annotate reads? re-usable, time-saving, transparency, flexibility -->
After running this command, two output files `output.fq` and `report.json` will be written to your working directory.
The `report.json` file records the start/stop times, the parameters used, and detailed information about the input FASTQ file.
The `output.fq` file contains the full-length reads processed by NanoPreP. For each processed read, NanoPreP appends information about the read to the ID line (the line starting with `@`):
```
@read_1 strand=0.91 full_length=1 fusion=0 ploc5=0 ploc3=0 poly5=0 poly3=-20
AGAGGCTGGCGGGAACGGGC......TTTCAAAGCCAGGCGGATTC
+
+,),+'$)'%671*%('&$%......((&'(*($%$&%&$-((84*
```
As shown in the example above, several flags are used to describe a read:
|flag|regex|default|explanation|
|:-|-|-|:-|
|`strand`|-?\d+\.\d*|0|0: unknown; > 0: sense; < 0: antisense|
|`full_length`|[0\|1]|0|0: non-full-length; 1: full-length|
|`fusion`|[0\|1]|0|0: non-chimeric/-fusion; 1: chimeric/fusion|
|`ploc5`|-?\d+|-1|-1: unknown; 0: removed; > 0: 5' adapter/primer location|
|`ploc3`|-?\d+|-1|-1: unknown; 0: removed; > 0: 3' adapter/primer location|
|`poly5`|-?\d+|-1|0: unknown; > 0: 5' polymer length; < 0: trimmed 5' polymer length|
|`poly3`|-?\d+|-1|0: unknown; > 0: 3' polymer length; < 0: trimmed 3' polymer length|
According to the flags, the example read `read_1` is a sense-strand (`strand=0.91`), full-length (`full_length=1`), non-chimeric (`fusion=0`), adapter/primer-removed (`ploc5=0 ploc3=0`), and polyA-removed (`poly3=-20`) read.
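Because the flags are plain `key=value` tokens on the ID line, they are easy to read back in downstream scripts. The following is a minimal sketch of such a parser; the function names `parse_id_line` and `_convert` are hypothetical helpers, not part of NanoPreP.

```python
from typing import Dict, Tuple, Union

FlagValue = Union[int, float, str]


def _convert(value: str) -> FlagValue:
    """Convert a flag value to int if possible, else float, else keep the string."""
    for cast in (int, float):
        try:
            return cast(value)
        except ValueError:
            pass
    return value


def parse_id_line(line: str) -> Tuple[str, Dict[str, FlagValue]]:
    """Split an annotated FASTQ ID line into (read name, flag dictionary)."""
    fields = line.strip().lstrip("@").split()
    name = fields[0]
    flags: Dict[str, FlagValue] = {}
    for field in fields[1:]:
        key, sep, value = field.partition("=")
        if sep:  # ignore tokens that are not key=value pairs
            flags[key] = _convert(value)
    return name, flags


name, flags = parse_id_line(
    "@read_1 strand=0.91 full_length=1 fusion=0 ploc5=0 ploc3=0 poly5=0 poly3=-20"
)
# Interpret the flags using the table above.
is_sense = flags["strand"] > 0
is_full_length_non_chimeric = flags["full_length"] == 1 and flags["fusion"] == 0
print(name, is_sense, is_full_length_non_chimeric)
```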
## How to specify adapter/primer and polyA/T sequences <a id="HOWTO"></a>
Users need to provide the **adapter/primer** (and **polyA/T**) sequences to be searched for using options `--p5_sense` and `--p3_sense`.
For example, the following options specify that the 5' and 3' adapter/primer sequences on the sense strand are `CATTC` and `GACTA`, respectively.
```
--p5_sense CATTC --p3_sense GACTA
```
To detect polyA/T tails, use the pattern `N{M}` to specify the location and length of the tail, where `N` is the repeated nucleotide and `M` is the maximum tail length. The options below tell NanoPreP that polyA tails of at most 50 bases are located next to the 3' adapter/primer.
```
--p5_sense CATTC --p3_sense A{50}GACTA
```
## Full usage
```
usage: nanoprep [-h] [--version] --input_fq str [--config str] [--report str] [--processes int] [--batch_size int] [--seed int] [-n int] [--beta float] [--disable_annot] [--skip_lowq float] [--skip_short int] [--p5_sense str] [--p3_sense str]
[--isl5 int int] [--isl3 int int] [--pid5 float] [--pid3 float] [--pid_body float] [--poly_w int] [--poly_k int] [--keep_adapter] [--keep_poly] [--filter_lowq float] [--filter_short int] [--orientation int] [--output_fusion str]
[--output_truncated str] [--output_full_length str] [--suffix_filtered str]
options:
-h, --help show this help message and exit
--version show program's version number and exit
--input_fq str input FASTQ
--config str use the parameters in this config file (JSON)
--report str output report file (JSON)
--processes int number of processes to use (default: 16)
--batch_size int number of records in each batch (default: 1000000)
--seed int seed for random number generator (default: 42)
-n int max number of reads to sample during optimization (default: 100000)
--beta float the beta parameter for the optimization (default: .1)
--skip_lowq float skip low-quality reads (default: 7)
--skip_short int skip too-short reads (default: 0)
--p5_sense str 5' sense adapter/primer + polyA sequences
--p3_sense str 3' sense adapter/primer + polyA sequences
--isl5 int int ideal searching location for 5' adapter/primer sequences (default: optimized)
--isl3 int int ideal searching location for 3' adapter/primer sequences (default: optimized)
--pid5 float 5' adapter/primer percent identity cutoff (default: optimized)
--pid3 float 3' adapter/primer percent identity cutoff (default: optimized)
--pid_body float adapter/primer percent identity cutoff (default: optimized)
--poly_w int window size for polyA/T identification
--poly_k int number of A/T to be expected in the window
--trim_adapter use this flag to trim adapter/primer sequences
--trim_poly use this flag to trim polyA/T sequences
--filter_lowq float filter low-quality reads after all trimming steps (default: 7)
--filter_short int filter too short reads after all trimming steps (default: 0)
--orientation int re-orient reads (0: generic , 1: sense (default), -1: antisense)
--output_fusion str output fusion/chimeric reads to this file (use '-' for stdout)
--output_truncated str
output truncated/non-full-length reads to this file (use '-' for stdout)
--output_full_length str
output full-length reads to this file (use '-' for stdout)
--suffix_filtered str
output filtered reads with the suffix
```
Raw data
{
"_id": null,
"home_page": "https://github.com/Woodformation1136/NanoPreP",
"name": "nanoprep-ffm",
"maintainer": null,
"docs_url": null,
"requires_python": null,
"maintainer_email": null,
"keywords": null,
"author": "Chia-Chen Chu",
"author_email": "jerry955071@gmail.com",
"download_url": "https://files.pythonhosted.org/packages/eb/c1/49792205a94554c4f7cdb17aa57220ec35ee5c17596f1f8353a6eae9e891/nanoprep_ffm-0.0.19.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": null,
"summary": "A fully-equipped, fast, and memory-efficient pre-processor for ONT transcriptomic data",
"version": "0.0.19",
"project_urls": {
"Homepage": "https://github.com/Woodformation1136/NanoPreP"
},
"split_keywords": [],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "2771ef67565fca0955754b659a1edbc3d4ff432c70e024782f41702d1b5d1689",
"md5": "884dd14dfbceea9a311002072643f355",
"sha256": "1d832a52acaf3e97a31b5a91ff35d266a2b4702beb455773e4b461c7e6d854d7"
},
"downloads": -1,
"filename": "nanoprep_ffm-0.0.19-py3-none-any.whl",
"has_sig": false,
"md5_digest": "884dd14dfbceea9a311002072643f355",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": null,
"size": 23810,
"upload_time": "2024-08-27T07:39:56",
"upload_time_iso_8601": "2024-08-27T07:39:56.951663Z",
"url": "https://files.pythonhosted.org/packages/27/71/ef67565fca0955754b659a1edbc3d4ff432c70e024782f41702d1b5d1689/nanoprep_ffm-0.0.19-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "ebc149792205a94554c4f7cdb17aa57220ec35ee5c17596f1f8353a6eae9e891",
"md5": "2fe797fe2217ad16b3eaea7187d19f86",
"sha256": "1737efb1db254c1812a745fcb58c276dc1e7781b835c1b4d9ef78b9ab76d8e83"
},
"downloads": -1,
"filename": "nanoprep_ffm-0.0.19.tar.gz",
"has_sig": false,
"md5_digest": "2fe797fe2217ad16b3eaea7187d19f86",
"packagetype": "sdist",
"python_version": "source",
"requires_python": null,
"size": 18888,
"upload_time": "2024-08-27T07:39:58",
"upload_time_iso_8601": "2024-08-27T07:39:58.195330Z",
"url": "https://files.pythonhosted.org/packages/eb/c1/49792205a94554c4f7cdb17aa57220ec35ee5c17596f1f8353a6eae9e891/nanoprep_ffm-0.0.19.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-08-27 07:39:58",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "Woodformation1136",
"github_project": "NanoPreP",
"travis_ci": false,
"coveralls": false,
"github_actions": false,
"requirements": [],
"lcname": "nanoprep-ffm"
}