nanoprep-ffm


|Field|Value|
|:-|:-|
|Name|nanoprep-ffm|
|Version|0.0.19|
|Home page|https://github.com/Woodformation1136/NanoPreP|
|Summary|A fully-equipped, fast, and memory-efficient pre-processor for ONT transcriptomic data|
|Upload time|2024-08-27 07:39:58|
|Author|Chia-Chen Chu|
|Maintainer|None|
|Docs URL|None|
|Requires Python|None|
|License|None|
|Requirements|No requirements were recorded.|

# NanoPreP: a fully-equipped, fast, and memory-efficient pre-processor for ONT transcriptomic data

## Requirements
* Python (>= 3.7)  
* edlib (>=1.3.8)


## Getting started
**Option 1:** clone the repository and run NanoPreP as a Python module, without installation
```
git clone https://github.com/Woodformation1136/NanoPreP.git
cd NanoPreP
python -m NanoPreP --help
```
**Option 2:** install with pip
```
pip install nanoprep-ffm
nanoprep --help
```

## General usage

NanoPreP optimizes adapter/primer identification parameters for each input file. During the optimization process, NanoPreP searches for the combination of (1) the adapter/primer substring used for alignment, (2) the sequence similarity cutoff, and (3) the aligned location cutoff that achieves the highest $F_{\beta}$ score in distinguishing real adapter/primer alignments from random alignments.

The $\beta$ value in the $F_{\beta}$ formula strongly affects NanoPreP's behavior. The recommended range of $\beta$ is 0.1 to 0.3; a smaller $\beta$ weights precision more heavily, which lowers the chance of accepting random alignments.
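
To make the role of $\beta$ concrete, the sketch below computes the standard $F_{\beta}$ score, $(1+\beta^2)\,PR / (\beta^2 P + R)$, from a precision $P$ and recall $R$. This is the textbook definition only; how NanoPreP counts real and random alignments internally is not described here, and the precision/recall values are made up for illustration.

```python
def f_beta(precision: float, recall: float, beta: float) -> float:
    """Return the standard F-beta score for a given precision and recall."""
    if precision == 0.0 and recall == 0.0:
        return 0.0  # avoid dividing by zero when nothing is detected
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Illustrative (made-up) numbers: with beta = 0.1 the score follows precision.
high_precision = f_beta(precision=0.99, recall=0.60, beta=0.1)
high_recall = f_beta(precision=0.60, recall=0.99, beta=0.1)
print(f"precise but incomplete: {high_precision:.3f}")
print(f"complete but imprecise: {high_recall:.3f}")
```

With a small $\beta$, a parameter set that rarely accepts random alignments scores far higher than one that finds more adapters/primers but lets random hits through.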


A typical NanoPreP command for obtaining ***high-quality, non-chimeric, full-length, strand-reoriented, adapter/primer-removed, polyA-removed*** reads:
```
nanoprep \
  --input_fq input.fq \
  --beta 0.1 \
  --p5_sense 5_PRIMER_SEQUENCE \
  --p3_sense A{100}3_PRIMER_SEQUENCE \
  --trim_adapter \
  --trim_poly \
  --output_full_length output.fq \
  --report report.json
```
- `--input_fq input.fq` ← FASTQ file containing the raw reads
- `--beta 0.1` ← optimize adapter/primer identification parameters using the $F_{0.1}$ score
- `--p5_sense 5_PRIMER_SEQUENCE` ← 5' primer sequence in sense-strand direction
- `--p3_sense A{100}3_PRIMER_SEQUENCE` ← expected length of polyA + 3' primer sequence in sense-strand direction (see section [How to specify adapter/primer and polyA/T sequences](#HOWTO))
- `--trim_adapter` ← trim adapter/primer sequences
- `--trim_poly` ← trim polyA/T sequences
- `--output_full_length output.fq` ← write full-length reads to `output.fq`  
- `--report report.json` ← write details of the run to `report.json`  



<!-- TODO: why annotate reads? re-usable, time-saving, transparency, flexibility -->
After running this command, two output files, `output.fq` and `report.json`, will be written to your working directory.

The `report.json` file records the start/stop times, the parameters used, and detailed information about the input FASTQ file.  
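
When NanoPreP is one step in a scripted pipeline, the command shown above can be assembled as an argument list in Python, which sidesteps shell line-continuation mistakes. This is a minimal sketch, assuming `nanoprep` is on `PATH`; `build_nanoprep_command` is a hypothetical helper, and the primer sequences passed to it are the toy sequences from this README, not real primers.

```python
import shlex


def build_nanoprep_command(input_fq, p5_sense, p3_sense, output_fq, report, beta=0.1):
    """Return the argument list for the typical full-length run (hypothetical helper)."""
    return [
        "nanoprep",
        "--input_fq", input_fq,
        "--beta", str(beta),
        "--p5_sense", p5_sense,
        "--p3_sense", p3_sense,
        "--trim_adapter",
        "--trim_poly",
        "--output_full_length", output_fq,
        "--report", report,
    ]


cmd = build_nanoprep_command("input.fq", "CATTC", "A{50}GACTA", "output.fq", "report.json")
# Print a copy-pasteable shell command; each argument is quoted where needed.
print(" ".join(shlex.quote(arg) for arg in cmd))
# To actually run it: import subprocess; subprocess.run(cmd, check=True)
```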

The `output.fq` file contains the full-length reads processed by NanoPreP. For each processed read, NanoPreP appends information about the read to the ID line (the line starting with `@`):
```
@read_1 strand=0.91 full_length=1 fusion=0 ploc5=0 ploc3=0 poly5=0 poly3=-20
AGAGGCTGGCGGGAACGGGC......TTTCAAAGCCAGGCGGATTC
+
+,),+'$)'%671*%('&$%......((&'(*($%$&%&$-((84*
```
As shown in the example above, several flags describe each read:

|flag|regex|default|explanation|
|:-|-|-|:-|
|`strand`|`-?\d+\.\d*`|0|0: unknown; > 0: sense; < 0: antisense|
|`full_length`|`[01]`|0|0: non-full-length; 1: full-length|
|`fusion`|`[01]`|0|0: non-chimeric/-fusion; 1: chimeric/fusion|
|`ploc5`|`-?\d+`|-1|-1: unknown; 0: removed; > 0: 5' adapter/primer location|
|`ploc3`|`-?\d+`|-1|-1: unknown; 0: removed; > 0: 3' adapter/primer location|
|`poly5`|`-?\d+`|-1|0: unknown; > 0: 5' polymer length; < 0: trimmed 5' polymer length|
|`poly3`|`-?\d+`|-1|0: unknown; > 0: 3' polymer length; < 0: trimmed 3' polymer length|

According to the flags, the example read `read_1` is a sense-strand (`strand=0.91`), full-length (`full_length=1`), non-chimeric (`fusion=0`), adapter/primer-removed (`ploc5=0 ploc3=0`), and polyA-removed (`poly3=-20`) read.
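
Because the flags are plain `key=value` pairs on the ID line, downstream scripts can read them back without NanoPreP. The sketch below is one way to do so, assuming the ID-line layout shown in the example above; `parse_id_line` is a hypothetical helper and not part of NanoPreP.

```python
import re
from typing import Dict, Tuple, Union

# key=value pairs where the value is an optionally signed integer or decimal
FLAG_PATTERN = re.compile(r"(\w+)=(-?\d+(?:\.\d*)?)")


def parse_id_line(line: str) -> Tuple[str, Dict[str, Union[int, float]]]:
    """Split an annotated FASTQ ID line into the read name and its flags."""
    line = line.strip()
    if line.startswith("@"):
        line = line[1:]
    name, _, rest = line.partition(" ")
    flags: Dict[str, Union[int, float]] = {}
    for key, value in FLAG_PATTERN.findall(rest):
        # Values with a decimal point (e.g. strand=0.91) become floats.
        flags[key] = float(value) if "." in value else int(value)
    return name, flags


name, flags = parse_id_line(
    "@read_1 strand=0.91 full_length=1 fusion=0 ploc5=0 ploc3=0 poly5=0 poly3=-20"
)
is_full_length_non_chimeric = flags["full_length"] == 1 and flags["fusion"] == 0
print(name, flags, is_full_length_non_chimeric)
```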


## How to specify adapter/primer and polyA/T sequences <a id="HOWTO"></a>
Users need to provide the **adapter/primer** (and **polyA/T**) sequences to be searched for using options `--p5_sense` and `--p3_sense`. 

For example, the following options specify that the 5' and 3' adapter/primer sequences on the sense strand are 'CATTC' and 'GACTA', respectively.
```
--p5_sense CATTC --p3_sense GACTA
```
If users wish to detect polyA/T tails, the pattern `N{M}` can be used to specify the location and length of the tail, where `N` is the repeated base and `M` is the length. The options below tell NanoPreP that there are poly`"A"` tails of a maximum length of `"50"` bases next to the 3' adapters/primers.
```
--p5_sense CATTC --p3_sense A{50}GACTA
```
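
To illustrate how an `N{M}` specification breaks down, the sketch below splits a `--p3_sense` value into the repeated base, its length, and the remaining adapter/primer sequence. It only mirrors the notation described above; `split_primer_spec` is a hypothetical helper, and NanoPreP's own parsing may differ.

```python
import re

# One base followed by a length in braces, e.g. "A{50}"
POLY_PATTERN = re.compile(r"([ACGTN])\{(\d+)\}")


def split_primer_spec(spec):
    """Split e.g. 'A{50}GACTA' into (poly_base, poly_length, primer)."""
    match = POLY_PATTERN.search(spec)
    if match is None:
        # No N{M} pattern: the whole string is the adapter/primer.
        return None, 0, spec
    primer = spec[:match.start()] + spec[match.end():]
    return match.group(1), int(match.group(2)), primer


print(split_primer_spec("A{50}GACTA"))
print(split_primer_spec("CATTC"))
```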



## Full usage
```
usage: nanoprep [-h] [--version] --input_fq str [--config str] [--report str] [--processes int] [--batch_size int] [--seed int] [-n int] [--beta float] [--disable_annot] [--skip_lowq float] [--skip_short int] [--p5_sense str] [--p3_sense str]
                [--isl5 int int] [--isl3 int int] [--pid5 float] [--pid3 float] [--pid_body float] [--poly_w int] [--poly_k int] [--keep_adapter] [--keep_poly] [--filter_lowq float] [--filter_short int] [--orientation int] [--output_fusion str]
                [--output_truncated str] [--output_full_length str] [--suffix_filtered str]

options:
  -h, --help            show this help message and exit
  --version             show program's version number and exit
  --input_fq str        input FASTQ
  --config str          use the parameters in this config file (JSON)
  --report str          output report file (JSON)
  --processes int       number of processes to use (default: 16)
  --batch_size int      number of records in each batch (default: 1000000)
  --seed int            seed for random number generator (default: 42)
  -n int                max number of reads to sample during optimization (default: 100000)
  --beta float          the beta parameter for the optimization (default: .1)
  --skip_lowq float     skip low-quality reads (default: 7)
  --skip_short int      skip too-short reads (default: 0)
  --p5_sense str        5' sense adatper/primer + polyA sequences
  --p3_sense str        3' sense adatper/primer + polyA sequences
  --isl5 int int        ideal searching location for 5' adapter/primer sequences (default: optimized)
  --isl3 int int        ideal searching location for 3' adapter/primer sequences (default: optimized)
  --pid5 float          5' adapter/primer percent identity cutoff (default: optimized)
  --pid3 float          3' adapter/primer percent identity cutoff (default: optimized)
  --pid_body float      adapter/primer percent identity cutoff (default: optimized)
  --poly_w int          window size for polyA/T identification
  --poly_k int          number of A/T to be expected in the window
  --trim_adapter        use this flag to trim adatper/primer sequences
  --trim_poly           use this flag to trim polyA/T sequences
  --filter_lowq float   filter low-quality reads after all trimming steps (default: 7)
  --filter_short int    filter too short reads after all trimming steps (default: 0)
  --orientation int     re-orient reads (0: generic , 1: sense (default), -1: antisense)
  --output_fusion str   output fusion/chimeric reads to this file (use '-' for stdout)
  --output_truncated str
                        output truncated/non-full-length reads to this file (use '-' for stdout)
  --output_full_length str
                        output full-length reads to this file (use '-' for stdout)
  --suffix_filtered str
                        output filtered reads with the suffix
```
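
The `--poly_w` and `--poly_k` options describe a window-based test: a stretch counts as polyA/T when a window of `poly_w` bases contains at least `poly_k` A/T bases. The sketch below shows one simple way such a test could measure a 3' polyA tail while tolerating sequencing errors. It is an assumption for illustration only, not NanoPreP's implementation; the function name, the non-overlapping window walk, and the default values `window=6` and `min_count=5` are all hypothetical.

```python
def poly_tail_length(seq, base="A", window=6, min_count=5):
    """Estimate the 3' homopolymer tail length using non-overlapping windows.

    Walks leftward from the 3' end and keeps extending the tail while each
    window of `window` bases contains at least `min_count` copies of `base`.
    """
    length = 0
    end = len(seq)
    while end - window >= 0:
        chunk = seq[end - window:end]
        if chunk.count(base) < min_count:
            break  # this window is no longer base-rich: the tail stops here
        length += window
        end -= window
    return length


clean_tail = poly_tail_length("GATTACAGG" + "A" * 12)
noisy_tail = poly_tail_length("GATTACAGG" + "AAAAAGAAAAAA")  # one miscalled base
no_tail = poly_tail_length("ACGTACGT")
print(clean_tail, noisy_tail, no_tail)
```

Requiring `min_count` below `window` is what lets a tail with an occasional miscalled base still be recognized as one continuous polymer.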

            

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/Woodformation1136/NanoPreP",
    "name": "nanoprep-ffm",
    "maintainer": null,
    "docs_url": null,
    "requires_python": null,
    "maintainer_email": null,
    "keywords": null,
    "author": "Chia-Chen Chu",
    "author_email": "jerry955071@gmail.com",
    "download_url": "https://files.pythonhosted.org/packages/eb/c1/49792205a94554c4f7cdb17aa57220ec35ee5c17596f1f8353a6eae9e891/nanoprep_ffm-0.0.19.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": null,
    "summary": "A fully-equipped, fast, and memory-efficient pre-processor for ONT transcriptomic data",
    "version": "0.0.19",
    "project_urls": {
        "Homepage": "https://github.com/Woodformation1136/NanoPreP"
    },
    "split_keywords": [],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "2771ef67565fca0955754b659a1edbc3d4ff432c70e024782f41702d1b5d1689",
                "md5": "884dd14dfbceea9a311002072643f355",
                "sha256": "1d832a52acaf3e97a31b5a91ff35d266a2b4702beb455773e4b461c7e6d854d7"
            },
            "downloads": -1,
            "filename": "nanoprep_ffm-0.0.19-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "884dd14dfbceea9a311002072643f355",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": null,
            "size": 23810,
            "upload_time": "2024-08-27T07:39:56",
            "upload_time_iso_8601": "2024-08-27T07:39:56.951663Z",
            "url": "https://files.pythonhosted.org/packages/27/71/ef67565fca0955754b659a1edbc3d4ff432c70e024782f41702d1b5d1689/nanoprep_ffm-0.0.19-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "ebc149792205a94554c4f7cdb17aa57220ec35ee5c17596f1f8353a6eae9e891",
                "md5": "2fe797fe2217ad16b3eaea7187d19f86",
                "sha256": "1737efb1db254c1812a745fcb58c276dc1e7781b835c1b4d9ef78b9ab76d8e83"
            },
            "downloads": -1,
            "filename": "nanoprep_ffm-0.0.19.tar.gz",
            "has_sig": false,
            "md5_digest": "2fe797fe2217ad16b3eaea7187d19f86",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": null,
            "size": 18888,
            "upload_time": "2024-08-27T07:39:58",
            "upload_time_iso_8601": "2024-08-27T07:39:58.195330Z",
            "url": "https://files.pythonhosted.org/packages/eb/c1/49792205a94554c4f7cdb17aa57220ec35ee5c17596f1f8353a6eae9e891/nanoprep_ffm-0.0.19.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-08-27 07:39:58",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "Woodformation1136",
    "github_project": "NanoPreP",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "requirements": [],
    "lcname": "nanoprep-ffm"
}
        