cutseq


Name: cutseq
Version: 0.0.34
Home page: https://github.com/y9c/cutseq
Summary: Automatically cut adapter / barcode / UMI from NGS data
Upload time: 2024-05-02 02:34:55
Author: Ye Chang
Requires Python: <4.0,>=3.8
License: MIT
Keywords: bioinformatics, NGS, adapter, barcode, UMI

# ✂️ CutSeq

[![Pypi Releases](https://img.shields.io/pypi/v/cutseq.svg)](https://pypi.python.org/pypi/cutseq)
[![Downloads](https://pepy.tech/badge/cutseq)](https://pepy.tech/project/cutseq)

CutSeq is an efficient wrapper around cutadapt, a powerful tool for handling various types of NGS libraries.
Because NGS library preparation methods are complex, multiple operations are necessary to process sequencing reads correctly.

Take the _SMARTer® Stranded Total RNA-Seq Kit v3_ as an example: at least 9 operations are required.

![](https://raw.githubusercontent.com/y9c/cutseq/main/docs/takaraV3.png)

For **Read 1**:

1.  Remove the Illumina p7 adapter from the end of the sequence.
2.  Remove 14 nt (8+3+3) at the rightmost position of the sequence, representing UMI and linker sequence from the beginning of read 2. This is required when the library insert size is shorter than the sequencing length.
3.  Remove poly-T sequences at the beginning of the sequence (read 1 is oriented in reverse to the RNA, hence a polyA tail appears as a leading polyT sequence).
4.  Remove low-quality bases from right to left.

For **Read 2**:

5.  Remove the reverse complement Illumina p5 adapter from the end of the sequence.
6.  Extract the 8 nt UMI sequence from the beginning of the sequence and append it to the read name for downstream analysis.
7.  Mask a 6 nt linker sequence at the leftmost position immediately after clipping the UMI sequence.
8.  Remove poly-A sequences at the end of the read.
9.  Remove low-quality bases from right to left.
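
The Read 2 steps can be sketched in plain Python to make the order of operations concrete. This is only an illustration, not how CutSeq or cutadapt is implemented: the function name `process_read2`, its parameters, the exact-match adapter search, and the `_` separator used for the read name are all assumptions, and quality trimming (step 9) is left out for brevity.

```python
import re


def process_read2(name, seq, adapter, umi_len=8, linker_len=6, min_poly_a=3):
    """Apply Read 2 steps 5-8 to a single read, in order (hypothetical helper)."""
    # Step 5: remove the adapter and everything after it (exact match only).
    pos = seq.find(adapter)
    if pos != -1:
        seq = seq[:pos]
    # Step 6: move the leading UMI into the read name.
    umi, seq = seq[:umi_len], seq[umi_len:]
    name = f"{name}_{umi}"
    # Step 7: drop (mask) the linker that directly follows the UMI.
    seq = seq[linker_len:]
    # Step 8: strip a trailing poly-A run of at least `min_poly_a` bases.
    seq = re.sub(rf"A{{{min_poly_a},}}$", "", seq)
    return name, seq


# Toy read: UMI + linker + insert + poly-A + adapter + extra bases.
ADAPTER = "AGATCGGAAGAGCGTCGTGT"  # reverse complement of ACACGACGCTCTTCCGATCT
read = "ACGTACGT" + "GGGTTT" + "CCTGAGCT" + "AAAAAA" + ADAPTER + "GG"
print(process_read2("read1", read, ADAPTER))
```

A real trimmer also has to tolerate sequencing errors and partial adapter matches, which this sketch does not attempt.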

These operations must be performed in the **correct order**. The limitations of cutadapt make it challenging to configure these operations in a single command, which often leads to errors that have gone unnoticed in some publications.
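
A small worked example shows why the order matters for Read 1 when the insert is shorter than the read length. The helpers `trim_adapter` and `cut_right` are hypothetical stand-ins for adapter removal and fixed-length clipping, and the sequences are made up for illustration.

```python
def trim_adapter(seq, adapter):
    """Remove the adapter and everything after it (exact match only)."""
    pos = seq.find(adapter)
    return seq if pos == -1 else seq[:pos]


def cut_right(seq, n):
    """Remove n bases from the right end of the sequence."""
    return seq[:-n] if n else seq


INSERT = "TTGCACCTGA"
UMI_LINKER = "CATCATGGTACCGA"     # 14 nt read-through from the start of read 2
ADAPTER = "AGATCGGAAGAGCACACGTC"  # adapter sequence from the example scheme
TAIL = "TGAACTCCAGTCAC"           # bases sequenced past the adapter

read1 = INSERT + UMI_LINKER + ADAPTER + TAIL

# Correct order: adapter first, then the 14 nt UMI + linker.
correct = cut_right(trim_adapter(read1, ADAPTER), 14)

# Wrong order: the fixed-length cut removes the tail instead,
# so the UMI + linker stay attached to the insert.
wrong = trim_adapter(cut_right(read1, 14), ADAPTER)

print(correct == INSERT)
print(wrong == INSERT + UMI_LINKER)
```

In the wrong order the read still looks trimmed, which is why this kind of mistake is easy to miss.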

---

To solve this with cutadapt alone, we can run multiple cutadapt instances sequentially or pipe multiple commands together, but this wastes a lot of I/O and computational resources. I thought there should be a more elegant API to make things easy, and so this toy project was born.
-- **All you need is a single parameter that specifies what the library looks like.**

CutSeq overcomes these limitations by performing the multiple operations automatically, ensuring accuracy and efficiency.

## How to install?

```bash
pip install cutseq
```

## How to use?

Execute adapter trimming by providing a single parameter and your input files:

```bash
cutseq -A TAKARAV3 test_R1.fq.gz test_R2.fq.gz
```

Alternatively, you can specify a custom adapter sequence:

`cutseq -a "ACACGACGCTCTTCCGATCTXXX<XXXXXXNNNNNNNNAGATCGGAAGAGCACACGTC"`

![](https://raw.githubusercontent.com/y9c/cutseq/main/docs/explain_library.png)

The custom scheme is explained by the diagram above.

- The outermost parts on both ends are the Illumina adapters.
- The first inner parts are inline barcode sequences or custom PCR primers from the library construction step. These are also fixed DNA sequences, and are represented by the sequence within `(` and `)`.
- The second inner parts are the UMI sequences, which are random and are represented by `N`.
- The innermost parts are sequences to be masked, represented by `X`. These can be random tails added in the library construction step, caused by template switching or other reasons.
- The center part is the actual library sequence, represented by `>`, `<` or `-`. `>` means the sequence is forward, `<` means it is reverse, and `-` means its orientation is unknown.
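
To make the notation concrete, here is a minimal sketch of how such a scheme string could be split into its parts with a regular expression. It is not CutSeq's actual parser: the pattern, the function name `parse_scheme`, and the returned keys are assumptions based only on the description above.

```python
import re

# Layout assumed from the description above:
# adapter (barcode) UMI mask STRAND mask UMI (barcode) adapter
SCHEME = re.compile(
    r"^(?P<p5>[ACGT]*)"
    r"(?:\((?P<bc5>[ACGT]+)\))?"
    r"(?P<umi5>N*)"
    r"(?P<mask5>X*)"
    r"(?P<strand>[<>-])"
    r"(?P<mask3>X*)"
    r"(?P<umi3>N*)"
    r"(?:\((?P<bc3>[ACGT]+)\))?"
    r"(?P<p7>[ACGT]*)$"
)

STRAND_NAMES = {">": "forward", "<": "reverse", "-": "unknown"}


def parse_scheme(scheme):
    """Split a scheme string into adapters, barcodes, UMI and mask lengths."""
    m = SCHEME.match(scheme)
    if m is None:
        raise ValueError(f"unrecognized scheme: {scheme!r}")
    g = m.groupdict()
    return {
        "p5_adapter": g["p5"],
        "barcode5": g["bc5"] or "",
        "umi5_len": len(g["umi5"]),
        "mask5_len": len(g["mask5"]),
        "strand": STRAND_NAMES[g["strand"]],
        "mask3_len": len(g["mask3"]),
        "umi3_len": len(g["umi3"]),
        "barcode3": g["bc3"] or "",
        "p7_adapter": g["p7"],
    }


example = "ACACGACGCTCTTCCGATCTXXX<XXXXXXNNNNNNNNAGATCGGAAGAGCACACGTC"
for key, value in parse_scheme(example).items():
    print(key, value)
```

For the example scheme, this reads as a reverse-stranded library with a 3 nt mask on the left, and a 6 nt mask plus an 8 nt UMI on the right.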

More details can be found in the [documentation](https://cutseq.yech.science).

## TODO

- [ ] Support more library schemes


            

Raw data

{
    "_id": null,
    "home_page": "https://github.com/y9c/cutseq",
    "name": "cutseq",
    "maintainer": null,
    "docs_url": null,
    "requires_python": "<4.0,>=3.8",
    "maintainer_email": null,
    "keywords": "bioinformatics, NGS, adapter, barcode, UMI",
    "author": "Ye Chang",
    "author_email": "yech1990@gmail.com",
    "download_url": "https://files.pythonhosted.org/packages/74/28/cffca509aabce19d094c8e1d0faffd411214a43684616251aeddf90ff3dd/cutseq-0.0.34.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "Automatically cut adapter / barcode / UMI from NGS data",
    "version": "0.0.34",
    "project_urls": {
        "Homepage": "https://github.com/y9c/cutseq",
        "Repository": "https://github.com/y9c/cutseq"
    },
    "split_keywords": [
        "bioinformatics",
        " ngs",
        " adapter",
        " barcode",
        " umi"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "7f4dd207760ee032baa8189d837cbfbdb5c5a57488aef613243b6a0e94a0b3ee",
                "md5": "c3fcbb0db925f9d2b53d5b52d7a3549b",
                "sha256": "66d865fc9d16917d9d21777c1e4620de7022e530e79dec2269e59450eb300e18"
            },
            "downloads": -1,
            "filename": "cutseq-0.0.34-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "c3fcbb0db925f9d2b53d5b52d7a3549b",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": "<4.0,>=3.8",
            "size": 12917,
            "upload_time": "2024-05-02T02:34:53",
            "upload_time_iso_8601": "2024-05-02T02:34:53.645742Z",
            "url": "https://files.pythonhosted.org/packages/7f/4d/d207760ee032baa8189d837cbfbdb5c5a57488aef613243b6a0e94a0b3ee/cutseq-0.0.34-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "7428cffca509aabce19d094c8e1d0faffd411214a43684616251aeddf90ff3dd",
                "md5": "382403e6f95182f96570c44b59a5d358",
                "sha256": "8c932ce2774e9bfc31e521617565d9dd038bcd4ff5d8045991936ebcdda9bcc7"
            },
            "downloads": -1,
            "filename": "cutseq-0.0.34.tar.gz",
            "has_sig": false,
            "md5_digest": "382403e6f95182f96570c44b59a5d358",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": "<4.0,>=3.8",
            "size": 11666,
            "upload_time": "2024-05-02T02:34:55",
            "upload_time_iso_8601": "2024-05-02T02:34:55.446242Z",
            "url": "https://files.pythonhosted.org/packages/74/28/cffca509aabce19d094c8e1d0faffd411214a43684616251aeddf90ff3dd/cutseq-0.0.34.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-05-02 02:34:55",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "y9c",
    "github_project": "cutseq",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "lcname": "cutseq"
}
        