parasplit

- Name: parasplit
- Version: 1.1.3
- Summary: An Hi-C tool for cutting sequences using specified enzymes
- Upload time: 2025-10-27 10:46:52
- License: AGPLv3
- Keywords: hi-c, hic, bioinformatics, cutsite
- Requirements: No requirements were recorded.

            <!--
SPDX-FileCopyrightText: 2024 Samir Bertache
SPDX-FileCopyrightText: 2025 2024 Samir Bertache

SPDX-License-Identifier: AGPL-3.0-or-later
SPDX-License-Identifier: CC0-1.0
-->

![pipeline status](https://gitbio.ens-lyon.fr/LBMC/hub/parasplit/badges/master/pipeline.svg)
![coverage report](https://gitbio.ens-lyon.fr/LBMC/hub/parasplit/badges/master/coverage.svg?job=tests)



# PARASPLIT

## Overview


Parasplit is a Python script that processes paired-end FASTQ files by fragmenting DNA sequences at specified restriction enzyme sites. It handles large datasets by delegating decompression and compression to pigz, which runs them on multiple threads.

## Features


- **Find and Utilize Restriction Enzyme Sites:** Automatically identifies ligation sites from provided enzyme names and generates regex patterns to locate these sites in sequences.

- **Fragmentation:** Splits sequences at restriction enzyme sites and keeps only the fragments that are at least as long as the specified seed size.

- **Multi-threading:** Efficiently processes large datasets by utilizing multiple threads for decompression and compression.

- **Custom Modes:** Supports different pairing modes for sequence fragments.
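
As an illustration of the first feature, here is a minimal sketch of how a ligation junction and its regex could be derived from an enzyme's cut pattern. It is not Parasplit's actual code: the helper names `ligation_site` and `ligation_regex` are hypothetical, the input uses the caret/underscore notation (`G^AATT_C`) in which Biopython prints cut sites, and the junction is assumed to result from fill-in and blunt ligation of a 5' overhang.

```python
import re

# IUPAC nucleotide codes mapped to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "N": "[ACGT]", "R": "[AG]", "Y": "[CT]", "W": "[AT]", "S": "[CG]",
    "M": "[AC]", "K": "[GT]", "B": "[CGT]", "D": "[AGT]", "H": "[ACT]",
    "V": "[ACG]",
}


def ligation_site(elucidated: str) -> str:
    """Return the junction formed after fill-in and blunt ligation.

    `elucidated` marks the top-strand cut with '^' and the bottom-strand
    cut with '_', e.g. 'G^AATT_C' for EcoRI (hypothetical input format
    for this sketch).
    """
    site = elucidated.replace("^", "").replace("_", "")
    top = elucidated.replace("_", "").index("^")
    bottom = elucidated.replace("^", "").index("_")
    if top > bottom:
        # 3' overhangs are not filled in the same way; out of scope here.
        raise ValueError(f"3' overhang not handled in this sketch: {elucidated}")
    # Filled-in left end followed by filled-in right end.
    return site[:bottom] + site[top:]


def ligation_regex(elucidated_sites):
    """Compile one regex matching the junction of each enzyme.

    Only same-enzyme junctions are covered; mixed junctions between two
    different enzymes are ignored in this sketch.
    """
    junctions = [ligation_site(s) for s in elucidated_sites]
    alternatives = ["".join(IUPAC[base] for base in j) for j in junctions]
    return re.compile("|".join(alternatives))


# EcoRI and HinfI, the enzymes used in the example command.
pattern = ligation_regex(["G^AATT_C", "G^ANT_C"])
```

The ambiguous base `N` in the HinfI site becomes the character class `[ACGT]`, so a single compiled pattern covers every concrete junction sequence.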


## Installation


Parasplit requires Python 3 and pigz. On Debian or Ubuntu, install both with:

```bash
sudo apt-get install pigz
pip install parasplit
```


## Usage


The script can be executed from the command line with various arguments to customize its behavior.


### Command-Line Arguments


- `--source_forward` (str): Input file path for forward reads. Default is `../data/R1.fq.gz`.

- `--source_reverse` (str): Input file path for reverse reads. Default is `../data/R2.fq.gz`.

- `--output_forward` (str): Output file path for processed forward reads. Default is `../data/output_forward.fq.gz`.

- `--output_reverse` (str): Output file path for processed reverse reads. Default is `../data/output_reverse.fq.gz`.

- `--list_enzyme` (str): Comma-separated list of restriction enzymes. Default is "No restriction enzyme found."

- `--mode` (str): Mode of pairing fragments. Options are `all` or `fr`. Default is `fr`.

- `--seed_size` (int): Minimum length of fragments to keep. Default is 20.

- `--num_threads` (int): Number of threads to use for processing. Default is 8.

- `--borderless` (flag): Do not conserve the ligation sites in the output fragments.
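
The options above could be declared with `argparse` roughly as follows. This is a sketch that mirrors the documented names and defaults, not the package's actual parser; `build_parser` is a hypothetical name, and treating `--borderless` as a boolean switch is an assumption.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented options (illustrative only)."""
    parser = argparse.ArgumentParser(prog="parasplit")
    parser.add_argument("--source_forward", type=str, default="../data/R1.fq.gz")
    parser.add_argument("--source_reverse", type=str, default="../data/R2.fq.gz")
    parser.add_argument("--output_forward", type=str,
                        default="../data/output_forward.fq.gz")
    parser.add_argument("--output_reverse", type=str,
                        default="../data/output_reverse.fq.gz")
    parser.add_argument("--list_enzyme", type=str,
                        default="No restriction enzyme found.")
    parser.add_argument("--mode", type=str, choices=["all", "fr"], default="fr")
    parser.add_argument("--seed_size", type=int, default=20)
    parser.add_argument("--num_threads", type=int, default=8)
    # Assumption: --borderless takes no value and is off by default.
    parser.add_argument("--borderless", action="store_true")
    return parser


args = build_parser().parse_args(["--list_enzyme=EcoRI,HinfI", "--mode=all"])
enzymes = args.list_enzyme.split(",")  # comma-separated list -> ["EcoRI", "HinfI"]
```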

### Example Command


```bash
parasplit --source_forward="../data/R1.fq.gz" --source_reverse="../data/R2.fq.gz" --output_forward="../data/output_forward.fq.gz" --output_reverse="../data/output_reverse.fq.gz" --list_enzyme=EcoRI,HinfI --mode=all --seed_size=20 --num_threads=8
```


## Main Script


- **Pretreatment:** Retrieves the restriction sites from the Biopython database and allocates resources to the different processes.

- **Read:** Decompresses and reads both FASTQ files simultaneously, then sends the reads to a multiprocessing queue.

- **Frag:** Retrieves sequences from the queue, splits them into fragments at the restriction enzyme sites, builds pairs, and sends them to a multiprocessing queue.

- **WriteAndControl:** Streams data from the output queue to the output files and compresses it in parallel.
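
The Frag step can be pictured with the sketch below, which cuts one read at every ligation junction and drops fragments shorter than the seed size. It is an illustration under assumptions, not the code in `Frag.py`: the function name `split_read` is hypothetical, the cut is placed at the middle of each junction, and pairing modes and `--borderless` are not modelled.

```python
import re


def split_read(seq: str, qual: str, pattern: re.Pattern, seed_size: int = 20):
    """Split a read and its quality string at each junction match.

    Assumption: the cut falls at the midpoint of the matched junction, so
    each fragment keeps its half of the site. Fragments shorter than
    `seed_size` are discarded.
    """
    cuts = [(m.start() + m.end()) // 2 for m in pattern.finditer(seq)]
    bounds = [0] + cuts + [len(seq)]
    fragments = []
    for start, end in zip(bounds, bounds[1:]):
        if end - start >= seed_size:
            fragments.append((seq[start:end], qual[start:end]))
    return fragments


# Two DpnII-style junctions (GATCGATC) inside a 73-base read.
read = "A" * 25 + "GATCGATC" + "C" * 30 + "GATCGATC" + "TT"
quality = "I" * len(read)
fragments = split_read(read, quality, re.compile("GATCGATC"), seed_size=20)
```

Here the read yields three pieces, and the last one (6 bases) is discarded because it is shorter than the seed size.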


## Project architecture

![Architecture diagram](images/EnglishVersion.svg)

*Architecture diagram. License: [CC BY-NC 4.0](https://creativecommons.org/licenses/by-nc/4.0/)*
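
The read, fragment, and write stages form a producer/consumer chain connected by queues. The sketch below shows that shape in a few lines. For brevity it swaps in threads and `queue.Queue` in place of separate processes and multiprocessing queues, and all names (`reader`, `worker`, `writer`, `run_pipeline`) are hypothetical.

```python
import queue
import threading

SENTINEL = None  # marks the end of the stream


def reader(records, out_q):
    """Push every input record to the first queue, then signal the end."""
    for record in records:
        out_q.put(record)
    out_q.put(SENTINEL)


def worker(in_q, out_q, transform):
    """Apply `transform` to each record and forward the result."""
    while True:
        item = in_q.get()
        if item is SENTINEL:
            out_q.put(SENTINEL)
            break
        out_q.put(transform(item))


def writer(in_q, sink):
    """Drain the output queue into `sink` (stands in for the output file)."""
    while True:
        item = in_q.get()
        if item is SENTINEL:
            break
        sink.append(item)


def run_pipeline(records, transform):
    """Run the three stages concurrently and return what the writer collected."""
    read_q = queue.Queue(maxsize=100)   # bounded queues limit memory use
    write_q = queue.Queue(maxsize=100)
    sink = []
    stages = [
        threading.Thread(target=reader, args=(records, read_q)),
        threading.Thread(target=worker, args=(read_q, write_q, transform)),
        threading.Thread(target=writer, args=(write_q, sink)),
    ]
    for stage in stages:
        stage.start()
    for stage in stages:
        stage.join()
    return sink


result = run_pipeline(["acgt", "ttaa"], str.upper)
```

With a single worker the output order matches the input order; with several workers, one sentinel per worker and some reordering logic would be needed.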

## Dependencies

- pigz
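
Because pigz is an external program, it can be useful to check that it is on the `PATH` before starting a run. The sketch below shows one way to do that and to stream FASTQ records from a pigz subprocess. The function names are hypothetical and this is not necessarily how Parasplit calls pigz; the flags used (`-d`, `-c`, `-p`) are standard pigz options.

```python
import shutil
import subprocess
from itertools import islice


def pigz_available() -> bool:
    """Return True if a pigz executable is found on the PATH."""
    return shutil.which("pigz") is not None


def pigz_decompress_cmd(path: str, threads: int = 8) -> list[str]:
    """Build the command line: -d decompress, -c to stdout, -p thread count."""
    return ["pigz", "-dc", "-p", str(threads), path]


def fastq_records(lines):
    """Group an iterable of text lines into 4-line FASTQ records.

    A trailing incomplete record is silently dropped in this sketch.
    """
    lines = iter(lines)
    while True:
        record = [line.rstrip("\n") for line in islice(lines, 4)]
        if len(record) < 4:
            return
        yield tuple(record)


def stream_fastq(path: str, threads: int = 8):
    """Yield FASTQ records decompressed on the fly by pigz."""
    if not pigz_available():
        raise RuntimeError("pigz not found on PATH")
    cmd = pigz_decompress_cmd(path, threads)
    with subprocess.Popen(cmd, stdout=subprocess.PIPE, text=True) as proc:
        yield from fastq_records(proc.stdout)
```

For example, `for header, seq, plus, qual in stream_fastq("../data/R1.fq.gz"): ...` would iterate over the forward reads without writing a decompressed copy to disk.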


## Project tree structure


			├── myproject/
			│   ├── __init__.py
			│   ├── main.py
			│   ├── Frag.py
			│   ├── Read.py
			│   ├── Pretreatment.py
			│   └── WriteAndControl.py
			├── pyproject.toml
			├── requirements-dev.txt
			├── docs/
			│   └── requirements.txt
			├── test/
			│   ├── __init__.py
			│   ├── test_main.py	
			│   ├── input_data/
			│   │   ├── R1.fq.gz
			│   │   └── R2.fq.gz
			│   └── output_data/
			│       ├── output_ref_R1.fq.gz
			│       ├── output_ref_R2.fq.gz
			│       ├── output_ref_all_R1.fq.gz
			│       └── output_ref_all_R2.fq.gz
			└── README.md
			
## Contact


For questions or issues, please contact [samir.bertache.djenadi@gmail.com](mailto:samir.bertache.djenadi@gmail.com).


---

This README provides an overview of Parasplit's functionality, usage instructions, and implementation details. For more detailed information, refer to the source code and docstrings.

			

            
