| Field | Value |
|---|---|
| Name | parasplit |
| Version | 1.1.2 |
| Summary | An Hi-C tool for cutting sequences using specified enzymes |
| upload_time | 2024-07-19 08:26:02 |
| license | AGPLv3 |
| keywords | hi-c, hic, bioinformatics, cutsite |
| requirements | No requirements were recorded. |
<!--
SPDX-FileCopyrightText: 2024 Samir Bertache
SPDX-License-Identifier: CC0-1.0
-->
# CUTSITE SCRIPT README
## Overview
Parasplit is a Python script designed to process paired-end FASTQ files by fragmenting DNA sequences at specified restriction enzyme sites. It efficiently handles large datasets by leveraging multi-threading for decompression and compression using pigz.
## Features
- **Find and Utilize Restriction Enzyme Sites:** Automatically identifies ligation sites from provided enzyme names and generates regex patterns to locate these sites in sequences.
- **Fragmentation:** Splits sequences at restriction enzyme sites and keeps only the fragments that are at least as long as the specified seed size.
- **Multi-threading:** Efficiently processes large datasets by utilizing multiple threads for decompression and compression.
- **Custom Modes:** Supports different pairing modes for sequence fragments.
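As a rough illustration of the first two features, the sketch below builds one regex from a list of ligation-site strings and splits a read on it. The site strings are hard-coded assumptions (parasplit derives them from the enzyme names), `ligation_regex` and `split_at_sites` are hypothetical names, and this version simply drops the matched site, which may differ from the tool's default behaviour.

```python
import re

# IUPAC nucleotide codes mapped to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "N": "[ACGT]",
}


def ligation_regex(sites):
    """Compile one alternation pattern that matches any of the given sites."""
    alternatives = ["".join(IUPAC[base] for base in site) for site in sites]
    return re.compile("|".join(alternatives))


def split_at_sites(sequence, pattern, seed_size):
    """Split `sequence` at every match and keep fragments >= seed_size."""
    fragments = pattern.split(sequence)
    return [frag for frag in fragments if len(frag) >= seed_size]


# Assumed ligation sites: GATCGATC (DpnII) and GANTANTC (HinfI).
pattern = ligation_regex(["GATCGATC", "GANTANTC"])
read = "AAAAACCCCC" + "GATCGATC" + "TTTTTGGGGGTT" + "GACTAGTC" + "ACG"
print(split_at_sites(read, pattern, seed_size=5))
# ['AAAAACCCCC', 'TTTTTGGGGGTT']
```

The trailing `ACG` fragment is shorter than the seed size of 5, so it is discarded.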
## Installation
Ensure you have Python 3 installed along with the required dependencies:
```bash
sudo apt-get install pigz
pip install parasplit
```
## Usage
The script can be executed from the command line with various arguments to customize its behavior.
### Command-Line Arguments
- `--source_forward` (str): Input file path for forward reads. Default is `../data/R1.fq.gz`.
- `--source_reverse` (str): Input file path for reverse reads. Default is `../data/R2.fq.gz`.
- `--output_forward` (str): Output file path for processed forward reads. Default is `../data/output_forward.fq.gz`.
- `--output_reverse` (str): Output file path for processed reverse reads. Default is `../data/output_reverse.fq.gz`.
- `--list_enzyme` (str): Comma-separated list of restriction enzymes. Default is "No restriction enzyme found."
- `--mode` (str): Mode of pairing fragments. Options are `all` or `fr`. Default is `fr`.
- `--seed_size` (int): Minimum length of fragments to keep. Default is 20.
- `--num_threads` (int): Number of threads to use for processing. Default is 8.
- `--borderless`: If set, ligation sites are not conserved in the output fragments.
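The arguments listed above can be mirrored with a small `argparse` parser. This is only a sketch: the option names and defaults come from the list, while the use of `argparse`, the `choices` restriction on `--mode`, and treating `--borderless` as a boolean flag are assumptions about how the real entry point is written.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented options (hypothetical helper)."""
    parser = argparse.ArgumentParser(prog="parasplit")
    parser.add_argument("--source_forward", type=str, default="../data/R1.fq.gz")
    parser.add_argument("--source_reverse", type=str, default="../data/R2.fq.gz")
    parser.add_argument("--output_forward", type=str,
                        default="../data/output_forward.fq.gz")
    parser.add_argument("--output_reverse", type=str,
                        default="../data/output_reverse.fq.gz")
    # The documented default is a placeholder message, not an enzyme name.
    parser.add_argument("--list_enzyme", type=str,
                        default="No restriction enzyme found.")
    parser.add_argument("--mode", type=str, choices=["all", "fr"], default="fr")
    parser.add_argument("--seed_size", type=int, default=20)
    parser.add_argument("--num_threads", type=int, default=8)
    # Assumed to be a simple on/off flag.
    parser.add_argument("--borderless", action="store_true")
    return parser


args = build_parser().parse_args(["--list_enzyme=EcoRI,HinfI", "--mode=all"])
print(args.list_enzyme.split(","), args.mode, args.seed_size)
# ['EcoRI', 'HinfI'] all 20
```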
### Example Command
```bash
parasplit --source_forward="../data/R1.fq.gz" --source_reverse="../data/R2.fq.gz" --output_forward="../data/output_forward.fq.gz" --output_reverse="../data/output_reverse.fq.gz" --list_enzyme=EcoRI,HinfI --mode=all --seed_size=20 --num_threads=8
```
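The example command selects `--mode=all`. This README does not define the two modes, so the sketch below rests on an assumption: that `fr` pairs each fragment of the forward read with each fragment of the reverse read, and that `all` pairs every combination of two fragments from both reads. `make_pairs` is a hypothetical name, not part of parasplit.

```python
from itertools import combinations, product


def make_pairs(forward_frags, reverse_frags, mode="fr"):
    """Pair fragments of one read pair according to the assumed mode semantics."""
    if mode == "fr":
        # Forward fragments against reverse fragments only.
        return list(product(forward_frags, reverse_frags))
    if mode == "all":
        # Every unordered pair among all fragments of both reads.
        return list(combinations(forward_frags + reverse_frags, 2))
    raise ValueError(f"unknown mode: {mode!r}")


print(make_pairs(["F1", "F2"], ["R1"], mode="fr"))
# [('F1', 'R1'), ('F2', 'R1')]
print(make_pairs(["F1", "F2"], ["R1"], mode="all"))
# [('F1', 'F2'), ('F1', 'R1'), ('F2', 'R1')]
```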
## Main Script
- **Pretreatment:** Retrieval of restriction sites from the Biopython database and allocation of resources for the different processes.
- **Read:** Decompresses and reads both FASTQ files simultaneously, then sends the reads to a multiprocessing queue.
- **Frag:** Retrieves sequences from the queue, splits them into fragments at restriction enzyme sites, builds fragment pairs, and sends them to the output multiprocessing queue.
- **WriteAndControl:** Streams data from the output queue to the output files, compressing in parallel.
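The stages above form a producer/consumer pipeline connected by queues. The following is a minimal sketch of that shape, not the actual implementation: it uses threads and `queue.Queue` as a stand-in for processes and multiprocessing queues, splits on a plain string instead of a regex, and all function names are hypothetical.

```python
import queue
import threading

SENTINEL = None  # marks the end of a stream


def read_stage(read_pairs, out_q):
    """Push each (forward, reverse) read pair to the queue, then the sentinel."""
    for pair in read_pairs:
        out_q.put(pair)
    out_q.put(SENTINEL)


def frag_stage(in_q, out_q, site):
    """Split both reads at `site` and emit every forward/reverse fragment pair."""
    while True:
        item = in_q.get()
        if item is SENTINEL:
            out_q.put(SENTINEL)
            break
        forward, reverse = item
        for f_frag in forward.split(site):
            for r_frag in reverse.split(site):
                out_q.put((f_frag, r_frag))


def write_stage(in_q, sink):
    """Drain the output queue into `sink` (a list standing in for output files)."""
    while True:
        item = in_q.get()
        if item is SENTINEL:
            break
        sink.append(item)


def run_pipeline(read_pairs, site):
    reads_q, pairs_q, sink = queue.Queue(), queue.Queue(), []
    stages = [
        threading.Thread(target=read_stage, args=(read_pairs, reads_q)),
        threading.Thread(target=frag_stage, args=(reads_q, pairs_q, site)),
        threading.Thread(target=write_stage, args=(pairs_q, sink)),
    ]
    for stage in stages:
        stage.start()
    for stage in stages:
        stage.join()
    return sink


print(run_pipeline([("AAGATCTT", "CCCC")], "GATC"))
# [('AA', 'CCCC'), ('TT', 'CCCC')]
```

A sentinel value passed down the chain lets each stage shut down cleanly once its input is exhausted. With several fragmentation workers, one sentinel per worker would be needed.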
## Project architecture
![Architecture diagram](images/EnglishVersion.svg)
*Architecture diagram - License: [CC BY-NC 4.0](https://creativecommons.org/licenses/by-nc/4.0/)*
## Implementation Details
- The script uses pigz for parallel decompression and compression to handle large datasets efficiently.
- Signal handlers are implemented to ensure graceful termination of processes.
- The main processing function reads input files, processes sequences to identify and fragment them at restriction sites, and writes the results to output files.
- Multi-threading is utilized for various stages of processing, including decompression, fragmentation, and compression.
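To make the pigz point concrete, here is one way to stream lines out of a compressed FASTQ file through a `pigz` subprocess. It is a sketch under assumptions: `iter_decompressed_lines` is a hypothetical helper, the fallback to Python's `gzip` module when pigz is missing is an addition for portability, and parasplit's own invocation may differ.

```python
import gzip
import shutil
import subprocess


def iter_decompressed_lines(path, num_threads=8):
    """Yield text lines from a .gz file, through pigz when it is installed."""
    if shutil.which("pigz") is None:
        # Fallback for machines without pigz: single-threaded gzip module.
        with gzip.open(path, "rt") as handle:
            yield from handle
        return
    # -d: decompress, -c: write to stdout, -p: number of threads.
    proc = subprocess.Popen(
        ["pigz", "-d", "-c", "-p", str(num_threads), path],
        stdout=subprocess.PIPE,
        text=True,
    )
    try:
        yield from proc.stdout
    finally:
        # Close the pipe and reap the child even if the caller stops early.
        proc.stdout.close()
        proc.wait()
```

Reading from the pipe line by line keeps memory use flat regardless of file size, which matters for large sequencing runs.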
## Dependencies
- Python 3
- pigz
## Testing
### Documentation of the `tests/` Directory
#### File `test_main.py`
- **Purpose:** This file contains unit tests to verify the correct functioning of the tool. The reference files were generated by hicstuff (version 3.2.3) cutsite for a zero seed size and the DpnII enzyme.
- **Examples of Tests:**
  - `test_process_file`: Verifies that the `cut` function correctly processes an input file and generates the expected output file.
  - Additional tests specific to the different functionalities (modes) of the program.
#### Directory `input_data/`
- **Purpose:** Contains the input data used to test various configurations of the program.
- **Examples:**
  - `R1.fq.gz`, `R2.fq.gz`: Compressed FASTQ files containing DNA sequences for testing fragmentation.
#### Directory `output_data/`
- **Purpose:** Contains the expected results of the tests.
- **Examples:**
  - `output_ref_R1.fq.gz`, `output_ref_R2.fq.gz`: Compressed FASTQ files representing the expected result after processing by the program.
### Running Tests
To run the tests, use the following command:
```bash
pytest tests/
```
This command runs every test defined in the `tests/` directory and reports whether the program produces the expected results.
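Comparing a produced file with a reference file can be done on the decompressed records rather than on the compressed bytes, since gzip output can differ between runs. The helpers below are a hypothetical sketch of such a check, not the contents of `test_main.py`. Sorting the records before comparing is an assumption made here so that output order does not matter.

```python
import gzip


def read_fastq_gz(path):
    """Return the records of a gzipped FASTQ file as 4-line tuples."""
    with gzip.open(path, "rt") as handle:
        lines = [line.rstrip("\n") for line in handle]
    return [tuple(lines[i:i + 4]) for i in range(0, len(lines), 4)]


def same_fastq(path_a, path_b):
    """Compare two gzipped FASTQ files record by record, ignoring order."""
    return sorted(read_fastq_gz(path_a)) == sorted(read_fastq_gz(path_b))
```

In a pytest test this could be used as `assert same_fastq(produced_path, reference_path)`, where both paths are placeholders for files such as those in `output_data/`.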
## Project tree structure
    ├── myproject/
    │   ├── __init__.py
    │   ├── main.py
    │   ├── Frag.py
    │   ├── Read.py
    │   ├── Pretreatment.py
    │   └── WriteAndControl.py
    ├── pyproject.toml
    ├── requirements-dev.txt
    ├── docs/
    │   └── requirements.txt
    ├── test/
    │   ├── __init__.py
    │   ├── test_main.py
    │   ├── input_data/
    │   │   ├── R1.fq.gz
    │   │   └── R2.fq.gz
    │   └── output_data/
    │       ├── output_ref_R1.fq.gz
    │       ├── output_ref_R2.fq.gz
    │       ├── output_ref_all_R1.fq.gz
    │       └── output_ref_all_R2.fq.gz
    └── README.md
## Contact
For questions or issues, please contact [samir.bertache.djenadi@gmail.com](mailto:samir.bertache.djenadi@gmail.com).
---
This README provides an overview of the Cutsite Script's functionality, usage instructions, and implementation details. For more detailed information, refer to the script's source code and docstrings.
Raw data
{
"_id": null,
"home_page": null,
"name": "parasplit",
"maintainer": null,
"docs_url": null,
"requires_python": null,
"maintainer_email": null,
"keywords": "Hi-C, HiC, bioinformatics, cutsite",
"author": null,
"author_email": "Bertache Djenadi <samir.bertache.djenadi@gmail.com>",
"download_url": "https://files.pythonhosted.org/packages/fc/9f/dbfc294df6e85fee362053ae714ca6c7289d9bc4b597220585878c5339b3/parasplit-1.1.2.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "AGPLv3",
"summary": "An Hi-C tool for cutting sequences using specified enzymes",
"version": "1.1.2",
"project_urls": null,
"split_keywords": [
"hi-c",
" hic",
" bioinformatics",
" cutsite"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "ca3e73891515e4288bb257aa01c435a647f20dc43115b18ffaf4798b02eba833",
"md5": "169d5c78be6406e1678d7e1a2c43b6eb",
"sha256": "8f5c762ba355ad2c3d574cda4e8b93d9a664d0e5e5c93633369f5f6f0378aefe"
},
"downloads": -1,
"filename": "parasplit-1.1.2-py3-none-any.whl",
"has_sig": false,
"md5_digest": "169d5c78be6406e1678d7e1a2c43b6eb",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": null,
"size": 15939,
"upload_time": "2024-07-19T08:26:01",
"upload_time_iso_8601": "2024-07-19T08:26:01.495995Z",
"url": "https://files.pythonhosted.org/packages/ca/3e/73891515e4288bb257aa01c435a647f20dc43115b18ffaf4798b02eba833/parasplit-1.1.2-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "fc9fdbfc294df6e85fee362053ae714ca6c7289d9bc4b597220585878c5339b3",
"md5": "2eea274635d29557ce6f3567f588ad8e",
"sha256": "c1652e0fa9c5f204b54a994b45efebebd33fb6b280865820d32833e58dc481ad"
},
"downloads": -1,
"filename": "parasplit-1.1.2.tar.gz",
"has_sig": false,
"md5_digest": "2eea274635d29557ce6f3567f588ad8e",
"packagetype": "sdist",
"python_version": "source",
"requires_python": null,
"size": 14320,
"upload_time": "2024-07-19T08:26:02",
"upload_time_iso_8601": "2024-07-19T08:26:02.982522Z",
"url": "https://files.pythonhosted.org/packages/fc/9f/dbfc294df6e85fee362053ae714ca6c7289d9bc4b597220585878c5339b3/parasplit-1.1.2.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-07-19 08:26:02",
"github": false,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"lcname": "parasplit"
}