parasplit

Name: parasplit
Version: 1.1.2
Summary: A Hi-C tool for cutting sequences using specified enzymes
Author email: Bertache Djenadi <samir.bertache.djenadi@gmail.com>
Upload time: 2024-07-19 08:26:02
License: AGPLv3
Keywords: Hi-C, HiC, bioinformatics, cutsite
            <!--
SPDX-FileCopyrightText: 2024 Samir Bertache

SPDX-License-Identifier: CC0-1.0
-->

# CUTSITE SCRIPT README

## Overview


Parasplit is a Python script designed to process paired-end FASTQ files by fragmenting DNA sequences at specified restriction enzyme sites. It efficiently handles large datasets by leveraging multi-threading for decompression and compression using pigz.

## Features


- **Find and Utilize Restriction Enzyme Sites:** Automatically identifies ligation sites from provided enzyme names and generates regex patterns to locate these sites in sequences.

- **Fragmentation:** Splits sequences at restriction enzyme sites, creating smaller fragments filtered by the specified seed size (see the sketch after this list).

- **Multi-threading:** Efficiently processes large datasets by utilizing multiple threads for decompression and compression.

- **Custom Modes:** Supports different pairing modes for sequence fragments.
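
The sketch below illustrates the general idea behind the first two features, assuming Biopython's `Restriction` data is used to look up recognition sites (as described under "Main Script") and a compiled regex is used to fragment reads. The helper names `build_site_pattern` and `fragment` are hypothetical, and details may differ from the actual parasplit implementation; ambiguous IUPAC codes (e.g. the `N` in HinfI's `GANTC`) are not expanded here.

```python
# Illustrative sketch only: parasplit's actual ligation-site construction and
# regex handling may differ.
import re
from Bio import Restriction  # Biopython

def build_site_pattern(enzyme_names):
    """Compile a regex matching the recognition sites of the given enzymes."""
    sites = [getattr(Restriction, name).site for name in enzyme_names]  # e.g. "GATC"
    return re.compile("|".join(sites))

def fragment(sequence, pattern, seed_size=20):
    """Split a read at every matched site and keep fragments >= seed_size."""
    return [piece for piece in pattern.split(sequence) if len(piece) >= seed_size]

pattern = build_site_pattern(["DpnII"])
read = "ACGTTGCA" * 3 + "GATC" + "TTGACCAA" * 3
print(fragment(read, pattern, seed_size=20))  # two fragments of 24 bp each
```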


## Installation


Ensure you have Python 3 installed along with the required dependencies:

```bash
sudo apt-get install pigz
pip install parasplit
```


## Usage


The script can be executed from the command line with various arguments to customize its behavior.


### Command-Line Arguments


- `--source_forward` (str): Input file path for forward reads. Default is `../data/R1.fq.gz`.

- `--source_reverse` (str): Input file path for reverse reads. Default is `../data/R2.fq.gz`.

- `--output_forward` (str): Output file path for processed forward reads. Default is `../data/output_forward.fq.gz`.

- `--output_reverse` (str): Output file path for processed reverse reads. Default is `../data/output_reverse.fq.gz`.

- `--list_enzyme` (str): Comma-separated list of restriction enzymes. Default is "No restriction enzyme found."

- `--mode` (str): Mode of pairing fragments. Options are `all` or `fr`. Default is `fr`.

- `--seed_size` (int): Minimum length of fragments to keep. Default is 20.

- `--num_threads` (int): Number of threads to use for processing. Default is 8.

- `--borderless`: Do not conserve ligation sites in the output fragments.

### Example Command


```bash
parasplit --source_forward="../data/R1.fq.gz" --source_reverse="../data/R2.fq.gz" --output_forward="../data/output_forward.fq.gz" --output_reverse="../data/output_reverse.fq.gz" --list_enzyme=EcoRI,HinfI --mode=all --seed_size=20 --num_threads=8
```


## Main Script


- **Pretreatment:** Retrieves restriction sites from the Biopython database and allocates resources to the different processes.

- **Read:** Decompresses and reads the FASTQ files simultaneously and sends the reads to a multiprocessing queue.

- **Frag:** Retrieves sequences from the queue, splits them into fragments at restriction enzyme sites, builds read pairs, and sends them to a multiprocessing queue.

- **WriteAndControl:** Streams data from the output queue to the output files and compresses it in parallel (see the pipeline sketch below).
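
A minimal sketch of how such a queue-based pipeline can be wired with Python's `multiprocessing`, assuming one process per stage and a sentinel value to signal the end of input; the function bodies are stand-ins, and the real parasplit process layout, queue sizes, and shutdown handling may differ.

```python
# Illustrative sketch of a read -> frag -> write pipeline; not parasplit's
# actual code. A None sentinel marks the end of the stream.
from multiprocessing import Process, Queue

SENTINEL = None

def read(in_queue):
    """Producer: push (forward, reverse) read pairs onto the queue."""
    for pair in [("ACGTGATCACGT", "TTTTGATCAAAA")]:  # stand-in for real FASTQ input
        in_queue.put(pair)
    in_queue.put(SENTINEL)

def frag(in_queue, out_queue):
    """Worker: split each read pair at a site (hard-coded "GATC" here) and forward the pieces."""
    while (pair := in_queue.get()) is not SENTINEL:
        out_queue.put(tuple(seq.split("GATC") for seq in pair))
    out_queue.put(SENTINEL)

def write_and_control(out_queue):
    """Consumer: stream results out (here, simply print them)."""
    while (item := out_queue.get()) is not SENTINEL:
        print(item)

if __name__ == "__main__":
    q_in, q_out = Queue(), Queue()
    stages = [Process(target=read, args=(q_in,)),
              Process(target=frag, args=(q_in, q_out)),
              Process(target=write_and_control, args=(q_out,))]
    for p in stages:
        p.start()
    for p in stages:
        p.join()
```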


## Project architecture

![Architecture diagram](images/EnglishVersion.svg)

*Architecture diagram - License: [CC BY-NC 4.0](https://creativecommons.org/licenses/by-nc/4.0/)*

## Implementation Details


- The script uses pigz for parallel decompression and compression to handle large datasets efficiently (see the sketch after this list).
- Signal handlers are implemented to ensure graceful termination of processes.
- The main processing function reads input files, processes sequences to identify and fragment them at restriction sites, and writes the results to output files.
- Multi-threading is utilized for various stages of processing, including decompression, fragmentation, and compression.
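
As an illustration of the first point, streaming (de)compression through pigz can be done with subprocess pipes, as in the sketch below. The helper names are hypothetical and the real I/O code may be organized differently; the `-d`, `-c`, and `-p` flags are standard pigz options.

```python
# Illustrative sketch: stream a .fq.gz file through pigz and recompress it,
# without holding the whole file in memory. Assumes pigz is on the PATH.
import subprocess

def open_pigz_reader(path, threads=8):
    """Decompress `path` with pigz and expose its stdout as a text stream."""
    return subprocess.Popen(["pigz", "-dc", "-p", str(threads), path],
                            stdout=subprocess.PIPE, text=True)

def open_pigz_writer(path, threads=8):
    """Compress everything written to the returned process's stdin into `path`."""
    sink = open(path, "wb")
    proc = subprocess.Popen(["pigz", "-c", "-p", str(threads)],
                            stdin=subprocess.PIPE, stdout=sink)
    return proc, sink

reader = open_pigz_reader("R1.fq.gz")
writer, sink = open_pigz_writer("copy_R1.fq.gz")
for line in reader.stdout:            # line-by-line streaming
    writer.stdin.write(line.encode())
writer.stdin.close()
writer.wait()
sink.close()
reader.wait()
```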


## Dependencies


- Python 3
- pigz

## Testing

### Documentation of the `tests/` Directory

#### File `test_main.py`

- **Purpose:** This file contains unit tests to verify the correct functioning of the tool. The reference files were generated with the cutsite module of hicstuff (version 3.2.3), using a seed size of zero and the DpnII enzyme.

- **Examples of Tests:**
    - `test_process_file`: Verifies that the `cut` function correctly processes an input file and generates the expected output file.
    - Additional tests covering the different modes of the program (an illustrative test sketch is shown below).
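
For orientation, a test in this spirit might look like the sketch below. `run_parasplit` is a hypothetical helper that drives the command-line interface; the real `tests/test_main.py` may call the package's functions directly and use different paths.

```python
# Illustrative sketch only; not the actual contents of tests/test_main.py.
import gzip
import subprocess

def run_parasplit(r1, r2, out1, out2):
    """Hypothetical helper: run the parasplit CLI with DpnII and seed_size=0."""
    subprocess.run(
        ["parasplit",
         f"--source_forward={r1}", f"--source_reverse={r2}",
         f"--output_forward={out1}", f"--output_reverse={out2}",
         "--list_enzyme=DpnII", "--seed_size=0"],
        check=True,
    )

def test_process_file(tmp_path):
    out1, out2 = tmp_path / "R1.fq.gz", tmp_path / "R2.fq.gz"
    run_parasplit("tests/input_data/R1.fq.gz", "tests/input_data/R2.fq.gz", out1, out2)
    # Compare the decompressed output against the hicstuff-generated reference.
    with gzip.open(out1, "rt") as got, \
         gzip.open("tests/output_data/output_ref_R1.fq.gz", "rt") as ref:
        assert got.read() == ref.read()
```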

#### Directory `input_data/`

- **Purpose:** Contains specific input data used to test various configurations of the program.
- **Examples:**
    - `R1.fq.gz`, `R2.fq.gz`: Compressed FASTQ files containing DNA sequences for testing fragmentation.

#### Directory `output_data/`

- **Purpose:** Contains the expected results of the tests.
- **Examples:**
    - `output_ref_R1.fq.gz`, `output_ref_R2.fq.gz`: Compressed FASTQ files representing the expected result after processing by the program.

### Running Tests

To run the tests, use the following command:

```bash
pytest tests/
```

This command executes all tests defined in the `tests/` directory and verifies that the program functions correctly.


## Project tree structure


			├── myproject/
			│   ├── __init__.py
			│   ├── main.py
			│   ├── Frag.py
			│   ├── Read.py
			│   ├── Pretreatment.py
			│   └── WriteAndControl.py
			├── pyproject.toml
			├── requirements-dev.txt
			├── docs/
			│   ├── requirements.txt
			├── test/
			│   ├── __init__.py
			│   ├── test_main.py	
			│   ├── input_data/
			│   │   ├── R1.fq.gz
			│   │   └── R2.fq.gz
			│   └── output_data/
			│       ├── output_ref_R1.fq.gz
			│       ├── output_ref_R2.fq.gz
			│       ├── output_ref_all_R1.fq.gz
			│       └── output_ref_all_R2.fq.gz
			└── README.md
			
## Contact


For questions or issues, please contact [samir.bertache.djenadi@gmail.com](mailto:samir.bertache.djenadi@gmail.com).


---

This README provides an overview of the Cutsite Script's functionality, usage instructions, and implementation details. For more detailed information, refer to the script's source code and docstrings.

			

            
