fastqwiper

Name	fastqwiper
Version	2024.1.93
home_page	https://github.com/mazzalab/fastqwiper
Summary	An ensemble method to recover corrupted FASTQ files, drop or fix pesky lines, remove unpaired reads, and fix reads interleaving
upload_time	2024-05-26 07:23:00
maintainer	None
docs_url	None
author	Tommaso Mazza
requires_python	<3.13,>=3.7
license	MIT
keywords	genomics ngs fastq bioinformatics
requirements	No requirements were recorded.

# FastqWiper
[![Build](https://github.com/mazzalab/fastqwiper/actions/workflows/buildall_and_publish.yml/badge.svg)](https://github.com/mazzalab/fastqwiper/actions/workflows/buildall_and_publish.yml) [![codecov](https://codecov.io/gh/mazzalab/fastqwiper/graph/badge.svg?token=5V4AQTK619)](https://codecov.io/gh/mazzalab/fastqwiper) [![GitHub issues](https://img.shields.io/github/issues-raw/mazzalab/fastqwiper)](https://github.com/mazzalab/fastqwiper/issues) 

[![Anaconda-Server Badge](https://anaconda.org/bfxcss/fastqwiper/badges/version.svg)](https://anaconda.org/bfxcss/fastqwiper) [![Anaconda-Server Badge](https://anaconda.org/bfxcss/fastqwiper/badges/latest_release_date.svg)](https://anaconda.org/bfxcss/fastqwiper) [![Anaconda-Server Badge](https://anaconda.org/bfxcss/fastqwiper/badges/platforms.svg)](https://anaconda.org/bfxcss/fastqwiper) [![Anaconda-Server Badge](https://anaconda.org/bfxcss/fastqwiper/badges/downloads.svg)](https://anaconda.org/bfxcss/fastqwiper)

[![PyPI version](https://badge.fury.io/py/fastqwiper.svg)](https://badge.fury.io/py/fastqwiper) [![PyPI pyversions](https://img.shields.io/pypi/pyversions/fastqwiper.svg)](https://pypi.python.org/pypi/fastqwiper/) ![PyPI - Downloads](https://img.shields.io/pypi/dm/fastqwiper)

[![Docker](https://badgen.net/badge/icon/docker?icon=docker&label)](https://hub.docker.com/r/mazzalab/fastqwiper) ![Docker Pulls](https://img.shields.io/docker/pulls/mazzalab/fastqwiper)

`FastqWiper` is a Snakemake-enabled application that wipes out bad reads from broken FASTQ files. Additionally, the available pre-designed Snakemake [workflows](https://github.com/mazzalab/fastqwiper/tree/main/pipeline) allow **recovering** corrupted `fastq.gz` files, **dropping** or **fixing** pesky lines, **removing** unpaired reads, and **settling** read interleaving.

* Compatibility: Python ≥3.7, <3.13
* OS: Windows, Linux, Mac OS (Snakemake workflows only through Docker for Windows)
* Contributions: [bioinformatics@css-mendel.it](mailto:bioinformatics@css-mendel.it)
* Docker: https://hub.docker.com/r/mazzalab/fastqwiper
* Singularity: https://cloud.sylabs.io/library/mazzalab/fastqwiper/fastqwiper.sif
* Bug report: [https://github.com/mazzalab/fastqwiper/issues](https://github.com/mazzalab/fastqwiper/issues)


## USAGE
- <code style="color : greenyellow">**Case 1.**</code> You have one FASTQ file or a pair (R1 and R2) of FASTQ files that are **computer readable** (meaning that the .gz files can be successfully decompressed or that the .fastq files can be read from the beginning to the EOF) but contain pesky, malformed, non-compliant lines: use *FastqWiper* to clean them;
- <code style="color : darkorange">**Case 2.**</code> You have one FASTQ file or a pair (R1 and R2) of **computer readable** FASTQ files from which you want to drop unpaired reads or whose read interleaving you want to fix: use FastqWiper's *Snakemake workflows*;
- <code style="color : orangered">**Case 3.**</code> You have one `fastq.gz` file or a pair (R1 and R2) of `fastq.gz` files that are corrupted (**unreadable**, meaning that the .gz files cannot be successfully decompressed) and you want to recover healthy reads and reformat them: use FastqWiper's *Snakemake workflows*.


## Installation
### <code style="color : greenyellow">Case 1</code>
This case only requires installing FastqWiper; the *workflows* are <u>not</u> needed. You can install it on all supported OSs:

#### Use Conda

```
conda create -n fastqwiper python=3.11
conda activate fastqwiper
conda install -c bfxcss -c conda-forge fastqwiper

fastqwiper --help
```
*Hint: for a smoother experience, use* **mamba**


#### Use PyPI
```
pip install fastqwiper

fastqwiper --help
```
<br/>

`fastqwiper  <options>`
```
options:
  -i, --fastq_in TEXT          The input FASTQ file to be cleaned  [required]
  -o, --fastq_out TEXT         The wiped FASTQ file                [required]
  -l, --log_frequency INTEGER  The number of reads you want to print a status message. Default: 500000
  -f, --log_out TEXT           The file name of the final quality report summary. Print on the screen if not specified
  -a, --alphabet               Allowed character in the SEQ line. Default: ACGTN
  -h, --help                   Show this message and exit.
```
It accepts only **readable** `*.fastq` or `*.fastq.gz` files as input.
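To make "compliant with the FASTQ format" concrete, here is a minimal Python sketch of a four-line record check that drops anything not forming a well-formed record. It illustrates the idea only and is not FastqWiper's actual implementation; the function names `is_valid_record` and `wipe` are hypothetical, and only the default alphabet `ACGTN` comes from the options above.

```python
def is_valid_record(header, seq, plus, qual, alphabet="ACGTN"):
    """Return True if the four lines form a well-formed FASTQ record."""
    return (
        header.startswith("@")          # line 1: read identifier
        and len(seq) > 0
        and set(seq) <= set(alphabet)   # line 2: only allowed characters
        and plus.startswith("+")        # line 3: separator
        and len(qual) == len(seq)       # line 4: one quality symbol per base
    )


def wipe(lines, alphabet="ACGTN"):
    """Return the well-formed records found in a list of text lines.

    When a four-line window is not a valid record, the scan advances by
    one line to resynchronize instead of discarding the whole window.
    """
    lines = [line.rstrip("\n") for line in lines]
    records = []
    i = 0
    while i + 3 < len(lines):
        window = tuple(lines[i:i + 4])
        if is_valid_record(*window, alphabet=alphabet):
            records.append(window)
            i += 4
        else:
            i += 1
    return records
```

For real files, prefer the `fastqwiper` command itself; the sketch only shows which properties a record is expected to satisfy.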


### <code style="color : darkorange">Case 2</code> & <code style="color : orangered">Case 3</code>
There are <b>QUICK</b> and <b>SLOW</b> methods to configure `FastqWiper`'s workflows.


#### One quick way (Docker)
1. Pull the Docker image from DockerHub:

`docker pull mazzalab/fastqwiper`

2. Once the image has been downloaded, type:

CMD: `docker run --rm -ti --name fastqwiper -v "YOUR_LOCAL_PATH_TO_DATA_FOLDER:/fastqwiper/data" mazzalab/fastqwiper paired 8 sample 50000000 33 ACGTN 500000`

#### Another quick way (Singularity)
1. Pull the Singularity image from the Cloud Library:

`singularity pull library://mazzalab/fastqwiper/fastqwiper.sif`

2. Once the image has been downloaded (e.g., fastqwiper.sif_2024.1.89.sif), type:

CMD: `singularity run --bind YOUR_LOCAL_PATH_TO_DATA_FOLDER:/fastqwiper/data --writable-tmpfs fastqwiper.sif_2024.1.89.sif paired 8 sample 50000000 33 ACGTN 500000`

If you want to bind the `.singularity` cache folder and the `logs` folder, you can omit `--writable-tmpfs`, create the folders `.singularity` and `logs` (`mkdir .singularity logs`) on the host system, and use this command instead:

CMD: `singularity run --bind YOUR_LOCAL_PATH_TO_DATA_FOLDER/:/fastqwiper/data --bind YOUR_LOCAL_PATH_TO_.SNAKEMAKE_FOLDER/:/fastqwiper/.snakemake --bind YOUR_LOCAL_PATH_TO_LOGS_FOLDER/:/fastqwiper/logs fastqwiper.sif_2024.1.89.sif paired 8 sample 50000000 33 ACGTN`

For both **Docker** and **Singularity**:

- `YOUR_LOCAL_PATH_TO_DATA_FOLDER` is the path of the folder where the fastq.gz files to be wiped are located;
- `paired` triggers the cleaning of R1 and R2. Alternatively, `single` will trigger the wipe of individual FASTQ files;
- `8` is the number of computing cores to use;
- `sample` is the part of the FASTQ file names that identifies the files to be wiped. <b>Be aware</b> that for <b>paired-end</b> files (e.g., "sample_R1.fastq.gz" and "sample_R2.fastq.gz"), file names must end with `_R1.fastq.gz` and `_R2.fastq.gz`. The argument to pass is therefore everything before these suffixes: `sample` in this case. For <b>single-end</b>/individual files (e.g., "excerpt_R1_001.fastq.gz"), the file name must end with `.fastq.gz`; the preceding text, i.e., "excerpt_R1_001" in this case, is the argument to pass to the command.
- `50000000` (optional) is the number of rows per chunk (used when cores>1; it must be a multiple of 4). Setting it too high reduces the advantage of parallelism. Setting it too low produces many more chunks than available CPUs, making parallelism inefficient. Choose this number depending on the total number of reads in your starting file.
- `33` (optional) is the ASCII offset (33=Sanger, 64=old Solexa)
- `ACGTN` (optional) is the allowed alphabet in the SEQ line of the FASTQ file
- `500000` (optional) is the log frequency (# reads)
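As a rough aid for choosing the rows-per-chunk value, the Python sketch below applies one possible heuristic: one chunk per core, expressed in rows (4 rows per read), so the result is always a multiple of 4. The heuristic and the function name `rows_per_chunk` are assumptions made for illustration; FastqWiper does not prescribe them.

```python
import math


def rows_per_chunk(total_reads, cores):
    """Suggest a rows-per-chunk value that yields about one chunk per core.

    Each FASTQ read spans 4 rows, so the returned value is a multiple of 4.
    """
    if total_reads < 1 or cores < 1:
        raise ValueError("total_reads and cores must be positive")
    reads_per_chunk = math.ceil(total_reads / cores)
    return reads_per_chunk * 4


# 100 million reads on 8 cores gives the 50000000 used in the commands above.
print(rows_per_chunk(100_000_000, 8))
```

If the total number of reads is unknown, it can be estimated as the number of lines in the decompressed file divided by 4.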

### <code style="color : red">The slow way (Linux & Mac OS)</code>
To enable the use of preconfigured [pipelines](https://github.com/mazzalab/fastqwiper/tree/main/pipeline), you need to install **Snakemake**. The recommended way to install Snakemake is via Conda, because it enables **Snakemake** to [handle software dependencies of your workflow](https://snakemake.readthedocs.io/en/stable/snakefiles/deployment.html#integrated-package-management).
However, the default conda solver is slow and often hangs. Therefore, we recommend installing [Mamba](https://github.com/mamba-org/mamba) as a drop-in replacement via

`$ conda install -c conda-forge mamba`

if you have anaconda/miniconda already installed, or directly installing `Mambaforge` as described [here](https://github.com/conda-forge/miniforge#mambaforge).


Then, create and activate a clean environment as above:

```
mamba create -n fastqwiper python=3.11
mamba activate fastqwiper
```
Finally, install the Snakemake dependency:

```
$ mamba install -c bioconda snakemake
```


#### Usage
Clone the FastqWiper repository in a folder of your choice and enter it:

```
git clone https://github.com/mazzalab/fastqwiper.git
cd fastqwiper
```

It contains, in particular, a folder `data` containing the fastq files to be processed, a folder `pipeline` containing the released pipelines and a folder `fastq_wiper` with the source files of `FastqWiper`. <br/>
Input files to be processed must be copied into the **data** folder.

Currently, to run the `FastqWiper` pipelines, the following packages need to be installed manually:

### Required packages:
[gzrt](https://github.com/arenn/gzrt) (Linux build from source [instructions](https://github.com/arenn/gzrt/blob/master/README.build), Ubuntu install [instructions](https://howtoinstall.co/en/gzrt), Mac OS install [instructions](https://formulae.brew.sh/formula/gzrt))

[BBTools](https://jgi.doe.gov/data-and-tools/software-tools/bbtools/) (install [instructions](https://jgi.doe.gov/data-and-tools/software-tools/bbtools/bb-tools-user-guide/installation-guide/))

If installed from source, `gzrt` scripts need to be put on PATH. `bbmap` must be installed in the root folder of FastqWiper, as shown in the image below:

![FastqWiper folder hierarchy](assets/hierarchy.png)

### Commands:
Copy the fastq files you want to fix into the `data` folder.

<code style="color : orange">**N.b.**: In all commands below, you pass the name of the sample to be analyzed to the workflow through the config argument `sample_name`. Remember that the names of your fastq files must end with `_R1.fastq.gz` and `_R2.fastq.gz` for paired fastq files, and with `.fastq.gz` for individual fastq files; the text to assign to the variable `sample_name` is therefore everything before these suffixes. E.g., if your files are `my_sample_R1.fastq.gz` and `my_sample_R2.fastq.gz`, then `--config sample_name=my_sample`.</code>
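The naming rule can be expressed as a small Python helper that derives the `sample_name` value from a file name. This is a sketch for checking your own file names before launching a workflow; the function `sample_name_from` is hypothetical and not part of FastqWiper.

```python
def sample_name_from(filename, paired):
    """Return the text to pass as sample_name for a given FASTQ file name.

    Paired files must end with _R1.fastq.gz or _R2.fastq.gz;
    individual files must end with .fastq.gz.
    """
    suffixes = ("_R1.fastq.gz", "_R2.fastq.gz") if paired else (".fastq.gz",)
    for suffix in suffixes:
        if filename.endswith(suffix):
            return filename[: -len(suffix)]
    raise ValueError(f"{filename!r} does not end with one of {suffixes}")


print(sample_name_from("my_sample_R1.fastq.gz", paired=True))     # my_sample
print(sample_name_from("excerpt_R1_001.fastq.gz", paired=False))  # excerpt_R1_001
```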


#### Paired-end files

- **Get a dry run** of a pipeline (e.g., `fix_wipe_pairs_reads_sequential.smk`):<br />
`snakemake --config sample_name=my_sample qin=33 alphabet=ACGTN log_freq=1000 -s pipeline/fix_wipe_pairs_reads_sequential.smk --use-conda --cores 4 -np`

- **Generate the planned DAG**:<br />
`snakemake --config sample_name=my_sample qin=33 alphabet=ACGTN log_freq=1000 -s pipeline/fix_wipe_pairs_reads_sequential.smk --dag | dot -Tpdf > dag.pdf`<br /> <br />
<img src="https://github.com/mazzalab/fastqwiper/blob/main/pipeline/fix_wipe_pairs_reads.png?raw=true" width="400">

- **Run the pipeline** (n.b., during the first execution, Snakemake will download and install some required remote packages and may take longer). The number of computing cores can be tuned accordingly:<br />
`snakemake --config sample_name=my_sample alphabet=ACGTN log_freq=1000 -s pipeline/fix_wipe_pairs_reads_sequential.smk --use-conda --cores 2`

Fixed files will be copied into the `data` folder and will be suffixed with the string `_fixed_wiped_paired_interleaving`.
As a reminder, the `fix_wipe_pairs_reads_sequential.smk` and `fix_wipe_pairs_reads_parallel.smk` pipelines perform the following actions:
- execute `gzrt` on corrupted fastq.gz files (i.e., that cannot be unzipped because of errors) and recover readable reads;
- execute `FastqWiper` on recovered reads to make them compliant with the FASTQ format (source: [Wikipedia](https://en.wikipedia.org/wiki/FASTQ_format));
- execute `Trimmomatic` on wiped reads to remove residual unpaired reads;
- execute `BBmap (repair.sh)` on paired reads to restore the correct interleaving and sort fastq files.
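To illustrate what "removing unpaired reads" and "fixing interleaving" mean, the Python sketch below keeps only reads whose identifier occurs in both R1 and R2 and returns the mates in the same order. It is a conceptual, in-memory illustration under simplifying assumptions (records as 4-tuples, mate suffixes `/1` and `/2`); it is not how `Trimmomatic` or `repair.sh` work internally, and the function names are hypothetical.

```python
def read_id(header):
    """Extract the read identifier from a FASTQ header line.

    Drops the leading '@', any comment after whitespace, and a trailing
    /1 or /2 mate suffix (an assumption about the naming scheme).
    """
    name = header[1:].split()[0]
    if name.endswith(("/1", "/2")):
        name = name[:-2]
    return name


def drop_unpaired(r1_records, r2_records):
    """Keep only reads present in both files, with R2 reordered to match R1."""
    r2_by_id = {read_id(rec[0]): rec for rec in r2_records}
    paired_r1, paired_r2 = [], []
    for rec in r1_records:
        mate = r2_by_id.get(read_id(rec[0]))
        if mate is not None:
            paired_r1.append(rec)
            paired_r2.append(mate)
    return paired_r1, paired_r2
```

After this step, the i-th record of the first list and the i-th record of the second list are mates, which is the property downstream paired-end tools rely on.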

#### Single-end files
The `fix_wipe_single_reads_parallel.smk` and `fix_wipe_single_reads_sequential.smk` pipelines do not execute `trimmomatic` or BBmap's `repair.sh`.

- **Get a dry run** of a pipeline (e.g., `fix_wipe_single_reads_sequential.smk`):<br />
`snakemake --config sample_name=my_sample alphabet=ACGTN log_freq=1000 -s pipeline/fix_wipe_single_reads_sequential.smk --use-conda --cores 2 -np`

- **Generate the planned DAG**:<br />
`snakemake --config sample_name=my_sample alphabet=ACGTN log_freq=1000 -s pipeline/fix_wipe_single_reads_sequential.smk --dag | dot -Tpdf > dag.pdf`<br /><br />
<img src="https://github.com/mazzalab/fastqwiper/blob/main/pipeline/fix_wipe_single_reads.png?raw=true" width="200">

- **Run the pipeline** (n.b., the number of computing cores can be tuned accordingly):<br />
`snakemake --config sample_name=my_sample alphabet=ACGTN log_freq=1000 -s pipeline/fix_wipe_single_reads_sequential.smk --use-conda --cores 2`

# Author
**Tommaso Mazza**  
[![X](https://img.shields.io/badge/X-%23000000.svg?style=for-the-badge&logo=X&logoColor=white)](https://twitter.com/irongraft) [![LinkedIn](https://img.shields.io/badge/linkedin-%230077B5.svg?style=for-the-badge&logo=linkedin&logoColor=white)](https://www.linkedin.com/in/tommasomazza/)

Laboratory of Bioinformatics<br/>
Fondazione IRCCS Casa Sollievo della Sofferenza<br/>
Viale Regina Margherita 261 - 00198 Roma IT<br/>
Tel: +39 06 44160526 - Fax: +39 06 44160548<br/>
E-mail: t.mazza@operapadrepio.it<br/>
Web page: http://www.css-mendel.it<br/>
Web page: http://bioinformatics.css-mendel.it<br/>

Raw data

{
    "_id": null,
    "home_page": "https://github.com/mazzalab/fastqwiper",
    "name": "fastqwiper",
    "maintainer": null,
    "docs_url": null,
    "requires_python": "<3.13,>=3.7",
    "maintainer_email": null,
    "keywords": "genomics, ngs, fastq, bioinformatics",
    "author": "Tommaso Mazza",
    "author_email": "bioinformatics@css-mendel.it",
    "download_url": null,
    "platform": null,
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "An ensemble method to recover corrupted FASTQ files, drop or fix pesky lines, remove unpaired reads, and fix reads interleaving",
    "version": "2024.1.93",
    "project_urls": {
        "Developmental plan": "https://github.com/mazzalab/fastqwiper/projects",
        "Homepage": "https://github.com/mazzalab/fastqwiper",
        "Source": "https://github.com/mazzalab/fastqwiper",
        "Tracker": "https://github.com/mazzalab/fastqwiper/issues"
    },
    "split_keywords": [
        "genomics",
        " ngs",
        " fastq",
        " bioinformatics"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "180b3333aa2594a90abedd41d3c5602055d401802817e7151227a4b934f98c61",
                "md5": "2a15c3b5598720e2966b2e8cf9c9f29f",
                "sha256": "598ab21c2e0c6e304e834101d5d54ef08fe6167e0f491064af0d8965c73e125d"
            },
            "downloads": -1,
            "filename": "fastqwiper-2024.1.93-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "2a15c3b5598720e2966b2e8cf9c9f29f",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": "<3.13,>=3.7",
            "size": 12935,
            "upload_time": "2024-05-26T07:23:00",
            "upload_time_iso_8601": "2024-05-26T07:23:00.747741Z",
            "url": "https://files.pythonhosted.org/packages/18/0b/3333aa2594a90abedd41d3c5602055d401802817e7151227a4b934f98c61/fastqwiper-2024.1.93-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-05-26 07:23:00",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "mazzalab",
    "github_project": "fastqwiper",
    "travis_ci": false,
    "coveralls": true,
    "github_actions": true,
    "lcname": "fastqwiper"
}
