MergeGI


- Name: MergeGI
- Version: 0.1.0
- Summary: Merge MGI fastq files
- Upload time: 2022-12-09 10:29:50
- Author: Sequana Team
- Requires Python: >=3.7
- License: BSD-4-Clause
- Keywords: fastq, mgi, merger
- Requirements: no requirements were recorded

# MergeGI

[![Tests](https://github.com/sequana/MergeGI/actions/workflows/main.yml/badge.svg)](https://github.com/sequana/MergeGI/actions/workflows/main.yml)


![PyPI - Python Version](https://img.shields.io/pypi/pyversions/mergegi)

**MergeGI** provides a single command line to merge and select barcoded raw data from [MGI](https://en.mgi-tech.com/products/) sequencing runs into a set of FastQ files ready for subsequent bioinformatics analysis. 


- [Installation](#installation)
- [Overview](#overview)
- [General Usage and Examples](#general-usage-and-examples)


## Installation

We provide **MergeGI** as a Python library available on [PyPI](https://pypi.python.org). The standalone application is called **mergegi** and can be installed in an environment with Python 3.7 or later as follows:

    pip install mergegi

There are no dependencies other than the **click** package, so installation should be straightforward.


For developers:

    git clone git@github.com:sequana/MergeGI.git
    cd MergeGI
    pip install -e .[testing]


## Overview

The main goal of **MergeGI** is to select and merge the FastQ files generated by an MGI sequencer into a list of FastQ files directly usable for subsequent bioinformatics analysis. Why is this preprocessing needed?

First, MGI generates one FastQ file per barcode. You may not need all of those barcodes, yet the demultiplexing systematically searches for every barcode. Consequently, you end up with the FastQ files corresponding to your barcodes plus a number of FastQ files that should be ignored. Your wet-lab colleagues should provide the list of samples and their relevant barcodes.

Second, MGI technology requires barcodes to be processed in a specific manner, which means that a given sample may be split across several barcodes (files). A tool is therefore needed to merge such files. Again, the wet lab should provide the barcodes corresponding to a given sample. See the image in the "Barcode distribution example" section below for more explanation.

Third, an MGI flowcell has several lanes. You may or may not want to merge the lanes.

These three steps are handled seamlessly by **MergeGI**, given a sample sheet and the output directory of the MGI run.
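The three steps above can be pictured as a grouping problem. The following is a minimal sketch, not the actual **MergeGI** implementation: the function name `plan_merges` and the tuple layout are assumptions made for illustration. It shows how sample sheet entries could be grouped into output samples, with or without lane merging.

```python
from collections import defaultdict


def plan_merges(rows, merge_lanes=True):
    """Group (lane, barcode) pairs by output sample (hypothetical helper).

    rows: iterable of (sample, barcode, project, lane) tuples taken from
    the sample sheet. Barcodes that never appear in `rows` are never selected.
    """
    plan = defaultdict(list)
    for sample, barcode, project, lane in rows:
        # One output per sample, or one per sample and lane if lanes stay separate.
        key = (project, sample) if merge_lanes else (project, sample, lane)
        plan[key].append((lane, barcode))
    return dict(plan)


rows = [
    ("A", 1, "projectA", 1),
    ("B", 20, "projectA", 1),
    ("A", 1, "projectA", 2),
    ("B", 30, "projectA", 2),
]
plan = plan_merges(rows)
print(plan[("projectA", "B")])  # [(1, 20), (2, 30)]
```

Here sample B uses a different barcode on each lane, so two input files end up in the same output group.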

## General Usage and Examples


The directory structure expected by **MergeGI** is the standard output directory of an MGI run:

    OutputFq/Flowcell/L01
    OutputFq/Flowcell/L02

where L01 and L02 stand for lanes 1 and 2.
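As an illustration of this layout, here is a small sketch that lists the lane directories of a flowcell. The helper `find_lane_directories` is hypothetical and not part of **MergeGI**; it only assumes that lane directories are named `L` followed by digits, as in the listing above.

```python
import re
import tempfile
from pathlib import Path

LANE_PATTERN = re.compile(r"^L(\d+)$")


def find_lane_directories(flowcell_dir):
    """Map lane number -> directory for sub-directories named L01, L02, ..."""
    lanes = {}
    for path in Path(flowcell_dir).iterdir():
        match = LANE_PATTERN.match(path.name)
        if path.is_dir() and match:
            lanes[int(match.group(1))] = path
    return dict(sorted(lanes.items()))


# Build a throwaway directory tree that mimics the expected layout.
with tempfile.TemporaryDirectory() as tmp:
    flowcell = Path(tmp) / "OutputFq" / "Flowcell"
    for name in ("L01", "L02"):
        (flowcell / name).mkdir(parents=True)
    lanes = find_lane_directories(flowcell)
    print(sorted(lanes))  # [1, 2]
```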

The software needs a sample sheet that describes, for each entry, the sample name, the associated barcode identifier, an optional second barcode (if there is none, the column must still be present and left empty), the project name (used to create the new output directory), and the lane on which the sample/barcode pair was sequenced. Here is an example:

```csv
samplename,barcode,barcode2,project,lane
A,         1,,              projectA, 1
B,         20,,              projectA, 1
A,         1,,              projectA, 2
C,         20,,              projectB, 2
C,         30,,              projectB, 1
B,         30,,              projectA, 2
```

If a sample has been pooled on all four lanes, meaning that the same barcode is used on each lane, you can use the `*` character in the lane column to simplify the sample sheet:

```csv
samplename,barcode,barcode2,project,lane
A,         1,      ,        projectA, *
B,         20,     ,        projectA, *
```

> **_IMPORTANT NOTE1:_**  The current version uses barcode 1 only (column `barcode`).

> **_IMPORTANT NOTE2:_**  The header must be present. The header names are not important, but the columns must appear in the expected order: sample name, barcode 1, barcode 2, project name, lane.
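To make the format concrete, the sketch below reads such a sample sheet with Python's standard `csv` module. It is an illustration under stated assumptions, not the parser used by **MergeGI**: the function name `read_samplesheet`, the default of four lanes used to expand `*`, and the conversion of barcodes to integers are all choices made for this example.

```python
import csv
import io


def read_samplesheet(handle, all_lanes=(1, 2, 3, 4)):
    """Return a list of (sample, barcode, project, lane) tuples.

    Columns are read by position, the header row is skipped, and a lane of
    "*" is expanded to every lane in `all_lanes` (an assumed default).
    """
    reader = csv.reader(handle)
    next(reader)  # the header must be present; its names are not used
    records = []
    for row in reader:
        if not row:
            continue  # skip blank lines
        # The second barcode column is read but ignored, as in NOTE1.
        sample, barcode, _barcode2, project, lane = (field.strip() for field in row)
        lanes = all_lanes if lane == "*" else (int(lane),)
        for lane_number in lanes:
            records.append((sample, int(barcode), project, lane_number))
    return records


sheet = io.StringIO(
    "samplename,barcode,barcode2,project,lane\n"
    "A, 1,, projectA, 1\n"
    "B, 20,, projectA, *\n"
)
records = read_samplesheet(sheet)
print(records[0])    # ('A', 1, 'projectA', 1)
print(len(records))  # 5
```

Reading columns by position rather than by header name mirrors the rule that only the column order matters.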


Given the sample sheet and the input directory (top level of the MGI run), the following command should create a new, clean directory containing the relevant FastQ files (here, the `merge_data` directory):

```bash
mergegi --samplesheet samplesheet.csv --input-directory mgi_raw_data --output-directory merge_data 
```

If the data is paired, add the `--paired` argument:

```bash
mergegi --samplesheet samplesheet.csv --input-directory mgi_raw_data --output-directory merge_data --paired
```


By default, lanes are merged. To keep the lanes separate, add the `--no-merge` option:

```bash
mergegi --samplesheet samplesheet.csv --input-directory mgi_raw_data --output-directory merge_data --paired --no-merge
```

## Changelog


| Version | Description                                 |
|---------|---------------------------------------------|
| 0.1.0   | simplify the CI action workflow and setup   |
| 0.0.1   | first release                               |


## Barcode distribution example

<img src="doc/bccode.png" width="50%">






            

Raw data

{
    "_id": null,
    "home_page": "",
    "name": "MergeGI",
    "maintainer": "",
    "docs_url": null,
    "requires_python": ">=3.7",
    "maintainer_email": "",
    "keywords": "fastq,MGI,merger",
    "author": "Sequana Team",
    "author_email": "",
    "download_url": "https://files.pythonhosted.org/packages/50/8c/1ba0f6d59e4220d354622c8ed9b17340a7a3b9c0c3019d611503260c0e71/MergeGI-0.1.0.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "BSD-4-Clause",
    "summary": "Merge MGI fastq files",
    "version": "0.1.0",
    "split_keywords": [
        "fastq",
        "mgi",
        "merger"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "md5": "0b1edb4f6964a661433a8a407ee989ad",
                "sha256": "94e293459a44cc8b7773646c094d72d3c5ee66c69b758c5049769b608e374cce"
            },
            "downloads": -1,
            "filename": "MergeGI-0.1.0.tar.gz",
            "has_sig": false,
            "md5_digest": "0b1edb4f6964a661433a8a407ee989ad",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.7",
            "size": 6291,
            "upload_time": "2022-12-09T10:29:50",
            "upload_time_iso_8601": "2022-12-09T10:29:50.622127Z",
            "url": "https://files.pythonhosted.org/packages/50/8c/1ba0f6d59e4220d354622c8ed9b17340a7a3b9c0c3019d611503260c0e71/MergeGI-0.1.0.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2022-12-09 10:29:50",
    "github": false,
    "gitlab": false,
    "bitbucket": false,
    "lcname": "mergegi"
}