refgenDetector

Name	refgenDetector
Version	1.0.1
	download
home_page
Summary	refgenDetector
upload_time	2023-07-25 12:34:07
maintainer
docs_url	None
author	Mireia Marin Ginestar
requires_python
license
keywords	python
VCS
bugtrack_url
requirements	No requirements were recorded.
Travis-CI	No Travis.
coveralls test coverage	No coveralls.

# EGA - RefgenDetector

RefgenDetector is a Python tool that infers the reference genome assembly used during read alignment of BAM and
CRAM files. It is designed to facilitate the analysis of genomic data with incomplete metadata annotation.
RefgenDetector can differentiate between major human reference genome releases, as well as commonly used flavors,
by using the mandatory SN (contig name) and LN (contig length) fields of the @SQ lines in BAM and CRAM headers.
The tool includes dictionaries of contig names and lengths, which let it identify contigs unique to a release and
tell apart flavors of the same reference genome.
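The idea of matching SN and LN values against a dictionary can be sketched as follows. This is a minimal illustration, not RefgenDetector's actual code: the function names and the `CHR1_LENGTHS` table are assumptions made for the example, and only the chromosome 1 lengths of GRCh37 and GRCh38 are included.

```python
# Illustrative chromosome 1 lengths only (an assumption for this sketch);
# the real tool ships dictionaries covering many more contigs and assemblies.
CHR1_LENGTHS = {
    249250621: "GRCh37",
    248956422: "GRCh38",
}


def parse_sq_lines(header_text):
    """Return {contig name: length} from the @SQ lines of a SAM-style header."""
    contigs = {}
    for line in header_text.splitlines():
        if not line.startswith("@SQ"):
            continue
        # Every field after the record type has the form TAG:VALUE.
        fields = dict(
            field.split(":", 1) for field in line.split("\t")[1:] if ":" in field
        )
        if "SN" in fields and "LN" in fields:
            contigs[fields["SN"]] = int(fields["LN"])
    return contigs


def guess_major_release(contigs):
    """Guess the major release from the length of chromosome 1, if present."""
    for name in ("1", "chr1"):
        release = CHR1_LENGTHS.get(contigs.get(name))
        if release is not None:
            return release
    return None


header = "@HD\tVN:1.6\tSO:coordinate\n@SQ\tSN:chr1\tLN:248956422\n"
print(guess_major_release(parse_sq_lines(header)))  # prints "GRCh38"
```

A single contig is enough to separate these two releases, but telling flavors apart needs more contigs, which is why the tool relies on fuller dictionaries.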

## Description

RefgenDetector is able to infer the following reference genomes:

- NCBI35/hg17
- NCBI36.1/hg18
- GRCh37
- hg19
- b37
- hs37d5
- GRCh38
- Verily's GRCh38
- hs38DH_extra
- T2T
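Several entries in this list are flavors of the same release and differ mainly in contig naming and in extra contigs. The sketch below shows one way such a distinction could be made for the GRCh37 family, assuming the header is already known to be GRCh37-based. It relies on three widely documented conventions (hg19 uses `chr`-prefixed names, b37 uses unprefixed names, hs37d5 adds a decoy contig named `hs37d5`); the function name and returned labels are hypothetical, and RefgenDetector's own rules may differ.

```python
def describe_grch37_flavor(contig_names):
    """Describe the naming style of a header already known to be GRCh37-based."""
    names = set(contig_names)
    # hs37d5 is b37 plus a decoy contig that carries the flavor's name.
    if "hs37d5" in names:
        return "hs37d5-like (decoy contig present)"
    # UCSC-style headers prefix every contig with "chr".
    if any(name.startswith("chr") for name in names):
        return "hg19-like (chr-prefixed names)"
    return "b37-like (unprefixed names)"


print(describe_grch37_flavor(["1", "2", "MT"]))
print(describe_grch37_flavor(["chr1", "chr2", "chrM"]))
print(describe_grch37_flavor(["1", "2", "MT", "hs37d5"]))
```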

## Requirements

- Python 3.10.6

Depending on how you want to install the package:

- pip 23.1.2
- Docker version 24.0.2

## Installation

### Cloning this repository

1. Clone this repository

2. ``` $ cd PATH_WHERE_YOU_CLONED_THE_REPOSITORY/src/refgenDetector ```

3. ``$ python3 refgenDetector.py -h ``

### From PyPI

The package is published on [PyPI](https://pypi.org/project/refgenDetector/):

``$ pip install refgenDetector``

### Docker

All the instructions to run refgenDetector in a Docker container can be found [here](https://hub.docker.com/r/beacon2ri/refgendetector).

## Usage

You can get the help menu by running:

```
$ refgenDetector -h
```

```
usage: INFERRING THE REFERENCE GENOME USED TO ALIGN BAM OR CRAM FILE
       [-h] -p PATH -t {BAM/CRAM,Headers} [-m] [-a]

options:
  -h, --help            show this help message and exit
  -p PATH, --path PATH  Path to main txt. It will consist of the paths to the
                        files to be analyzed (one path per line)
  -t {BAM/CRAM,Headers}, --type {BAM/CRAM,Headers}
                        All the files in the txt provided in --path must be
                        BAM/CRAMs or headers in a txt. Choose -tdepending on
                        the type of files you are going to analyze
  -m, --md5             [OPTIONAL] If you want to obtain the md5 of the
                        contigs present in the header, add --md5 to your
                        command. This will print the md5 values if the field
                        M5 was present in your header
  -a, --assembly        [OPTIONAL] If you want to obtain the assembly declared
                        in the header add --assembly to your command. This
                        will print the assembly if the field AS was present in
                        your header

```
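The interface shown in the help text can be approximated with `argparse`. This is a sketch of an equivalent parser, not the tool's source: the `build_parser` name, the `prog` value and the shortened help strings are assumptions.

```python
import argparse


def build_parser():
    """Build a parser mirroring the options listed in the help text."""
    parser = argparse.ArgumentParser(prog="refgenDetector")
    parser.add_argument(
        "-p", "--path", required=True,
        help="main txt listing the files to analyze, one path per line",
    )
    parser.add_argument(
        "-t", "--type", required=True, choices=["BAM/CRAM", "Headers"],
        help="type of every file listed in --path",
    )
    parser.add_argument(
        "-m", "--md5", action="store_true",
        help="print the M5 values if present in the header",
    )
    parser.add_argument(
        "-a", "--assembly", action="store_true",
        help="print the AS value if present in the header",
    )
    return parser


args = build_parser().parse_args(["-p", "paths.txt", "-t", "Headers", "--md5"])
print(args.path, args.type, args.md5, args.assembly)
# prints: paths.txt Headers True False
```

Because `-t` uses `choices`, any value other than `BAM/CRAM` or `Headers` is rejected before any file is opened.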

In the main file (the `-p` argument), list the paths to all the files you want to analyze, one path per line.
RefgenDetector works with complete BAM and CRAM files and with txt files containing only the headers. The txt files
can be uncompressed or gzip compressed, and encoded as utf-8 or iso-8859-1.
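Reading a header txt that may or may not be gzip compressed, in either of the two encodings, could look like the sketch below. It is an illustration under stated assumptions rather than the tool's implementation: the `read_header_text` name is hypothetical, compression is detected from the gzip magic bytes, and utf-8 is tried before falling back to iso-8859-1.

```python
import gzip


def read_header_text(path):
    """Read a header txt that may be gzip compressed, as utf-8 or iso-8859-1."""
    with open(path, "rb") as handle:
        raw = handle.read()
    # gzip streams start with the two magic bytes 0x1f 0x8b.
    if raw[:2] == b"\x1f\x8b":
        raw = gzip.decompress(raw)
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # iso-8859-1 maps every byte to a character, so this cannot fail.
        return raw.decode("iso-8859-1")
```

Trying utf-8 first matters: iso-8859-1 accepts any byte sequence, so it would silently mis-decode valid utf-8 text if it were tried first.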

All the files listed in the main file must be of the same type: run RefgenDetector on only BAM/CRAM files or only
headers, and select the matching `-t` value.
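The `--md5` and `--assembly` options in the help text refer to the optional M5 and AS tags of the @SQ header lines. A minimal sketch of collecting them is shown here; the `optional_sq_tags` name is hypothetical and the M5 value in the example is a placeholder, not a real checksum.

```python
def optional_sq_tags(header_text):
    """Return (set of AS values, {contig name: M5}) found on the @SQ lines."""
    assemblies = set()
    md5_by_contig = {}
    for line in header_text.splitlines():
        if not line.startswith("@SQ"):
            continue
        fields = dict(
            field.split(":", 1) for field in line.split("\t")[1:] if ":" in field
        )
        if "AS" in fields:
            assemblies.add(fields["AS"])
        if "SN" in fields and "M5" in fields:
            md5_by_contig[fields["SN"]] = fields["M5"]
    return assemblies, md5_by_contig


# Placeholder M5 value, used only to show the shape of the data.
example = "@SQ\tSN:1\tLN:249250621\tM5:0123456789abcdef0123456789abcdef\tAS:b37\n"
print(optional_sq_tags(example))
```

Both tags are optional in the header, so either result can be empty; that matches the help text, which prints these values only when the fields are present.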

## Test RefgenDetector

In the folder **examples** you can find headers, BAM and CRAM files to test RefgenDetector.

*All these files belong to the [synthetic data cohort](https://ega-archive.org/synthetic-data) from the European
Genome-Phenome Archive ([EGA](https://ega-archive.org/)).*

### Test with headers in a TXT

In the folder TEST_HEADERS there are four headers obtained from synthetic BAM and CRAM files stored in the EGA. Each
one of them belongs to a different synthetic study:

- Test Study for EGA using data from 1000 Genomes Project - Phase
  3 [EGAS00001005042](https://ega-archive.org/studies/EGAS00001005042).
- Synthetic data - Genome in a Bottle - [EGAS00001005591](https://ega-archive.org/studies/EGAS00001005591).
- Human genomic and phenotypic synthetic data for the study of rare
  diseases - [EGAS00001005702](https://ega-archive.org/studies/EGAS00001005702).
- CINECA synthetic data. Please note: This study contains synthetic data (with cohort “participants” / “subjects” marked
  with FAKE) has no identifiable data and cannot be used to make any inference about cohort data or
  results - [EGAS00001002472](https://ega-archive.org/studies/EGAS00001002472).

Further information about them can be found in the file *where_to_find_this_files.txt*, saved in the same folder.

To run RefgenDetector with the files:

1. Modify the txt *path_to_headers* so the paths match those on your computer.
2. Run:

``` $ refgenDetector -p /PATH_WHERE_YOU_CLONED_THE_REPOSITORY/refgenDetector/examples/path_to_headers -t Headers```

Confirm that the installation succeeded by checking that the output matches these results:

PATH_TO_YOUR_COMPUTER_SETUP/refgenDetector_pip-master/examples/TEST_HEADERS/EGAF00001753746, b37
PATH_TO_YOUR_COMPUTER_SETUP/refgenDetector_pip-master/examples/TEST_HEADERS/EGAF00005469864, hg19
PATH_TO_YOUR_COMPUTER_SETUP/refgenDetector_pip-master/examples/TEST_HEADERS/EGAF00005572695.gz, hs37d5
PATH_TO_YOUR_COMPUTER_SETUP/refgenDetector_pip-master/examples/TEST_HEADERS/EGAF00007462306, hs38DH_extra
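Comparing the output with the expected results can be automated. The sketch below assumes the output was saved as text with one `path, assembly` pair per line, as in the listing above; the `parse_results` and `mismatches` names are hypothetical, and files are matched by name so that the local path prefix does not matter.

```python
import os

# Expected assemblies for the four test headers, keyed by file name.
EXPECTED = {
    "EGAF00001753746": "b37",
    "EGAF00005469864": "hg19",
    "EGAF00005572695.gz": "hs37d5",
    "EGAF00007462306": "hs38DH_extra",
}


def parse_results(output_text):
    """Turn 'path, assembly' lines into {file name: assembly}."""
    results = {}
    for line in output_text.splitlines():
        if "," not in line:
            continue
        # Split on the last comma so commas inside a path are preserved.
        path, assembly = line.rsplit(",", 1)
        results[os.path.basename(path.strip())] = assembly.strip()
    return results


def mismatches(output_text, expected=EXPECTED):
    """Return {file name: (expected, found)} for every result that differs."""
    found = parse_results(output_text)
    return {
        name: (want, found.get(name))
        for name, want in expected.items()
        if found.get(name) != want
    }
```

An empty dictionary from `mismatches` means every test header was assigned the expected assembly.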

### Test with BAM and CRAMs

In the folder TEST_BAM_CRAM there are a BAM and a CRAM file from the synthetic data stored in the EGA. They
belong to the synthetic study - Test Study for EGA using data from 1000 Genomes Project - Phase
3 [EGAS00001005042](https://ega-archive.org/studies/EGAS00001005042).

Further information about them can be found in the file *where_to_find_this_files.txt*, saved in the same folder.

To run RefgenDetector with the files:

1. Modify the txt *path_to_bam_cram* so the paths match those on your computer.

2. Run:
`` $ refgenDetector -p /PATH_WHERE_YOU_CLONED_THE_REPOSITORY/refgenDetector_pip-master/examples/path_to_bam_cram -t BAM/CRAM ``

Confirm that the installation succeeded by checking that the output matches these results:

PATH_TO_YOUR_COMPUTER_SETUP/refgenDetector_pip-master/examples/TEST_BAM_CRAM/HG00096.GRCh38DH__1097r__10.10000-10100__21.5000000-5050000.bam, hs38DH_extra
PATH_TO_YOUR_COMPUTER_SETUP/refgenDetector_pip-master/examples/TEST_BAM_CRAM/HG00096.GRCh38DH__1097r__10.10000-10100__21.5000000-5050000.cram, hs38DH_extra
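To see where the header of a BAM file like the one tested above lives, the following sketch reads it with the standard library only. It follows the BAM layout (a BGZF stream whose decompressed content starts with the magic `BAM\1`, a 32-bit text length and the header text) and is not how RefgenDetector itself reads files; the `read_bam_header_text` name is hypothetical. CRAM uses a different container and is not handled here.

```python
import gzip
import struct


def read_bam_header_text(path):
    """Return the plain-text SAM header stored at the start of a BAM file."""
    # BGZF is a series of gzip members, which the gzip module reads
    # transparently as one stream.
    with gzip.open(path, "rb") as handle:
        if handle.read(4) != b"BAM\x01":
            raise ValueError(f"{path} does not look like a BAM file")
        # Little-endian 32-bit length of the header text that follows.
        (text_length,) = struct.unpack("<I", handle.read(4))
        text = handle.read(text_length)
    # The text is not necessarily NUL-terminated, but may be NUL padded.
    return text.rstrip(b"\x00").decode("utf-8", errors="replace")
```

The returned text contains the same @SQ lines as a header-only txt, so it could be saved and analyzed with `-t Headers`.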

## Licence and funding

RefgenDetector is released under the GNU General Public License v3.0.

It was funded by ELIXIR, the research infrastructure for life-science data (ELIXIR Beacon Implementation Studies
2019-2021 and 2022-2023).

