- Name: mutacc
- Version: 1.7.2
- Home page: https://github.com/Clinical-Genomics/mutacc
- Summary: The mutation accumulation database
- Upload time: 2023-11-13 06:14:00
- Author: Adam Rosenbaum
- Requires Python: >=3.6.0
- License: MIT
- Requirements: Click, pysam, coloredlogs, biopython, cyvcf2, importlib_resources, mongo_adapter, ped_parser, PyYaml, pymongo

# mutacc
![Build Status - Docker](https://github.com/Clinical-Genomics/mutacc/actions/workflows/docker_build_n_publish.yml/badge.svg)
[![Coverage Status](https://coveralls.io/repos/github/Clinical-Genomics/mutacc/badge.svg?branch=master)](https://coveralls.io/github/Clinical-Genomics/mutacc?branch=master)
[![PyPI version](https://badge.fury.io/py/mutacc.svg)](https://badge.fury.io/py/mutacc)

## The mutation accumulation database

**mutacc** is a tool for creating synthetic datasets to be used for quality
control or benchmarking of bioinformatic tools and pipelines intended for
calling clinical variants. Using raw reads that support a known variant in
real NGS data, *mutacc* stores the relevant reads from each case in a database.
This database can then be queried to create synthetic datasets that can be used
as positive controls for bioinformatics pipelines.


## Running the app using Docker (No installation of any software or database required)
A demo setup for the app is included in the docker-compose file. This file is not intended for production use; it only illustrates how an image containing the application can be connected to a MongoDB instance and run mutacc commands as a container. A Docker image for mutacc can be pulled from [Docker Hub](https://hub.docker.com/repository/docker/clinicalgenomics/mutacc), or built from the Dockerfile provided in the GitHub repository. Start the docker-compose demo using this command:

```console
docker-compose up -d
```

What the docker-compose command does:

- Starts the database
- Extracts the reads from a demo case (demo case resources are located under /mutacc/resources)
- Saves them to the database
- Exports them from the database to a local file

When the above command is executed, it creates four directories in the working directory: `reads`, `imports`, `queries` and `variants`. The `variants` directory contains the vcf with the variants of interest for this demo case.

After running the test, don't forget to run `docker-compose down` to stop and remove the containers and networks created by the demo; adding `--volumes` and `--rmi all` also removes the volumes and images.


## Installation
### Conda
Installing mutacc and its external prerequisites is easiest inside a conda
environment. Create one with

```console
conda create -n <env_name> python=3.8 pip numpy cython
```

and activate it with

```console
source activate <env_name>
```
### External Prerequisites
mutacc makes use of two external tools, [seqkit](https://github.com/shenwei356/seqkit) (>= v0.9)
and [picard](https://github.com/broadinstitute/picard) (>= v2.18). These can be
installed within a conda environment with

```console
conda install -c bioconda picard
conda install -c bioconda seqkit
```

### Install mutacc
Within the conda environment, install mutacc from PyPI

```console
pip install mutacc
```

or install it directly from the GitHub repository

```console
pip install git+https://github.com/Clinical-Genomics/mutacc
```

## Usage

### Configuration File

Some options are best passed to mutacc through a configuration file. Below is an
example of a config file, using the YAML format.

```yaml
#EXAMPLE OF A CONFIGURATION FILE
host: <host>                  #Defaults to 'localhost'
port: <port>                  #Defaults to 27017
database: <database_name>     #Defaults to 'mutacc'
username: <username>          
password: <password>          
root_dir: <path_to_root>  
```

The 'root_dir' entry specifies an existing directory in the file system, where
all files generated by mutacc are stored in corresponding subdirectories.
For example, all generated fastq files are stored in /.../root_dir/reads/.
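
The defaults and the directory layout described above can be summarised in a few lines of Python. This is only an illustrative sketch: `resolve_config` and `expected_subdirs` are hypothetical helpers, not part of mutacc, and the subdirectory list contains only the names mentioned in this README.

```python
from pathlib import Path

# Defaults stated in the example configuration above.
DEFAULTS = {"host": "localhost", "port": 27017, "database": "mutacc"}

# Subdirectory names mentioned in this README; mutacc may use others as well.
SUBDIRS = ("reads", "imports", "queries", "variants", "datasets")


def resolve_config(user_config):
    """Fill in the documented defaults for missing or empty entries (hypothetical helper)."""
    provided = {key: value for key, value in user_config.items() if value is not None}
    return {**DEFAULTS, **provided}


def expected_subdirs(root_dir):
    """Map each documented subdirectory name to its path under root_dir."""
    root = Path(root_dir)
    return {name: root / name for name in SUBDIRS}


# Example: only root_dir is set, so host, port and database fall back to defaults.
config = resolve_config({"host": None, "root_dir": "/data/mutacc_root"})
print(config["host"], config["port"], config["database"])
print(expected_subdirs(config["root_dir"])["reads"])
```
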


### Populate the mutacc database

Before data sets can be exported from the mutacc DB, the database must be
populated. To extract the raw reads supporting a known variant, mutacc uses
files generated by an NGS experiment up to and including the variant calling
itself: the bam file, and a vcf file containing only the variants of interest.

This information is specified as a 'case', represented in YAML format

```yaml
#EXAMPLE OF A CASE

#THE CASE FIELD CONTAINS METADATA OF THE CASE ITSELF
case:
    case_id: 'case123' #REQUIRED CASE_ID

#LIST OF THE SAMPLES INVOLVED IN THE EXPERIMENT (MAY BE ONE, OR SEVERAL, E.G.
#A TRIO)
samples:
  - sample_id: 'sample1' #REQUIRED
    analysis_type: 'wgs' #REQUIRED
    sex: 'male'          #REQUIRED
    mother: 'sample2'    #REQUIRED (CAN BE 0 if no mother)
    father: 'sample3'    #REQUIRED (CAN BE 0 if no father)
    bam_file: /path/to/sorted_bam #REQUIRED
    phenotype: 'affected'

  - sample_id: 'sample2'
    analysis_type: 'wgs'
    sex: 'female'        
    mother: '0' #0 if no parent            
    father: '0'         
    bam_file: /path/to/sorted_bam
    phenotype: 'unaffected'

  - sample_id: 'sample3'
    analysis_type: 'wgs'
    sex: 'male'         
    mother: '0'             
    father: '0'            
    bam_file: /path/to/sorted_bam
    phenotype: 'affected'

#PATH TO VCF FILE CONTAINING VARIANTS OF INTEREST FROM CASE
variants: /path/to/vcf
```

With this case file, mutacc finds the reads in the bam file specified for each
sample. To have the reads extracted from the fastq files instead, specify the
fastq files for the sample as follows

```yaml
  - sample_id: 'sample1'
    analysis_type: 'wgs'
    sex: 'male'          
    mother: 'sample2'    
    father: 'sample3'    
    bam_file: /path/to/sorted_bam
    fastq_files:
      - /path/to/fastq1
      - /path/to/fastq2
    phenotype: 'affected'
```
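
Before running the extraction, a case description like the ones above can be sanity-checked. The sketch below is not part of mutacc; `check_case` is a hypothetical helper that works on a plain dictionary (for instance the result of `yaml.safe_load` on a case file) and only checks the fields marked as required above, unique sample ids, and parent references. Treating the `variants` entry as mandatory is an assumption based on the description of the inputs.

```python
REQUIRED_SAMPLE_FIELDS = ("sample_id", "analysis_type", "sex", "mother", "father", "bam_file")


def check_case(case):
    """Return a list of problems found in a case dictionary (hypothetical helper)."""
    problems = []
    if not case.get("case", {}).get("case_id"):
        problems.append("missing case.case_id")
    if not case.get("variants"):  # assumption: the vcf path is mandatory
        problems.append("missing variants (path to vcf)")

    samples = case.get("samples", [])
    sample_ids = [sample.get("sample_id") for sample in samples]

    # Every sample id may appear only once.
    seen = set()
    for sample_id in sample_ids:
        if sample_id in seen:
            problems.append(f"{sample_id}: duplicate sample_id")
        seen.add(sample_id)

    for sample in samples:
        name = sample.get("sample_id", "<unnamed>")
        for field in REQUIRED_SAMPLE_FIELDS:
            if field not in sample:
                problems.append(f"{name}: missing {field}")
        # Parents must be '0' (no parent) or another sample of the same case.
        for parent in ("mother", "father"):
            parent_id = str(sample.get(parent, "0"))
            if parent_id != "0" and parent_id not in sample_ids:
                problems.append(f"{name}: {parent} '{parent_id}' is not a sample in this case")
    return problems


def sample(sample_id, sex, mother="0", father="0"):
    """Build a minimal sample entry with placeholder paths."""
    return {"sample_id": sample_id, "analysis_type": "wgs", "sex": sex,
            "mother": mother, "father": father, "bam_file": "/path/to/sorted_bam"}


# A trio where the father was accidentally given the mother's id.
case = {
    "case": {"case_id": "case123"},
    "samples": [
        sample("sample1", "male", mother="sample2", father="sample3"),
        sample("sample2", "female"),
        sample("sample2", "male"),
    ],
    "variants": "/path/to/vcf",
}
for problem in check_case(case):
    print(problem)
```

For this deliberately broken trio the check reports the duplicated `sample2` id and the father reference `sample3` that no longer points to any sample.
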
To extract the reads from the case

```console
mutacc --config-file <config_file> extract --padding 600 --case <case_file>
```
The --padding option takes the number of base pairs by which the desired region
is padded.
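
As a worked example of what padding means, the sketch below widens a region by the given number of base pairs on both sides. It is only an illustration of the arithmetic: `padded_region` is a hypothetical function, and the clamping at position 1 is an assumption. How mutacc itself handles coordinates and chromosome ends is not described here.

```python
def padded_region(start, end, padding):
    """Widen [start, end] by `padding` bases on each side, never going below 1."""
    return max(1, start - padding), end + padding


# A region covering positions 10000-10050, padded with 600 bp on each side.
print(padded_region(10_000, 10_050, 600))  # (9400, 10650)

# Close to the start of a chromosome the left edge is clamped (assumption).
print(padded_region(250, 300, 600))  # (1, 900)
```
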

This will create a file <case_id>.json in the /.../root_dir/imports directory.

To import the case into the database

```console
mutacc db import /.../root_dir/imports/<case_id>.json
```

The db command is used for every operation that mutacc performs against the
database.

It tries to establish a connection to a MongoDB instance, by default running on
'localhost' on port 27017. A different host and port can be specified with the
--host and --port options.



```console
mutacc db -h <host> -p <port> import <case_id>.json
```
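
When connecting to a non-default host or port, it can help to first check that something is listening there. The following sketch uses only the Python standard library; `mongo_port_open` is a hypothetical helper, and a successful TCP connection does not prove that the service is MongoDB or that mutacc can authenticate against it.

```python
import socket


def mongo_port_open(host="localhost", port=27017, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds (hypothetical helper)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unresolvable host names.
        return False


if mongo_port_open():
    print("Something is listening on localhost:27017")
else:
    print("No service reachable on localhost:27017")
```
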

If authentication is required, the credentials can be given with the --username
and --password options, or in a configuration file, e.g.
```yaml
host: <host>
port: <port>
username: <username>
password: <password>
```

```console
mutacc --config-file <config.yaml> db import <case_id>.json
```


### Export datasets from the database
The datasets are exported one sample at a time. To export a synthetic
dataset, the export command is used together with the following options.
```
Usage: mutacc db export [OPTIONS]

  exports dataset from DB

Options:
  -c, --case-mongo TEXT           mongodb query language json-string to query
                                  for cases in database
  -v, --variant-mongo TEXT        mongodb query language json-string to query
                                  for variants in database
  -t, --variant-type TEXT         Type of variant
  -a, --analysis [wes|wgs]        Type of analysis
  --all-variants                  Export all variants in database
  -m, --member [father|mother|child|affected]
                                  Type of sample
  -s, --sex [male|female]         Sex of sample
  --vcf-dir PATH                  Directory where vcf is created. Defaults to
                                  mutacc-root/variants
  -p, --proband                   Variants from all affected samples,
                                  regardless of pedigree
  -n, --sample-name TEXT          Name of sample
  -j, --json-out                  Print results to stdout as json-string
  --help                          Show this message and exit.
```

Example:

```console
mutacc --config-file <config.yaml> db export -m affected --all-variants
```
This finds all matching cases in the mutacc DB and stores the information in a
query file, /.../root_dir/queries/<sample_name>_query_mutacc.json.
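
The `--case-mongo` and `--variant-mongo` options take a MongoDB query as a JSON string, which is easy to get wrong when quoting it by hand in a shell. The sketch below assembles such a command in Python using only the options listed in the usage text above. `export_command` is a hypothetical helper, and the field name `case_id` in the query is an assumption about how cases are stored in the database.

```python
import json
import shlex


def export_command(config_file, case_query=None, member=None, sample_name=None):
    """Build the argument list for `mutacc db export` (hypothetical helper)."""
    argv = ["mutacc", "--config-file", config_file, "db", "export"]
    if case_query is not None:
        # json.dumps produces the JSON string expected by --case-mongo.
        argv += ["--case-mongo", json.dumps(case_query)]
    if member is not None:
        argv += ["--member", member]
    if sample_name is not None:
        argv += ["--sample-name", sample_name]
    return argv


argv = export_command(
    "config.yaml",
    case_query={"case_id": "case123"},  # assumed field name
    member="affected",
    sample_name="demo",
)

# Shell-quoted form, e.g. for pasting into a terminal.
print(" ".join(shlex.quote(arg) for arg in argv))
```

Passing the list to `subprocess.run(argv, check=True)` avoids shell quoting altogether.
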

To export an entire trio, run

```console
mutacc --config-file <config_file> db export -m child --all-variants -p -n child
mutacc --config-file <config_file> db export -m father --all-variants -n father
mutacc --config-file <config_file> db export -m mother --all-variants -n mother
```
This will create three files: child_query_mutacc.json, father_query_mutacc.json, and
mother_query_mutacc.json.

The export subcommand also generates a truth set vcf file for each exported
sample, containing all queried variants.

To make a dataset (fastq files) from a query file, the synthesize command is used
with the following options:

- `-b/--background-bam`: Path to the bam file for the sample to be used as background
- `-f/--background-fastq`: Path to the fastq file for the sample to be used as background
- `-f2/--background-fastq2`: Path to the second fastq file (for paired-end experiments)
- `-q/--query`: Path to the query json file created with the export command
- `--dataset-dir`: Directory where the fastq files will be stored. Defaults to /.../root_dir/datasets


Example, using the query files created above:

```console
mutacc --config-file <config_file> synthesize -b <bam> -f <fastq1_child> -f2 <fastq2_child> -q child_query_mutacc.json
mutacc --config-file <config_file> synthesize -b <bam> -f <fastq1_father> -f2 <fastq2_father> -q father_query_mutacc.json
mutacc --config-file <config_file> synthesize -b <bam> -f <fastq1_mother> -f2 <fastq2_mother> -q mother_query_mutacc.json
```

The created fastq files will be stored in the directory /.../root_dir/datasets/,
or in the directory specified by --dataset-dir.
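
For a trio, the three synthesize calls differ only in the member name, so they can be generated in a loop. The sketch below builds the argument lists using the options documented above. `synthesize_command` is a hypothetical helper, and all paths and file name patterns are placeholders chosen for illustration.

```python
from pathlib import Path


def synthesize_command(config_file, background_bam, fastq1, fastq2, query_file, dataset_dir=None):
    """Build the argument list for `mutacc synthesize` (hypothetical helper)."""
    argv = [
        "mutacc", "--config-file", str(config_file), "synthesize",
        "--background-bam", str(background_bam),
        "--background-fastq", str(fastq1),
        "--background-fastq2", str(fastq2),
        "--query", str(query_file),
    ]
    if dataset_dir is not None:
        argv += ["--dataset-dir", str(dataset_dir)]
    return argv


# Placeholder locations; adjust to the actual background and query files.
background = Path("/data/background")
queries = Path("/data/mutacc_root/queries")

for member in ("child", "father", "mother"):
    argv = synthesize_command(
        "config.yaml",
        background / f"{member}.bam",
        background / f"{member}_1.fastq.gz",
        background / f"{member}_2.fastq.gz",
        queries / f"{member}_query_mutacc.json",
    )
    print(" ".join(argv))
    # To actually run the command: subprocess.run(argv, check=True)
```
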

### Remove case from database

To remove a case from the mutacc DB, and to delete from disk all bam and fastq
files generated from that case, use the remove command

```console
mutacc --config-file <config.yaml> db remove <case_id>
```

            
