ensembl-genomio


| Field | Value |
| - | - |
| Name | ensembl-genomio |
| Version | 1.6.0 |
| Summary | Ensembl GenomIO -- pipelines to convert basic genomic data into Ensembl cores and back to flatfile |
| Upload time | 2024-12-19 15:09:17 |
| Requires Python | >=3.10 |
| License | Apache License 2.0 |
| Keywords | genome_io, ensembl, bioinformatics, annotation, setup |

# Ensembl GenomIO

[![License](https://img.shields.io/badge/license-Apache_2.0-blue.svg)](https://github.com/Ensembl/ensembl-genomio/blob/main/LICENSE)
[![Coverage](https://vectorbase.gitdocs.ebi.ac.uk/ensembl-genomio/coverage-badge.svg)](https://vectorbase.gitdocs.ebi.ac.uk/ensembl-genomio/)
[![CI](https://img.shields.io/github/checks-status/Ensembl/ensembl-genomio/main?label=CI)](https://gitlab.ebi.ac.uk/vectorbase/ensembl-genomio/-/pipelines)
[![Release](https://img.shields.io/pypi/v/ensembl-genomio)](https://pypi.org/project/ensembl-genomio)

Pipelines to turn basic genomic data into Ensembl cores and back.

This is a multilanguage (Perl, Python) repo providing eHive pipelines and various scripts (see below) to prepare genomic data and load it as an [Ensembl core database](http://www.ensembl.org/info/docs/api/core/index.html), or to dump such core databases as file bundles.

Bundles themselves consist of genomic data in various formats (e.g. fasta, gff3, json) and should follow the corresponding [specification](https://github.com/Ensembl/ensembl-genomio/blob/main/docs/BRC4_genome_loader.md#input-data).


## Installation and configuration

This package is publicly available on [PyPI](https://pypi.org), so it can be installed with your preferred Python package manager, e.g.:

```bash
pip install ensembl-genomio
```

### Prerequisites

Pipelines are intended to be run inside the Ensembl production environment. Please make sure you have all the proper credentials, keys, etc. set up.

### Get repo and install

Clone:
```
git clone git@github.com:Ensembl/ensembl-genomio.git
```

Install the Python part (of the pipelines) and test it:
```
pip install ./ensembl-genomio
# And test it has been installed correctly
python -c 'import ensembl.io.genomio'
```
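
The import check above can also be done from Python. The following is a minimal sketch that only uses the standard library; the helper name `check_install` is an assumption made for illustration and is not part of this package.

```python
import importlib.metadata
import importlib.util


def check_install(module_name: str, dist_name: str) -> dict:
    """Report whether a module is importable and which distribution version is installed.

    Hypothetical helper, not provided by ensembl-genomio.
    """
    try:
        # find_spec raises ModuleNotFoundError when a parent package is missing
        importable = importlib.util.find_spec(module_name) is not None
    except (ImportError, ValueError):
        importable = False
    try:
        version = importlib.metadata.version(dist_name)
    except importlib.metadata.PackageNotFoundError:
        version = None
    return {"module": module_name, "importable": importable, "version": version}


if __name__ == "__main__":
    # Module and distribution names as used in this README
    print(check_install("ensembl.io.genomio", "ensembl-genomio"))
```

If `importable` is `False` or `version` is `None`, the package is probably not installed in the active Python environment.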

Update your Perl environment variables (if you need to):
```
export PERL5LIB=$(pwd)/ensembl-genomio/src/perl:$PERL5LIB
export PATH=$(pwd)/ensembl-genomio/scripts:$PATH
```
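
If the Perl tools are launched from a Python script rather than an interactive shell, the same two variables can be prepared as an environment dictionary. This is a sketch under that assumption; `prepend_path` and `genomio_env` are hypothetical names, and only the `src/perl` and `scripts` subdirectories come from the commands above.

```python
import os
from pathlib import Path


def prepend_path(env: dict, var: str, new_dir: Path) -> None:
    """Prepend new_dir to a path-list variable, without a dangling separator."""
    current = env.get(var, "")
    env[var] = f"{new_dir}{os.pathsep}{current}" if current else str(new_dir)


def genomio_env(repo_dir: Path, base_env=None) -> dict:
    """Return a copy of the environment with PERL5LIB and PATH extended.

    Hypothetical helper mirroring the two `export` lines above.
    """
    env = dict(os.environ if base_env is None else base_env)
    prepend_path(env, "PERL5LIB", repo_dir / "src" / "perl")
    prepend_path(env, "PATH", repo_dir / "scripts")
    return env


if __name__ == "__main__":
    env = genomio_env(Path.cwd() / "ensembl-genomio")
    print(env["PERL5LIB"])
```

The resulting dictionary can be passed as the `env` argument of `subprocess.run`. Unlike the shell version, it does not leave a trailing separator when the variable was previously unset.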

### Optional installation

If you need an "editable" install of the Python package, use the `-e` option:
```
pip install -e ./ensembl-genomio
```

To install additional dependencies (e.g. `[docs]` or `[cicd]`), append the corresponding `[<tag>]` string to the package path, e.g.:
```
# Quoted so that shells such as zsh do not treat the brackets as a glob pattern
pip install -e "./ensembl-genomio[cicd]"
```

For the list of tags see `[project.optional-dependencies]` in [pyproject.toml](https://github.com/Ensembl/ensembl-genomio/blob/main/pyproject.toml).

### Additional steps to use automated generation of the documentation

- Change into the repo directory
- Install the Python part with the `[docs]` tag
- Run the `mkdocs build` command

```
git clone git@github.com:Ensembl/ensembl-genomio.git
cd ./ensembl-genomio
pip install -e ".[docs]"
mkdocs build
```

### Nextflow installation

Please refer to the "Installation" section of the [Nextflow pipelines document](https://github.com/Ensembl/ensembl-genomio/blob/main/docs/nextflow.md#installation).

## Pipelines

### Initialising and running eHive-based pipelines

Pipelines are derived from [`Bio::EnsEMBL::Hive::PipeConfig::HiveGeneric_conf`](https://github.com/Ensembl/ensembl-hive/blob/version/2.6/modules/Bio/EnsEMBL/Hive/PipeConfig/HiveGeneric_conf.pm),
or from [`Bio::EnsEMBL::Hive::PipeConfig::EnsemblGeneric_conf`](https://github.com/Ensembl/ensembl-hive/blob/version/2.6/modules/Bio/EnsEMBL/Hive/PipeConfig/EnsemblGeneric_conf.pm),
or from [`Bio::EnsEMBL::EGPipeline::PipeConfig::EGGeneric_conf`](https://github.com/Ensembl/ensembl-production-imported/blob/main/src/perl/Bio/EnsEMBL/EGPipeline/PipeConfig/EGGeneric_conf.pm) (see [documentation](https://github.com/Ensembl/ensembl-production-imported/blob/main/docs/EGGeneric.md)).

The same Perl class prefix is used for every pipeline:
  `Bio::EnsEMBL::Pipeline::PipeConfig::` .

N.B. Don't forget to specify the `-reg_file` option in the `beekeeper.pl -url $url -reg_file $REG_FILE -loop` command.

```
init_pipeline.pl Bio::EnsEMBL::Pipeline::PipeConfig::BRC4_genome_loader_conf \
    $($CMD details script) \
    -hive_force_init 1 \
    -queue_name $SPECIFIC_QUEUE_NAME \
    -registry $REG_FILE \
    -pipeline_tag "_${PIPELINE_RUN_TAG}" \
    -ensembl_root_dir ${ENSEMBL_ROOT_DIR} \
    -dbsrv_url $($CMD details url) \
    -proddb_url "$($PROD_SERVER details url)""$PROD_DBNAME" \
    -taxonomy_url "$($PROD_SERVER details url)""$TAXONOMY_DBNAME" \
    -release ${RELEASE_VERSION} \
    -data_dir ${INPUT_DIR}/manifests_dir/ \
    -pipeline_dir $OUT_DIR/loader_run \
    ${OTHER_OPTIONS} \
    2> $OUT_DIR/init.stderr \
    1> $OUT_DIR/init.stdout

SYNC_CMD=$(cat $OUT_DIR/init.stdout | grep -- -sync'$' | perl -pe 's/^\s*//; s/"//g')
# should get something like
#   beekeeper.pl -url $url -sync

LOOP_CMD=$(cat $OUT_DIR/init.stdout | grep -- -loop | perl -pe 's/^\s*//; s/\s*#.*$//; s/"//g')
# should get something like
#   beekeeper.pl -url $url -reg_file $REG_FILE -loop

$SYNC_CMD 2> $OUT_DIR/sync.stderr 1> $OUT_DIR/sync.stdout
$LOOP_CMD 2> $OUT_DIR/loop.stderr 1> $OUT_DIR/loop.stdout
```
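
The extraction of the sync and loop commands above (the `grep` and `perl -pe` steps) can be sketched in Python as follows. The sample `init.stdout` content, the database URL and the function names are hypothetical, used here only to illustrate the text processing; real `init_pipeline.pl` output will differ in detail.

```python
import re

# Hypothetical init_pipeline.pl output; real output will differ in detail
SAMPLE_INIT_STDOUT = '''
    beekeeper.pl -url "mysql://user:pass@host:3306/loader_db" -sync
    beekeeper.pl -url "mysql://user:pass@host:3306/loader_db" -reg_file "/path/registry.pm" -loop  # run until done
'''


def _clean(line: str, strip_comment: bool) -> str:
    """Drop leading whitespace, optionally a trailing '# comment', and all double quotes."""
    line = line.lstrip()
    if strip_comment:
        line = re.sub(r"\s*#.*$", "", line)
    return line.replace('"', "")


def extract_beekeeper_commands(init_stdout: str):
    """Return (sync_cmd, loop_cmd), taking the first matching line of each kind."""
    sync_cmd = None
    loop_cmd = None
    for line in init_stdout.splitlines():
        # Same test as: grep -- -sync'$'
        if sync_cmd is None and line.endswith("-sync"):
            sync_cmd = _clean(line, strip_comment=False)
        # Same test as: grep -- -loop
        if loop_cmd is None and "-loop" in line:
            loop_cmd = _clean(line, strip_comment=True)
    return sync_cmd, loop_cmd


if __name__ == "__main__":
    sync_cmd, loop_cmd = extract_beekeeper_commands(SAMPLE_INIT_STDOUT)
    print(sync_cmd)
    print(loop_cmd)
```

Each returned string can then be split with `shlex.split` and passed to `subprocess.run`, redirecting stdout and stderr to files as in the shell version.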

### List of the pipelines

| Pipeline name | Description | Document | Comment | Module |
| - | - | - | - | - |
| BRC4_genome_loader | Creates an [Ensembl core database](http://www.ensembl.org/info/docs/api/core/index.html) from a set of flat files or adds ad-hoc sequences (i.e. organelles) to the existing core | [BRC4_genome_loader](https://github.com/Ensembl/ensembl-genomio/blob/main/docs/BRC4_genome_loader.md) | | [Bio::EnsEMBL::Pipeline::PipeConfig::BRC4_genome_loader_conf](https://github.com/Ensembl/ensembl-genomio/blob/main/src/perl/Bio/EnsEMBL/Pipeline/PipeConfig/BRC4_genome_loader_conf.pm) |
| BRC4_genome_dumper | | | | |
| | | | | |
| BRC4_genome_prepare | | | | |
| BRC4_addition_prepare | | | | |
| BRC4_genome_compare | | | | |
| | | | | |
| LoadGFF3 | | | | |
| LoadGFF3Batch | | | | |


### Scripts

* [trf_split_run.bash](https://github.com/Ensembl/ensembl-genomio/blob/main/scripts/trf_split_run.bash) -- a trf wrapper with chunking support to be used with [ensembl-production-imported DNAFeatures pipeline](https://github.com/Ensembl/ensembl-production-imported/tree/main/src/perl/Bio/EnsEMBL/EGPipeline/PipeConfig/DNAFeatures_conf.pm) (see [docs](https://github.com/Ensembl/ensembl-genomio/blob/main/docs/trf_split_run.md))

## CI/CD bits

For now, some [GitLab CI](https://docs.gitlab.com/ee/ci/) pipelines have been introduced to keep things in shape, though this part is under constant development. Some documentation can be found in the [docs for GitLab CI/CD](https://github.com/Ensembl/ensembl-genomio/blob/main/docs/cicd_gitlab.md).

## Various docs

See [docs](https://github.com/Ensembl/ensembl-genomio/blob/main/docs)

## Unit testing

The Python part of the codebase now has unit tests available to test each module. Make sure you have installed this repository's `[cicd]` dependencies (via `pip install ensembl-genomio[cicd]`) before continuing.

Running all the tests in one go is as easy as running `pytest` **from the root of the repository**. If you also want to measure, collect and report the code coverage, you can run:
```bash
coverage run -m pytest
coverage report
```

You can also run specific tests by supplying the path to the specific test file/subfolder, e.g.:
```bash
pytest lib/python/tests/test_schema.py
```

## Acknowledgements

Some of this code and documentation is inherited from the [EnsemblGenomes](https://github.com/EnsemblGenomes) and other [Ensembl](https://github.com/Ensembl) projects. We appreciate the effort and time spent by developers of the [EnsemblGenomes](https://github.com/EnsemblGenomes) and [Ensembl](https://github.com/Ensembl) projects. 

Thank you!

            

Raw data

            {
    "_id": null,
    "home_page": null,
    "name": "ensembl-genomio",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.10",
    "maintainer_email": null,
    "keywords": "genome_io, ensembl, bioinformatics, annotation, setup",
    "author": null,
    "author_email": "Ensembl <dev@ensembl.org>",
    "download_url": "https://files.pythonhosted.org/packages/b4/fa/be32e2d03bc506869f3251c15bc4436193c3bfdb0e3366f1d7f3d8fd7847/ensembl_genomio-1.6.0.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "Apache License 2.0",
    "summary": "Ensembl GenomIO -- pipelines to convert basic genomic data into Ensembl cores and back to flatfile",
    "version": "1.6.0",
    "project_urls": {
        "documentation": "https://ensembl.github.io/ensembl-genomio",
        "homepage": "https://www.ensembl.org",
        "repository": "https://github.com/Ensembl/ensembl-genomio"
    },
    "split_keywords": [
        "genome_io",
        " ensembl",
        " bioinformatics",
        " annotation",
        " setup"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "6748fef6746f9f3fb8b773f6f216af2b2b7c4b5ad1bbdde295199b60072a3676",
                "md5": "34f457b71fd6d72e1da7b9359f0d6ef1",
                "sha256": "582af7e3c429f2382d09d2535df81561bfd10c691a1f2f7ddb353260f3727457"
            },
            "downloads": -1,
            "filename": "ensembl_genomio-1.6.0-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "34f457b71fd6d72e1da7b9359f0d6ef1",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.10",
            "size": 412150,
            "upload_time": "2024-12-19T15:09:15",
            "upload_time_iso_8601": "2024-12-19T15:09:15.910866Z",
            "url": "https://files.pythonhosted.org/packages/67/48/fef6746f9f3fb8b773f6f216af2b2b7c4b5ad1bbdde295199b60072a3676/ensembl_genomio-1.6.0-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "b4fabe32e2d03bc506869f3251c15bc4436193c3bfdb0e3366f1d7f3d8fd7847",
                "md5": "434929ec326e9b31ccca05e8b849a5a0",
                "sha256": "e3a9e8c547d5cce04ef793b746dc56b7a8c245385f449a63790e0f941025242e"
            },
            "downloads": -1,
            "filename": "ensembl_genomio-1.6.0.tar.gz",
            "has_sig": false,
            "md5_digest": "434929ec326e9b31ccca05e8b849a5a0",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.10",
            "size": 569698,
            "upload_time": "2024-12-19T15:09:17",
            "upload_time_iso_8601": "2024-12-19T15:09:17.647046Z",
            "url": "https://files.pythonhosted.org/packages/b4/fa/be32e2d03bc506869f3251c15bc4436193c3bfdb0e3366f1d7f3d8fd7847/ensembl_genomio-1.6.0.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-12-19 15:09:17",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "Ensembl",
    "github_project": "ensembl-genomio",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "lcname": "ensembl-genomio"
}
        