# Ensembl GenomIO
[![License](https://img.shields.io/badge/license-Apache_2.0-blue.svg)](https://github.com/Ensembl/ensembl-genomio/blob/main/LICENSE)
[![Coverage](https://vectorbase.gitdocs.ebi.ac.uk/ensembl-genomio/coverage-badge.svg)](https://vectorbase.gitdocs.ebi.ac.uk/ensembl-genomio/)
[![CI](https://img.shields.io/github/checks-status/Ensembl/ensembl-genomio/main?label=CI)](https://gitlab.ebi.ac.uk/vectorbase/ensembl-genomio/-/pipelines)
[![Release](https://img.shields.io/pypi/v/ensembl-genomio)](https://pypi.org/project/ensembl-genomio)
Pipelines to turn basic genomic data into Ensembl cores and back.
This is a multilanguage (Perl, Python) repo providing eHive pipelines and various scripts (see below) to prepare genomic data and load it into an [Ensembl core database](http://www.ensembl.org/info/docs/api/core/index.html) or to dump such core databases as file bundles.
Bundles themselves consist of genomic data in various formats (e.g. fasta, gff3, json) and should follow the corresponding [specification](https://github.com/Ensembl/ensembl-genomio/blob/main/docs/BRC4_genome_loader.md#input-data).
## Installation and configuration
This package is publicly available on [PyPI](https://pypi.org), so it can be installed with your preferred Python package manager, e.g.:
```bash
pip install ensembl-genomio
```
### Prerequisites
Pipelines are intended to be run inside the Ensembl production environment. Please make sure you have all the proper credentials, keys, etc. set up.
### Get repo and install
Clone:
```
git clone git@github.com:Ensembl/ensembl-genomio.git
```
Install the Python part (of the pipelines) and test it:
```
pip install ./ensembl-genomio
# And test it has been installed correctly
python -c 'import ensembl.io.genomio'
```
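The same import check can be done from Python without raising an exception on failure. This is a minimal sketch: `is_importable` is a hypothetical helper written for illustration, not part of ensembl-genomio, and it relies only on the standard `importlib.util.find_spec`.

```python
import importlib.util


def is_importable(module_name: str) -> bool:
    """Return True if `module_name` can be located in the current environment."""
    try:
        return importlib.util.find_spec(module_name) is not None
    except (ImportError, ValueError):
        # A missing parent package raises ModuleNotFoundError (an ImportError subclass)
        return False


if __name__ == "__main__":
    name = "ensembl.io.genomio"
    print(f"{name}: {'found' if is_importable(name) else 'NOT found'}")
```

A check like this can be handy in setup scripts that should report a missing installation rather than stop with a traceback.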
Update your Perl environment variables (if you need to):
```
export PERL5LIB=$(pwd)/ensembl-genomio/src/perl:$PERL5LIB
export PATH=$(pwd)/ensembl-genomio/scripts:$PATH
```
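If you launch the Perl pipelines from a Python wrapper, the same two variables can be prepared programmatically. The sketch below assumes the repo layout shown in the `export` lines above (`src/perl` and `scripts`); `prepend_path` and `genomio_env` are hypothetical helper names, not part of the package.

```python
from pathlib import PurePosixPath

SEP = ":"  # path list separator used by PERL5LIB and PATH on Unix-like systems


def prepend_path(current: str, new_entry: str) -> str:
    """Put `new_entry` first in a colon-separated list, dropping empty and duplicate entries."""
    entries = [e for e in current.split(SEP) if e and e != new_entry]
    return SEP.join([new_entry, *entries])


def genomio_env(repo_dir: PurePosixPath, env: dict[str, str]) -> dict[str, str]:
    """Return a copy of `env` with PERL5LIB and PATH extended for the cloned repo."""
    updated = dict(env)
    updated["PERL5LIB"] = prepend_path(env.get("PERL5LIB", ""), str(repo_dir / "src" / "perl"))
    updated["PATH"] = prepend_path(env.get("PATH", ""), str(repo_dir / "scripts"))
    return updated


if __name__ == "__main__":
    example = genomio_env(PurePosixPath("/work/ensembl-genomio"), {"PATH": "/usr/bin"})
    print(example["PERL5LIB"])
    print(example["PATH"])
```

The returned dictionary can be passed as the `env` argument of `subprocess.run` when calling the Perl scripts.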
### Optional installation
If you need an "editable" installation of the Python package, use the `-e` option:
```
pip install -e ./ensembl-genomio
```
To install additional dependencies (e.g. `[docs]` or `[cicd]`) provide `[<tag>]` string, e.g.:
```
pip install -e ./ensembl-genomio[cicd]
```
For the list of tags see `[project.optional-dependencies]` in [pyproject.toml](https://github.com/Ensembl/ensembl-genomio/blob/main/pyproject.toml).
### Additional steps to use automated generation of the documentation
- Install the Python part with the `[docs]` tag
- Change into the repo directory
- Run the `mkdocs build` command
```
git clone git@github.com:Ensembl/ensembl-genomio.git
cd ./ensembl-genomio
pip install -e .[docs]
mkdocs build
```
### Nextflow installation
Please refer to the "Installation" section of the [Nextflow pipelines document](https://github.com/Ensembl/ensembl-genomio/blob/main/docs/nextflow.md#installation).
## Pipelines
### Initialising and running eHive-based pipelines
Pipelines are derived from [`Bio::EnsEMBL::Hive::PipeConfig::HiveGeneric_conf`](https://github.com/Ensembl/ensembl-hive/blob/version/2.6/modules/Bio/EnsEMBL/Hive/PipeConfig/HiveGeneric_conf.pm),
or from [`Bio::EnsEMBL::Hive::PipeConfig::EnsemblGeneric_conf`](https://github.com/Ensembl/ensembl-hive/blob/version/2.6/modules/Bio/EnsEMBL/Hive/PipeConfig/EnsemblGeneric_conf.pm),
or from [`Bio::EnsEMBL::EGPipeline::PipeConfig::EGGeneric_conf`](https://github.com/Ensembl/ensembl-production-imported/blob/main/src/perl/Bio/EnsEMBL/EGPipeline/PipeConfig/EGGeneric_conf.pm) (see [documentation](https://github.com/Ensembl/ensembl-production-imported/blob/main/docs/EGGeneric.md)).
The same Perl class prefix is used for every pipeline:
`Bio::EnsEMBL::Pipeline::PipeConfig::`.
N.B. Don't forget to specify the `-reg_file` option for the `beekeeper.pl -url $url -reg_file $REG_FILE -loop` command.
```
# Initialise the pipeline; stdout is kept because it lists the beekeeper commands
init_pipeline.pl Bio::EnsEMBL::Pipeline::PipeConfig::BRC4_genome_loader_conf \
    $($CMD details script) \
    -hive_force_init 1 \
    -queue_name $SPECIFIC_QUEUE_NAME \
    -registry $REG_FILE \
    -pipeline_tag "_${PIPELINE_RUN_TAG}" \
    -ensembl_root_dir ${ENSEMBL_ROOT_DIR} \
    -dbsrv_url $($CMD details url) \
    -proddb_url "$($PROD_SERVER details url)""$PROD_DBNAME" \
    -taxonomy_url "$($PROD_SERVER details url)""$TAXONOMY_DBNAME" \
    -release ${RELEASE_VERSION} \
    -data_dir ${INPUT_DIR}/manifests_dir/ \
    -pipeline_dir $OUT_DIR/loader_run \
    ${OTHER_OPTIONS} \
    2> $OUT_DIR/init.stderr \
    1> $OUT_DIR/init.stdout

# Extract the sync command: strip leading whitespace and double quotes
SYNC_CMD=$(cat $OUT_DIR/init.stdout | grep -- -sync'$' | perl -pe 's/^\s*//; s/"//g')
# should get something like
#   beekeeper.pl -url $url -sync

# Extract the loop command: also strip the trailing "# ..." comment
LOOP_CMD=$(cat $OUT_DIR/init.stdout | grep -- -loop | perl -pe 's/^\s*//; s/\s*#.*$//; s/"//g')
# should get something like
#   beekeeper.pl -url $url -reg_file $REG_FILE -loop

# Synchronise the pipeline database, then run the pipeline
$SYNC_CMD 2> $OUT_DIR/sync.stderr 1> $OUT_DIR/sync.stdout
$LOOP_CMD 2> $OUT_DIR/loop.stderr 1> $OUT_DIR/loop.stdout
```
### List of the pipelines
| Pipeline name | Description | Document | Comment | Module |
| - | - | - | - | - |
| BRC4_genome_loader | Creates an [Ensembl core database](http://www.ensembl.org/info/docs/api/core/index.html) from a set of flat files or adds ad hoc sequences (e.g. organelles) to an existing core | [BRC4_genome_loader](https://github.com/Ensembl/ensembl-genomio/blob/main/docs/BRC4_genome_loader.md) | | [Bio::EnsEMBL::Pipeline::PipeConfig::BRC4_genome_loader_conf](https://github.com/Ensembl/ensembl-genomio/blob/main/src/perl/Bio/EnsEMBL/Pipeline/PipeConfig/BRC4_genome_loader_conf.pm) |
| BRC4_genome_dumper | | | | |
| | | | | |
| BRC4_genome_prepare | | | | |
| BRC4_addition_prepare | | | | |
| BRC4_genome_compare | | | | |
| | | | | |
| LoadGFF3 | | | | |
| LoadGFF3Batch | | | | |
### Scripts
* [trf_split_run.bash](https://github.com/Ensembl/ensembl-genomio/blob/main/scripts/trf_split_run.bash) -- a trf wrapper with chunking support to be used with [ensembl-production-imported DNAFeatures pipeline](https://github.com/Ensembl/ensembl-production-imported/tree/main/src/perl/Bio/EnsEMBL/EGPipeline/PipeConfig/DNAFeatures_conf.pm) (see [docs](https://github.com/Ensembl/ensembl-genomio/blob/main/docs/trf_split_run.md))
## CI/CD bits
For now, some [GitLab CI](https://docs.gitlab.com/ee/ci/) pipelines have been introduced to keep things in shape, though this part is under constant development. Some documentation can be found in the [docs for GitLab CI/CD](https://github.com/Ensembl/ensembl-genomio/blob/main/docs/cicd_gitlab.md).
## Various docs
See [docs](https://github.com/Ensembl/ensembl-genomio/blob/main/docs)
## Unit testing
The Python part of the codebase now has unit tests available for each module. Make sure you have installed this repository's `[cicd]` dependencies (via `pip install ensembl-genomio[cicd]`) before continuing.
Running all the tests in one go is as easy as running `pytest` **from the root of the repository**. If you also want to measure, collect and report the code coverage, you can run:
```bash
coverage run -m pytest
coverage report
```
You can also run a subset of the tests by supplying the path to a test file or subfolder, e.g.:
```bash
pytest lib/python/tests/test_schema.py
```
## Acknowledgements
Some of this code and documentation is inherited from the [EnsemblGenomes](https://github.com/EnsemblGenomes) and other [Ensembl](https://github.com/Ensembl) projects. We appreciate the effort and time spent by developers of the [EnsemblGenomes](https://github.com/EnsemblGenomes) and [Ensembl](https://github.com/Ensembl) projects.
Thank you!