genome-uploader

Name	genome-uploader
Version	2.4.0
Summary	Python script to upload bins and MAGs in fasta format to ENA (European Nucleotide Archive). This script generates xmls and manifests necessary for submission with webin-cli.
upload_time	2025-07-31 10:52:12
requires_python	>=3.10
license	Apache Software License 2.0
keywords	bioinformatics tool metagenomics

# Public bins and MAGs uploader
Python script to upload bins and MAGs in fasta format to ENA (European Nucleotide Archive). This script generates xmls and manifests necessary for submission with webin-cli.

It takes as input one tsv (tab-separated values) table in the following format:

| genome_name | genome_path | accessions | assembly_software | binning_software | binning_parameters | stats_generation_software | completeness | contamination | genome_coverage | metagenome | co-assembly | broad_environment | local_environment | environmental_medium | rRNA_presence | taxonomy_lineage |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ERR4647712_crispatus | path/to/ERR4647712.fa.gz | ERR4647712 | megahit_v1.2.9 | MGnify-genomes-generation-pipeline_v1.0.0 | default | CheckM2_v1.0.1 | 100 | 0.38 | 14.2 | chicken gut metagenome | False | chicken | gut | mucosa | True | d__Bacteria;p__Firmicutes;c__Bacilli;o__Lactobacillales;f__Lactobacillaceae;g__Lactobacillus;s__Lactobacillus crispatus |

With columns indicating:
  * _genome_name_: genome id (unique string identifier)
  * _accessions_: run(s) or assembly(ies) the genome was generated from (DRR/ERR/SRRxxxxxx for runs, DRZ/ERZ/SRZxxxxxx for assemblies). If the genome was generated by a co-assembly of multiple runs, separate them with a comma.
  * _assembly_software_: assemblerName_vX.X
  * _binning_software_: binnerName_vX.X
  * _binning_parameters_: binning parameters
  * _stats_generation_software_: software_vX.X
  * _completeness_: `float`
  * _contamination_: `float`
  * _rRNA_presence_: `True/False`. Use `True` only if the 5S, 16S, and 23S rRNA genes and at least 18 tRNA genes have all been detected in the genome
  * _taxonomy_lineage_: full NCBI lineage, either in tax ids (`integers`) or `strings`. Format: x;y;z;...
  * _metagenome_: needs to be listed in the taxonomy tree [here](<https://www.ebi.ac.uk/ena/browser/view/408169?show=tax-tree>) (you might need to press "Tax tree - Show" in the rightmost section of the page)
  * _co-assembly_: `True/False`, whether the genome was generated from a co-assembly. N.B. the script only supports co-assemblies generated from the same project.
  * _genome_coverage_: genome coverage against raw reads
  * _genome_path_: path to genome to upload (already compressed)
  * _broad_environment_: `string` (explanation following)
  * _local_environment_: `string` (explanation following)
  * _environmental_medium_: `string` (explanation following)

According to ENA checklist's guidelines, 'broad_environment' describes the broad ecological context of a sample - desert, taiga, coral reef, ... 'local_environment' is more local - lake, harbour, cliff, ... 'environmental_medium' is either the material displaced by the sample, or the one in which the sample was embedded prior to the sampling event - air, soil, water, ...
For host-associated metagenomic samples, the three variables can be defined along the lines of the following example for the chicken gut metagenome: "chicken digestive system", "digestive tube", "caecum". More information can be found at [ERC000050](<https://www.ebi.ac.uk/ena/browser/view/ERC000050>) for bins and [ERC000047](<https://www.ebi.ac.uk/ena/browser/view/ERC000047>) for MAGs, under the field names "broad-scale environmental context", "local environmental context", and "environmental medium".

Another example can be found [here](examples/input_example.tsv)
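
Before running the uploader, it can help to sanity-check the table against the column descriptions above. The following is a minimal sketch and not part of genome_uploader: the function name `check_metadata` is hypothetical, the accession pattern is inferred from the DRR/ERR/SRR and DRZ/ERZ/SRZ prefixes listed above, and treating `genome_coverage` as numeric is an assumption based on the example row.

```python
import csv
import io
import re

# Column names copied from the table header above.
EXPECTED_COLUMNS = [
    "genome_name", "genome_path", "accessions", "assembly_software",
    "binning_software", "binning_parameters", "stats_generation_software",
    "completeness", "contamination", "genome_coverage", "metagenome",
    "co-assembly", "broad_environment", "local_environment",
    "environmental_medium", "rRNA_presence", "taxonomy_lineage",
]

# Run (DRR/ERR/SRR) or assembly (DRZ/ERZ/SRZ) prefix followed by digits.
ACCESSION_RE = re.compile(r"^[DES]R[RZ]\d+$")


def check_metadata(handle):
    """Return a list of human-readable problems found in a metadata TSV."""
    reader = csv.DictReader(handle, delimiter="\t")
    missing = [c for c in EXPECTED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        return [f"missing columns: {', '.join(missing)}"]
    problems = []
    seen = set()
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        name = row["genome_name"]
        if name in seen:
            problems.append(f"line {line_no}: duplicate genome_name {name!r}")
        seen.add(name)
        # Co-assemblies list several accessions separated by commas.
        for acc in (row["accessions"] or "").split(","):
            if not ACCESSION_RE.match(acc.strip()):
                problems.append(f"line {line_no}: unexpected accession {acc!r}")
        for col in ("completeness", "contamination", "genome_coverage"):
            try:
                float(row[col] or "")
            except ValueError:
                problems.append(f"line {line_no}: {col} is not a number")
        for col in ("co-assembly", "rRNA_presence"):
            if row[col] not in ("True", "False"):
                problems.append(f"line {line_no}: {col} must be True or False")
    return problems


# Usage with an in-memory table; a real run would pass open("genomes.tsv").
example_row = ["ERR4647712_crispatus", "path/to/ERR4647712.fa.gz", "ERR4647712",
               "megahit_v1.2.9", "MGnify-genomes-generation-pipeline_v1.0.0",
               "default", "CheckM2_v1.0.1", "100", "0.38", "14.2",
               "chicken gut metagenome", "False", "chicken", "gut", "mucosa",
               "True", "d__Bacteria"]
table = "\t".join(EXPECTED_COLUMNS) + "\n" + "\t".join(example_row) + "\n"
print(check_metadata(io.StringIO(table)))
```

The check only covers format; it does not verify that accessions exist on the INSDC or that the metagenome name is in ENA's taxonomy tree.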

### Warnings

Raw-read runs from which genomes were generated should already be available on the INSDC (ENA by EBI, GenBank by NCBI, or DDBJ), hence at least one DRR|ERR|SRR accession should be available for every genome to be uploaded. Assembly accessions (ERZ|SRZ|DRZ) are also supported.

If uploading TPA (Third PArty) genomes, you will need to contact [ENA support](<https://www.ebi.ac.uk/ena/browser/support>) before using the script. They will provide instructions on how to correctly register a TPA project to which you can submit your genomes. If both TPA and non-TPA genomes need to be uploaded, please divide them into two batches and use the `--tpa` flag only with the TPA genomes.

Files to be uploaded will need to be compressed (e.g. already in .gz format).

No more than 5000 genomes can be submitted at the same time.
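
Two of the warnings above (compressed files, at most 5000 genomes per submission) are easy to check up front. This is a rough sketch under stated assumptions: the helper names `is_gzipped` and `batches` are hypothetical, and the compression check only recognises gzip, which is the one format the warning mentions.

```python
from itertools import islice

MAX_GENOMES_PER_SUBMISSION = 5000  # limit stated in the warnings above
GZIP_MAGIC = b"\x1f\x8b"           # first two bytes of any gzip stream


def is_gzipped(path):
    """Return True if the file starts with the gzip magic number."""
    with open(path, "rb") as handle:
        return handle.read(2) == GZIP_MAGIC


def batches(rows, size=MAX_GENOMES_PER_SUBMISSION):
    """Yield successive lists of at most `size` rows."""
    iterator = iter(rows)
    while chunk := list(islice(iterator, size)):
        yield chunk


# Usage: 12001 metadata rows would need three separate submissions.
print([len(chunk) for chunk in batches(range(12001))])
```

Each batch would then be written to its own tsv table and submitted in a separate run.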


## Installation and setup

You can install **genome_uploader** with:

```bash
pip install genome_uploader
```

Next, download webin-cli, which is needed for the upload to **ENA**, with:

```bash
download_webin_cli -v 8.2.0
```

## Setting ENA Credentials

This tool requires your ENA Webin credentials to function. You can provide these by setting environment variables or using an environment file.

### Using an environment file

Create a file named `.env` in your home directory (`~/.env`), your current working directory (`./.env`), or specify a custom file (default is `.env`).

Add the following lines with your credentials:

```env
ENA_WEBIN=your_username_here
ENA_WEBIN_PASSWORD=your_password_here
```

### Alternatively, set the environment variables directly in your shell

```bash
export ENA_WEBIN=your_username_here
export ENA_WEBIN_PASSWORD=your_password_here
```

## Register samples and generate pre-upload files

You can generate pre-upload files with:

```bash
genome_upload -u UPLOAD_STUDY --genome_info METADATA_FILE (--mags | --bins) --centre_name CENTRE_NAME [--out] [--force] [--live] [--tpa] [--private]
```

where
  * `-u UPLOAD_STUDY`: study accession for genomes upload to ENA (in format ERPxxxxxx or PRJEBxxxxxx)
  * `--genome_info METADATA_FILE`: genomes metadata file in tsv format
  * `-m, --mags, -b, --bins`: select for bin or MAG upload. If in doubt, look at [their definition according to ENA](<https://ena-docs.readthedocs.io/en/latest/submit/assembly/metagenome.html>)
  * `--out`: output folder (default: working directory)
  * `--force`: forces regeneration of sample xmls
  * `--live`: registers genomes on ENA's live server. Omitting this option lets you validate samples in test mode beforehand (the upload command will then need the `-test` option for the test submission to work)
  * `--centre_name CENTRE_NAME`: name of the centre generating and uploading genomes
  * `--tpa`: if uploading TPA (Third PArty) generated genomes
  * `--private`: if data is private

It is recommended to validate your genomes in test mode (i.e. without `--live` in the registration step and with `-test` during the upload) before attempting the final upload. Launching the registration in test mode will add a timestamp to the genome name to allow multiple executions of the test process.

Sample xmls won't be regenerated automatically if a previous xml already exists. If any metadata or value in the tsv table changes, `--force` will allow xml regeneration.
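
When the registration step is driven from a pipeline, the command line can be assembled programmatically. The sketch below only uses the options listed above; the function name `build_genome_upload_command` and the placeholder study, file, and centre names are hypothetical.

```python
import shlex


def build_genome_upload_command(study, metadata, genome_type, centre_name,
                                out=None, live=False, force=False,
                                tpa=False, private=False):
    """Assemble the argument list for the `genome_upload` command."""
    if genome_type not in ("mags", "bins"):
        raise ValueError("genome_type must be 'mags' or 'bins'")
    command = ["genome_upload", "-u", study, "--genome_info", metadata,
               f"--{genome_type}", "--centre_name", centre_name]
    if out:
        command += ["--out", out]
    # Boolean switches are appended only when requested.
    for flag, enabled in (("--force", force), ("--live", live),
                          ("--tpa", tpa), ("--private", private)):
        if enabled:
            command.append(flag)
    return command


# Test-mode registration first (no --live), as recommended above.
test_run = build_genome_upload_command(
    "PRJEB00000", "genomes.tsv", "mags", "MY_CENTRE", out="upload_out")
print(shlex.join(test_run))
```

The returned list can be passed to `subprocess.run(command, check=True)`. Once validation in test mode succeeds, the same call with `live=True` gives the final registration command.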

### Produced files:
The script produces the following files and folders:
```bash
bin_upload/MAG_upload
├── manifests
│    └── ...
├── manifests_test                  # folder generated for validation in test mode
│    └── ...
├── ENA_backup.json                 # backup file to prevent re-download of metadata from ENA. Regeneration can be forced with --force
├── genome_samples.xml              # xml generated to register samples on ENA before the upload
├── registered_bins/MAGs.tsv        # list of genomes registered on ENA in live mode - needed for manifest generation
├── registered_bins/MAGs_test.tsv   # list of genomes registered on ENA in test mode - needed for manifest generation
└── submission.xml                  # xml used for genome registration on ENA
```
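
To collect the generated manifests for the upload step, something like the following may be enough. It is a sketch under several assumptions: the output folder is named `MAG_upload` or `bin_upload` (as read from the tree above), manifest files end in `.manifest` (as in the webin-cli example in the next section), and the contents of the manifest folders, which the tree elides, may be nested. `manifest_files` is a hypothetical helper.

```python
from pathlib import Path


def manifest_files(output_dir, genome_type="MAG", live=False):
    """List manifests for a live run, or for a test run when live is False."""
    upload_dir = Path(output_dir) / f"{genome_type}_upload"
    manifest_dir = upload_dir / ("manifests" if live else "manifests_test")
    if not manifest_dir.is_dir():
        return []
    # rglob tolerates nested folders, since the tree above elides the layout.
    return sorted(p for p in manifest_dir.rglob("*.manifest") if p.is_file())
```

For example, `manifest_files("upload_out", "bin", live=True)` would list the manifests written by a live bin registration, if any exist.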

## Upload genomes
Once manifest files are generated, it is necessary to use ENA's webin-cli resource to upload genomes.

To test your submission (i.e. you registered your samples with `genome_upload` without the `--live` option), add the `-test` argument.

A live execution example within this repo is the following:
```bash
java -jar ./webin-cli.jar \
  -context=genome \
  -manifest=ERR123456_bin.1.manifest \
  -userName="Webin-XXX" \
  -password="YYY" \
  -submit
```

More information on ENA's webin-cli can be found [here](<https://ena-docs.readthedocs.io/en/latest/submit/general-guide/webin-cli.html>).
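
Since every genome has its own manifest, the upload is usually repeated once per file. The sketch below builds the same webin-cli invocation as the example above for a given manifest; the function name `build_webin_command` and the default jar path are assumptions, and credentials are passed in explicitly rather than read from any particular location.

```python
def build_webin_command(manifest, username, password, test=True,
                        jar="./webin-cli.jar"):
    """Build the webin-cli argument list for one manifest file."""
    command = ["java", "-jar", jar,
               "-context=genome",
               f"-manifest={manifest}",
               f"-userName={username}",
               f"-password={password}"]
    if test:
        # Needed when samples were registered without --live.
        command.append("-test")
    command.append("-submit")
    return command


# Usage: a test submission for a single manifest.
print(build_webin_command("ERR123456_bin.1.manifest", "Webin-XXX", "YYY"))
```

Each list can be run with `subprocess.run(command, check=True)`. Passing an argument list rather than a shell string avoids quoting problems with passwords that contain special characters.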

Raw data

{
    "_id": null,
    "home_page": null,
    "name": "genome-uploader",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.10",
    "maintainer_email": null,
    "keywords": "bioinformatics, tool, metagenomics",
    "author": null,
    "author_email": "MGnify team <metagenomics-help@ebi.ac.uk>",
    "download_url": "https://files.pythonhosted.org/packages/a6/fa/1bc416dffdc1f1614b6ee7a19dca34f02931990fefd4caadac812967834a/genome_uploader-2.4.0.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "Apache Software License 2.0",
    "summary": "Python script to upload bins and MAGs in fasta format to ENA (European Nucleotide Archive). This script generates xmls and manifests necessary for submission with webin-cli.",
    "version": "2.4.0",
    "project_urls": {
        "Homepage": "https://github.com/EBI-Metagenomics/genome_uploader",
        "Issues": "https://github.com/EBI-Metagenomics/genome_uploader/issues"
    },
    "split_keywords": [
        "bioinformatics",
        " tool",
        " metagenomics"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "92f81f871ad1658848bcc7924f10d75190bcfe8ff836fcf2192a0dbdb38efd65",
                "md5": "751d44f3db6962839dfd342bd601fb9b",
                "sha256": "eb46aaa371945eea2b4e0a5eca26eeeaec1a056a7e26f5dbfb5a26cbd2d10c7f"
            },
            "downloads": -1,
            "filename": "genome_uploader-2.4.0-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "751d44f3db6962839dfd342bd601fb9b",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.10",
            "size": 31057,
            "upload_time": "2025-07-31T10:52:10",
            "upload_time_iso_8601": "2025-07-31T10:52:10.840542Z",
            "url": "https://files.pythonhosted.org/packages/92/f8/1f871ad1658848bcc7924f10d75190bcfe8ff836fcf2192a0dbdb38efd65/genome_uploader-2.4.0-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "a6fa1bc416dffdc1f1614b6ee7a19dca34f02931990fefd4caadac812967834a",
                "md5": "1c5b73c4e97c73d9ea82c7e1e756a4d9",
                "sha256": "29667c468b3291de1ef2b879b7f77a8ba1934684c29fe71c9b657b3e91a24b30"
            },
            "downloads": -1,
            "filename": "genome_uploader-2.4.0.tar.gz",
            "has_sig": false,
            "md5_digest": "1c5b73c4e97c73d9ea82c7e1e756a4d9",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.10",
            "size": 30590,
            "upload_time": "2025-07-31T10:52:12",
            "upload_time_iso_8601": "2025-07-31T10:52:12.099811Z",
            "url": "https://files.pythonhosted.org/packages/a6/fa/1bc416dffdc1f1614b6ee7a19dca34f02931990fefd4caadac812967834a/genome_uploader-2.4.0.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-07-31 10:52:12",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "EBI-Metagenomics",
    "github_project": "genome_uploader",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "lcname": "genome-uploader"
}
