opengenomebrowser-tools

Name	opengenomebrowser-tools
Version	0.0.9
home_page	https://github.com/opengenomebrowser/opengenomebrowser-tools
Summary	Set of scripts to aid OpenGenomeBrowser administrators import data
upload_time	2023-09-08 17:23:21
author	Thomas Roder
license	MIT

# OpenGenomeBrowser Tools

A set of scripts that helps to import genome data into the OpenGenomeBrowser folder structure.

## Installation

This package requires at least `Python 3.9`.

```bash
pip install opengenomebrowser-tools
```

## Help function

All scripts have a help function, for example:

```bash
import_genome --help
```

## `init_folder_structure`

Creates a basic OpenGenomeBrowser folder structure.

<details>
  <summary>More details:</summary>

Once the folder structure has been initiated...

- use [`import_genome`](#import_genome) to add genomes to the folder structure
- use [`download_ncbi_genome`](#download_ncbi_genome) and [`import_genome`](#import_genome) to download and add genomes from NCBI
- when all genomes have been added, use [`init_orthofinder`](#init_orthofinder) and [`import_orthofinder`](#import_orthofinder) to calculate
  orthologs (optional)

Usage:

```shell
export FOLDER_STRUCTURE=/path/to/folder_structure
init_folder_structure  # or --folder_structure_dir=/path/to/folder_structure
```

<details>
  <summary>Result:</summary>

```
  folder_structure
  ├── organisms
  ├── annotations.json
  ├── annotation-descriptions
  │   ├── SL.tsv
  │   ├── KO.tsv
  │   ├── KR.tsv
  │   ├── EC.tsv
  │   └── GO.tsv
  ├── orthologs
  └── pathway-maps
      ├── type_dictionary.json
      └── svg
```

</details>
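
As a quick sanity check after initialization, the layout can be verified with a few lines of Python. This is a minimal sketch and not part of opengenomebrowser-tools: the helper name `missing_entries` is hypothetical, and the list of expected entries is simply copied from the top level of the result tree.

```python
from pathlib import Path

# Top-level entries copied from the result tree; adjust if your version differs.
EXPECTED_ENTRIES = [
    "organisms",
    "annotations.json",
    "annotation-descriptions",
    "orthologs",
    "pathway-maps",
]


def missing_entries(folder_structure: str) -> list[str]:
    """Return the expected top-level entries that do not exist (hypothetical helper)."""
    root = Path(folder_structure)
    return [name for name in EXPECTED_ENTRIES if not (root / name).exists()]
```

Calling `missing_entries("/path/to/folder_structure")` returns an empty list when every expected entry is present.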

</details>

## `import_genome`

Imports genome-associated files into the OpenGenomeBrowser folder structure and automatically generates metadata files.

<details>
  <summary>More details:</summary>

If the annotation was performed using the proper organism name, genome identifier and taxonomic information (recommended), the import is
straightforward because no files need to be renamed.

```shell
export FOLDER_STRUCTURE=/path/to/folder_structure   # this directory contains the 'organisms' folder
import_genome --import_dir=/prokka/out/dir  # optional: add "--organism STRAIN --genome STRAIN.1" as sanity check
```

<details>
  <summary>How to run prokka to get correct locus tags</summary>

Suppose the desired organism name is `STRAIN` and the genome identifier is `STRAIN.1`. This is how to run prokka:

```shell
# --genus and --species are optional. If set, import_genome can automatically detect the taxid.
prokka \
  --strain STRAIN \
  --locustag STRAIN.1 \
  --prefix STRAIN.1 \
  --genus Mycoplasma --species genitalium \
  --outdir /prokka/out/dir \
  assembly.fasta
```

</details>

<details>
  <summary>How to run PGAP to get correct locus tags</summary>

Suppose the desired organism name is `STRAIN` and the genome identifier is `STRAIN.1`. These are the lines in PGAP's `submol.yaml` that are relevant to
this script:

```yaml
organism:
  genus_species: 'Mycoplasma genitalium'  # Optional. If set, this script can automatically detect the taxid.
  strain: 'STRAIN'
locus_tag_prefix: 'STRAIN.1'
bioproject: 'PRJNA9999999'  # Optional. If set, this script can automatically add it to bioproject_accession in genome.json.
biosample: 'SAMN99999999'  # Optional. If set, this script can automatically add it to biosample_accession in genome.json.
publications: # Optional. If set, this script can automatically add it to the literature_references in genome.json.
  - publication:
      pmid: 16397293
```

</details>

### Rename files during import

If the locus tags do not start with the genome identifier, the files need to be renamed accordingly. The `import_genome` command can do this
automatically using the `--rename` flag.

```shell
export FOLDER_STRUCTURE=/path/to/folder_structure   # this directory contains the 'organisms' folder
import_genome --import_dir=/prokka/out/dir --organism STRAIN --genome STRAIN.1 --rename
```

The renaming is provided as-is and was only tested on files produced by certain versions of prokka and PGAP. If there is an error, you must rename
the files manually (with or without the help of my [renaming scripts](#rename_)) and then import them as
described in the previous section.
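
Whether `--rename` is needed can be checked beforehand by looking at the headers of a protein or nucleotide FASTA file. The following is a minimal sketch, assuming that each FASTA header starts with the locus tag (as in prokka output); `locus_tags_match` is a hypothetical helper and not part of this package.

```python
def locus_tags_match(fasta_path: str, genome: str) -> bool:
    """Return True if every FASTA header starts with the genome identifier.

    Assumes the locus tag is the first thing in each header,
    e.g. '>STRAIN.1_00001 hypothetical protein'.
    """
    with open(fasta_path) as handle:
        headers = [line[1:].strip() for line in handle if line.startswith(">")]
    # An empty file has no locus tags to match, so it counts as a mismatch.
    return bool(headers) and all(header.startswith(genome) for header in headers)
```

If this returns `False` for `STRAIN.1`, the import will most likely need `--rename` or a manual renaming step.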

### Required files

These files need to be in `import_dir`:

- `.fna`: assembly (FASTA)
- `.gbk`: GenBank file
- `.gff`: General feature format file

Optional files:

- `.faa`: protein sequences (FASTA). If non-existent, it will automatically be generated from the `.gbk` file
- `.ffn`: nucleotides file (FASTA). If non-existent, it will automatically be generated from the `.gbk` file
- `.sqn`: required for submission to GenBank, not really used by OpenGenomeBrowser
- `.emapper.annotations`: Eggnog annotation file
- `.XX`: custom annotation file (e.g. `.EC`, `.GO`, etc.; any file with a suffix of two upper-case letters is detected as a custom annotation)
- `_busco.txt`: BUSCO output file, content will be added to `genome.json`
- `genome.json`: content will be added to the final `genome.json`, may be as simple as `{"assembly_tool": "SPAdes"}`
- `organism.json`: content will be added to the final `organism.json`, may be as simple as `{"restricted": true}`
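
Before running `import_genome`, it can be useful to check that `import_dir` contains the required files. This is a minimal sketch based only on the file lists above; the helper names are hypothetical, and the actual detection logic of the package may differ.

```python
import re
from pathlib import Path

REQUIRED_SUFFIXES = (".fna", ".gbk", ".gff")


def missing_required_files(import_dir: str) -> list[str]:
    """Return the required suffixes for which import_dir contains no file."""
    names = [p.name for p in Path(import_dir).iterdir() if p.is_file()]
    return [s for s in REQUIRED_SUFFIXES if not any(n.endswith(s) for n in names)]


def custom_annotation_files(import_dir: str) -> list[str]:
    """Return file names whose suffix consists of two upper-case letters."""
    names = sorted(p.name for p in Path(import_dir).iterdir() if p.is_file())
    return [n for n in names if re.fullmatch(r".*\.[A-Z]{2}", n)]
```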

<details>
  <summary>Example result:</summary>

```text
#### folder structure ####
folder_structure
└── organisms
    └── STRAIN
        ├── organism.json
        └── genomes
            └── STRAIN.1
                ├── genome.json
                ├── STRAIN.1.faa
                ├── STRAIN.1.ffn
                ├── STRAIN.1.fna
                ├── STRAIN.1.gbk
                ├── STRAIN.1.gff
                ├── STRAIN.1.sqn
                └── rest
                    ├── PROKKA_08112021.err
                    ├── PROKKA_08112021.fsa
                    ├── PROKKA_08112021.log
                    ├── PROKKA_08112021.tbl
                    ├── PROKKA_08112021.tsv
                    ├── PROKKA_08112021.txt
                    └── short_summary.specific.lactobacillales_odb10.FAM3228-i1-1_busco.txt
```

</details>

### Modify where files are moved to

It is possible to change where files end up in the folder structure. The behaviour is determined by a config file in JSON format that can be specified
with the `--import_settings` parameter or the `OGB_IMPORT_SETTINGS` environment variable.

```shell
export OGB_IMPORT_SETTINGS=/path/to/import_config.json
```

<details>
  <summary>These are the default settings:</summary>

```text
{
    "organism_template": {},                           # use this to add metadata to all imported organism.json files, e.g. {"restricted": true}
    "genome_template": {},                             # use this to add metadata to all imported genome.json files, e.g. {"assembly_tool": "SPAdes"}
    "path_transformer": {
        ".*\\.fna": "{genome}.{suffix}",               # all files that match the regex will end up in organisms/STRAIN/genomes/STRAIN.1/STRAIN.1.fna
        ".*\\.faa": "{genome}.{suffix}",
        ".*\\.gbk": "{genome}.{suffix}",
        ".*\\.gff": "{genome}.{suffix}",
        ".*\\.sqn": "{genome}.{suffix}",
        ".*\\.ffn": "{genome}.{suffix}",
        ".*\\.emapper.annotations": "{genome}.eggnog",
        ".*\\.[A-Z]{2}": "{genome}.{suffix}",
        "genome.md": "genome.md", 
        "organism.md": "../../organism.md",            # this file will end up in /organisms/STRAIN/organism.md
        "genome.json": null,                           # this file will not be copied
        "organism.json": null,                         # this file will not be copied
        ".*": "rest/{original_path}"                   # this regex matches all files, thus all files that did not match any previous regex
                                                       #   will end up in .../STRAIN.1/rest/
    }
}
```

</details>
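
One plausible reading of these settings is "the first matching regex wins". The sketch below illustrates that reading; it is not the package's implementation. It assumes that patterns are tried in order with `re.fullmatch` against the file name, that `{suffix}` is the text after the last dot, and that `null` means the file is skipped. The name `resolve_target` and the shortened pattern table are hypothetical.

```python
import re
from typing import Optional

# Shortened copy of the default path_transformer (None corresponds to JSON null).
PATH_TRANSFORMER = {
    r".*\.fna": "{genome}.{suffix}",
    r".*\.emapper.annotations": "{genome}.eggnog",
    r".*\.[A-Z]{2}": "{genome}.{suffix}",
    r"genome.json": None,
    r".*": "rest/{original_path}",
}


def resolve_target(file_name: str, genome: str, transformer: dict) -> Optional[str]:
    """Return the target path relative to the genome folder, or None if the file is skipped."""
    suffix = file_name.rsplit(".", 1)[-1]
    for pattern, template in transformer.items():  # dicts preserve insertion order
        if re.fullmatch(pattern, file_name):
            if template is None:
                return None
            return template.format(genome=genome, suffix=suffix, original_path=file_name)
    return None


print(resolve_target("PROKKA_08112021.fna", "STRAIN.1", PATH_TRANSFORMER))  # STRAIN.1.fna
print(resolve_target("PROKKA_08112021.log", "STRAIN.1", PATH_TRANSFORMER))  # rest/PROKKA_08112021.log
```

Under these assumptions the order of the entries matters: the catch-all `.*` pattern has to come last, otherwise it would shadow every more specific rule.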

<details>
  <summary>This is an example of an alternative configuration:</summary>

```text
{
    "organism_template": {},
    "genome_template": {},
    "path_transformer": {
        
        # raw reads
        ".*fastqc?\\..*": "0_raw_reads/{original_path}",
        
        # assembly
        ".*\\.fna": "1_assembly/{genome}.{suffix}",
        
        # coding sequence (CDS) calling
        ".*\\.faa": "2_cds/{genome}.{suffix}",
        ".*\\.gbk": "2_cds/{genome}.{suffix}",
        ".*\\.gff": "2_cds/{genome}.{suffix}",
        ".*\\.ffn": "2_cds/{genome}.{suffix}",
        ".*\\.sqn": "2_cds/{genome}.{suffix}",
        "PROKKA_.*": "2_cds/{original_path}",
        
        # functional annotations
        ".*\\.emapper.annotations": "3_annotation/{genome}.eggnog",
        ".*\\.[A-Z]{2}": "3_annotation/{genome}.{suffix}",
        ".*_busco\\.txt": "3_annotation/{original_path}",
        
        # special files
        "genome.md": "genome.md",
        "organism.md": "../../organism.md",
        "genome.json": null,
        "organism.json": null,
        
        # rest
        ".*": "rest/{original_path}"
    }
}
```

Result:

```text
#### folder structure ####
folder_structure
└── organisms
    └── STRAIN
        ├── organism.json
        └── genomes
            └── STRAIN.1
                ├── genome.json
                ├── 1_assembly
                │   └── STRAIN.1.fna
                ├── 2_cds
                │   ├── PROKKA_08112021.err
                │   ├── PROKKA_08112021.fsa
                │   ├── PROKKA_08112021.log
                │   ├── PROKKA_08112021.tbl
                │   ├── PROKKA_08112021.tsv
                │   ├── PROKKA_08112021.txt
                │   ├── STRAIN.1.faa
                │   ├── STRAIN.1.ffn
                │   ├── STRAIN.1.gbk
                │   ├── STRAIN.1.gff
                │   └── STRAIN.1.sqn
                └── 3_annotation
                    └── short_summary_busco.txt
```

</details>

### Add custom metadata

There are two ways to achieve this:

1) Add an `organism.json` and/or `genome.json` file to `import_dir` (see [import_genome: Required files](#required-files))
2) Set a global `organism_template` and/or `genome_template` in the import settings; these are used as a basis for all future imports
   (see [import_genome: Modify where files are moved to](#modify-where-files-are-moved-to))
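
The default settings in the previous section are annotated with `#` comments for explanation; standard JSON has no comment syntax, so a settings file generated with Python's `json` module will not contain them. The following is a minimal sketch that writes a settings file with global templates. It assumes that the file uses the same three top-level keys as the defaults; `write_import_settings` and the output path are hypothetical.

```python
import json

# Copy of the default path_transformer (None is written as JSON null).
DEFAULT_PATH_TRANSFORMER = {
    r".*\.fna": "{genome}.{suffix}",
    r".*\.faa": "{genome}.{suffix}",
    r".*\.gbk": "{genome}.{suffix}",
    r".*\.gff": "{genome}.{suffix}",
    r".*\.sqn": "{genome}.{suffix}",
    r".*\.ffn": "{genome}.{suffix}",
    r".*\.emapper.annotations": "{genome}.eggnog",
    r".*\.[A-Z]{2}": "{genome}.{suffix}",
    "genome.md": "genome.md",
    "organism.md": "../../organism.md",
    "genome.json": None,
    "organism.json": None,
    ".*": "rest/{original_path}",
}


def write_import_settings(path: str, organism_template: dict, genome_template: dict) -> dict:
    """Write an import settings file and return the settings that were written."""
    settings = {
        "organism_template": organism_template,
        "genome_template": genome_template,
        "path_transformer": DEFAULT_PATH_TRANSFORMER,
    }
    with open(path, "w") as handle:
        json.dump(settings, handle, indent=4)
    return settings
```

For example, `write_import_settings("import_config.json", {"restricted": True}, {"assembly_tool": "SPAdes"})` produces a file that can then be referenced via `OGB_IMPORT_SETTINGS`.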

</details>

## `rename_*`

The following scripts change the locus tags in the respective file formats.

| Script                      | Purpose                                                                  |
|-----------------------------|--------------------------------------------------------------------------|
| `rename_fasta`              | Change locus tags of protein or nucleotide FASTA files                   |
| `rename_genbank`            | Change locus tags of GenBank files (tested with prokka and PGAP files)   |
| `rename_gff`                | Change locus tags of GFF (general feature format) files                  |
| `rename_eggnog`             | Change locus tags of Eggnog files (`.emapper.annotations`)               |
| `rename_custom_annotations` | Change locus tags of custom annotations files                            |

<details>
  <summary>More details:</summary>

The syntax is always the same.

```shell
rename_fasta \
  --file /path/to/input.file \
  --out /path/to/output.file \
  --new_locus_tag_prefix STRAIN.2 \
  --old_locus_tag_prefix STRAIN.1  # optional, good as sanity check
```
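
To illustrate what changing a locus tag prefix means for a FASTA file, here is a minimal sketch. It is not the implementation used by `rename_fasta`; it assumes that every header starts with the old prefix, and the function name `rename_fasta_headers` is hypothetical.

```python
def rename_fasta_headers(lines, old_prefix: str, new_prefix: str):
    """Yield FASTA lines with old_prefix replaced by new_prefix at the start of each header."""
    for line in lines:
        if line.startswith(">"):
            if not line[1:].startswith(old_prefix):
                # Mirrors the idea of a sanity check: refuse to rename unexpected headers.
                raise ValueError(f"Header does not start with {old_prefix!r}: {line.strip()}")
            line = ">" + new_prefix + line[1 + len(old_prefix):]
        yield line
```

Sequence lines are passed through unchanged; only header lines are rewritten.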

</details>

## `reindex_assembly`

This script changes the headers of assembly FASTA (`.fna`) files.

<details>
  <summary>More details:</summary>

```shell
reindex_assembly \
  --file /path/to/input.file \
  --out /path/to/output.file \
  --prefix STRAIN_scf \
  --leading_zeroes 5  # optional
```

This would transform a FASTA header like this `>anything here` into `>STRAIN_scf_00001`.
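
The transformation can be sketched in a few lines of Python. This is an illustration only, not the package's code: it assumes that scaffolds are numbered consecutively starting at 1 and that the original header text is discarded; `reindex_headers` is a hypothetical name.

```python
def reindex_headers(lines, prefix: str, leading_zeroes: int = 5):
    """Yield FASTA lines with each header replaced by '>{prefix}_{zero-padded index}'."""
    counter = 0
    for line in lines:
        if line.startswith(">"):
            counter += 1
            line = f">{prefix}_{counter:0{leading_zeroes}d}\n"
        yield line


fasta = [">anything here\n", "ACGT\n", ">second contig\n", "GGCC\n"]
print("".join(reindex_headers(fasta, "STRAIN_scf")), end="")
```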

</details>

## `genbank_to_fasta`

Convert GenBank to nucleotide (`.ffn`) or protein FASTA (`.faa`).

<details>
  <summary>More details:</summary>

Usage:

```shell
genbank_to_fasta \
  --gbk /path/to/input.gbk \
  --out /path/to/output.fasta \
  --format faa  # or ffn
```

</details>

## `download_ncbi_genome`

Download genome-associated files (`.fna`, `.gbk`, `.gff`) from NCBI, rename the locus tags, and generate `.ffn` and `.faa` files.

<details>
  <summary>More details:</summary>

Usage:

```shell
download_ncbi_genome \
  --assembly_name GCF_005864195.1 \
  --out_dir /path/to/outdir \
  --new_locus_tag_prefix FAM3257_ 
```

Result:

```text
outdir
├── FAM3257.faa
├── FAM3257.ffn
├── FAM3257.fna
├── FAM3257.gbk
└── FAM3257.gff
```

The next step might be to import these genomes into the OpenGenomeBrowser folder structure like this:

```shell
import_genome --import_dir=/path/to/outdir --organism FAM3257 --genome FAM3257
```

</details>

## `init_orthofinder`

This script collects the protein FASTAs in `folder_structure/OrthoFinder/fastas` and prints the command to run OrthoFinder.

<details>
  <summary>More details:</summary>

Usage:

```shell
export FOLDER_STRUCTURE=/path/to/folder_structure
init_orthofinder --representatives_only
```

Result:

```
  folder_structure
  ├── ...
  └── OrthoFinder
      └── fastas
          ├── GENOME1.faa
          ├── GENOME2.faa
          └── ...
```

</details>

## `import_orthofinder`

The output of OrthoFinder needs to be processed for OpenGenomeBrowser. This script creates two files:

- `annotation-descriptions/OL.tsv`: maps orthologs to the most common gene name, e.g. `OG0000005` -> `MFS transporter`
- `orthologs/orthologs.tsv`: maps orthologs to genes, e.g. `OG0000005` -> `STRAIN1_000069, STRAIN2_000128, STRAIN2_000137`
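
The idea behind the first file, picking the most common gene name per orthogroup, can be sketched as follows. This is an in-memory illustration under stated assumptions and not the package's implementation: the input dictionaries and the function name `describe_orthogroups` are hypothetical, and genes without a known name are simply ignored.

```python
from collections import Counter


def describe_orthogroups(orthologs: dict[str, list[str]], gene_names: dict[str, str]) -> dict[str, str]:
    """Map each orthogroup to the most common name among its genes."""
    descriptions = {}
    for orthogroup, genes in orthologs.items():
        names = Counter(gene_names[gene] for gene in genes if gene in gene_names)
        if names:
            descriptions[orthogroup] = names.most_common(1)[0][0]
    return descriptions


orthologs = {"OG0000005": ["STRAIN1_000069", "STRAIN2_000128", "STRAIN2_000137"]}
gene_names = {
    "STRAIN1_000069": "MFS transporter",
    "STRAIN2_000128": "MFS transporter",
    "STRAIN2_000137": "hypothetical protein",
}
print(describe_orthogroups(orthologs, gene_names))  # {'OG0000005': 'MFS transporter'}
```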

<details>
  <summary>More details:</summary>

Usage:

```shell
export FOLDER_STRUCTURE=/path/to/folder_structure
import_orthofinder --which hog  # 'hog' for hierarchical orthogroups and 'og' for regular orthogroups
```

Once these files exist, run the following command from within the OpenGenomeBrowser docker container:

```shell
python db_setup/manage_ogb.py import-orthologs
```

</details>

## `update_folder_structure`

From time to time, changes are made to the OpenGenomeBrowser folder structure. The current version of your folder structure is denoted
in `version.json`. Use this script to upgrade to a new version.

<details>
  <summary>More details:</summary>

- `1_to_2`: add `COG` to genome.json

</details>
