pygcap

- Name: pygcap
- Version: 1.2.5
- Home page: https://github.com/jrim42/pyGCAP
- Summary: Python package for probe-based gene cluster finding in large microbial genome database
- Upload time: 2024-07-30 01:43:48
- Author: jsrim
- Requires Python: >=3.6
- Keywords: gene, cluster, genomics, bioinformatics

# pyGCAP: a (py)thon (G)ene (C)luster (A)nnotation & (P)rofiling

A Python Package for Probe-based Gene Cluster Finding in Large Microbial Genome Database

- [Introduction](#Introduction)
- [Pipeline-flow](#Pipeline-flow)
- [Pre-requirement](#Pre-requirement)
- [Usage](#Usage)
- [Output](#Output)

---

## Introduction

Bacterial gene clusters provide insights into metabolism and evolution, and facilitate biotechnological applications. We developed pyGCAP, a Python package for probe-based gene cluster discovery. The pipeline uses sequence search and analysis tools and public databases (e.g. BLAST, MMseqs2, UniProt, and NCBI) to predict potential gene clusters from user-provided probe genes. We tested the pipeline with the division and cell wall (dcw) gene cluster, which is crucial for cell division and peptidoglycan biosynthesis.

To evaluate pyGCAP, we used 17 major dcw genes defined by Megrian et al. [1] as a probe set to search for gene clusters in 716 Lactobacillales genomes. The results were integrated to provide detailed information on gene content, gene order, and types of clusters. While PGCfinder examined the completeness of the gene clusters, it could also suggest novel taxa-specific accessory genes related to dcw clusters in Lactobacillales genomes. The package will be freely available on the Python Package Index, Bioconda, and GitHub.

[1] Megrian, D., et al. [Ancient origin and constrained evolution of the division and cell wall gene cluster in Bacteria](https://www.nature.com/articles/s41564-022-01257-y). Nat Microbiol 7, 2114–2127 (2022).

---

## Pipeline-flow

<p align="center">
  <img width="1000" alt="flowchart" src="https://github.com/user-attachments/assets/e957794d-091c-4931-a0c9-fd013f02d307">
</p>

---

## Pre-requirement

1. `Python` >= 3.6
2. `conda` environment

   - `blast` ([bioconda blast package](https://anaconda.org/bioconda/blast))

     ```
     conda install bioconda::blast
     conda install bioconda/label/cf201901::blast
     ```

   - `datasets` & `dataformat` from NCBI ([conda-forge ncbi-datasets-cli package](https://anaconda.org/conda-forge/ncbi-datasets-cli))

     ```
     conda install conda-forge::ncbi-datasets-cli
     ```

   - `MMseqs2` ([MMseqs2 github](https://github.com/soedinglab/MMseqs2))

     ```
     conda install -c conda-forge -c bioconda mmseqs2
     ```

   - If you want to make a new conda environment for pygcap, follow the instructions below:

     ```
     conda create -n pygcap
     conda activate pygcap
     pip install pygcap   # or: conda install bioconda::pygcap
     conda install -c conda-forge ncbi-datasets-cli
     conda install -c conda-forge -c bioconda mmseqs2
     ```
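Before running the pipeline, it can help to confirm that the external tools are actually reachable from the active environment. The following is a minimal sketch, not part of pyGCAP itself. The executable names (`blastp`, `makeblastdb`, `mmseqs`, `datasets`, `dataformat`) are assumed from the packages listed above, and `missing_tools` is a hypothetical helper name.

```python
import shutil

# Executable names assumed from the packages listed above:
# BLAST+ (blastp, makeblastdb), MMseqs2 (mmseqs),
# NCBI Datasets CLI (datasets, dataformat).
REQUIRED_TOOLS = ("blastp", "makeblastdb", "mmseqs", "datasets", "dataformat")


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the tools from `tools` that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing from PATH:", ", ".join(missing))
    else:
        print("All required tools were found.")
```

Run it inside the activated conda environment. An empty result suggests the pre-requirements are installed, although it does not check tool versions.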

---

## Usage

- pypi pygcap ([link](https://pypi.org/project/pygcap/)) / bioconda pygcap ([link](https://anaconda.org/bioconda/pygcap))

  ```
  # pip install pygcap
  # conda install bioconda::pygcap
  pygcap [TAXON] [PROBE_FILE]
  ```

- input argument description

  ```
  ### usage example
  pygcap Facklamia pygcap/data/probe_sample.tsv
  pygcap 66831 pygcap/data/probe_sample.tsv
  ```

  1.  `TAXON`: the target taxon, given either as a name or as a taxid (both are accepted)
  2.  `PROBE_FILE`: path of `probe.tsv` ([sample file](https://github.com/jrim42/pyGCAP/blob/main/pygcap/data/probe_sample.tsv))

      - `Probe Name` (user defined)
      - `Prediction` (user defined)
      - `Accession` (UniProt entry)

- When the appropriate environment is set up, try running the following command from the root directory. If you have successfully met all the pre-requirements, it will execute correctly, and a directory named 'Facklamia' containing the test results will be created in the root directory.

  ```
  python3 test.py
  # or
  pygcap Facklamia pygcap/data/probe_sample.tsv
  ```
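A probe file can be sanity-checked before a long run. The sketch below is not pyGCAP's own validation. It assumes the file is tab-separated with a header row holding the three column names described above (`Probe Name`, `Prediction`, `Accession`), and it checks each accession against the pattern UniProt publishes for accession numbers. `check_probe_table` is a hypothetical helper name.

```python
import csv
import io
import re

# Column names taken from the probe file description above
# (assumed to appear as a header row).
EXPECTED_COLUMNS = ["Probe Name", "Prediction", "Accession"]

# Pattern published by UniProt for accession numbers.
UNIPROT_ACCESSION = re.compile(
    r"^(?:[OPQ][0-9][A-Z0-9]{3}[0-9]"
    r"|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2})$"
)


def check_probe_table(text):
    """Return a list of human-readable problems found in probe TSV text."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    header = reader.fieldnames or []
    missing = [col for col in EXPECTED_COLUMNS if col not in header]
    if missing:
        return ["missing column(s): " + ", ".join(missing)]

    problems = []
    # Line 1 is the header, so data rows start at line 2.
    for line_no, row in enumerate(reader, start=2):
        accession = (row["Accession"] or "").strip()
        if not UNIPROT_ACCESSION.match(accession):
            problems.append(f"line {line_no}: unexpected accession {accession!r}")
    return problems
```

Passing the contents of a probe file (for example `Path("probe.tsv").read_text()`) returns an empty list when every row looks well formed.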

### Options

1. `--working_dir` or `-w` (default: `.`): Specify the working directory path.

   ```
   pygcap [TAXON] [PROBE_FILE] --working_dir or -w [PATH_OF_WORKING_DIRECTORY]
   ```

2. `--thread` or `-t` (default: `50`): Number of threads to use when running MMseqs2 and blastp. The number of threads can be adjusted automatically based on the CPU environment. It must be an integer greater than 0.

   ```
   pygcap [TAXON] [PROBE_FILE] --thread or -t [NUMBER_OF_THREAD]
   ```

3. `--identity` or `-i` (default: `0.5`): The protein identity value to use in MMseqs2. It must be a value between 0 and 1.

   ```
   pygcap [TAXON] [PROBE_FILE] --identity or -i [PROTEIN_IDENTITY]
   ```

4. `--max_target_seqs` or `-m` (default: `500`): The maximum number of aligned sequences to retain in the overall BLASTP results. It must be an integer greater than 0.

   ```
   pygcap [TAXON] [PROBE_FILE] --max_target_seqs or -m [MAX_TARGET]
   ```

5. `--skip` or `-s` (default: `none`): Specify steps to skip during the process. Multiple steps can be skipped by using this option multiple times. This option is useful when you want to add a new probe to the same TAXON as before or when you want to change the identity option for MMseqs2.

   ```
   pygcap [TAXON] [PROBE_FILE] --skip or -s [ARG]
   ```

   - `all`: Skip all the processes listed below.
   - `ncbi`: Skip downloading genome data from NCBI.
   - `mmseqs2`: Skip running MMseqs2.
   - `parsing`: Skip parsing genome data.
   - `uniprot`: Skip downloading probe data from UniProt.
   - `blastdb`: Skip running makeblastdb.
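The options above map naturally onto an `argparse` definition. The sketch below mirrors the documented flags, defaults, and value constraints, but it is an illustration only and not pyGCAP's actual parser. The helper names, the `pygcap-sketch` program name, treating the identity bounds as inclusive, and the thread-capping policy in `effective_threads` are all assumptions.

```python
import argparse
import os

# Step names listed under --skip above.
SKIP_CHOICES = ("all", "ncbi", "mmseqs2", "parsing", "uniprot", "blastdb")


def positive_int(value):
    """Parse an integer greater than 0."""
    number = int(value)
    if number <= 0:
        raise argparse.ArgumentTypeError("must be an integer greater than 0")
    return number


def fraction(value):
    """Parse a number between 0 and 1 (bounds assumed inclusive)."""
    number = float(value)
    if not 0 <= number <= 1:
        raise argparse.ArgumentTypeError("must be a value between 0 and 1")
    return number


def build_parser():
    parser = argparse.ArgumentParser(prog="pygcap-sketch")
    parser.add_argument("taxon")
    parser.add_argument("probe_file")
    parser.add_argument("-w", "--working_dir", default=".")
    parser.add_argument("-t", "--thread", type=positive_int, default=50)
    parser.add_argument("-i", "--identity", type=fraction, default=0.5)
    parser.add_argument("-m", "--max_target_seqs", type=positive_int, default=500)
    # action="append" lets -s be repeated to skip several steps.
    parser.add_argument("-s", "--skip", action="append",
                        choices=SKIP_CHOICES, default=[])
    return parser


def effective_threads(requested, cpu_count=None):
    """Cap the requested thread count at the available CPUs (assumed policy)."""
    available = cpu_count if cpu_count is not None else (os.cpu_count() or 1)
    return max(1, min(requested, available))
```

For example, `build_parser().parse_args(["Facklamia", "probe.tsv", "-s", "ncbi", "-s", "mmseqs2"])` collects both skipped steps into a single list, which matches the "use this option multiple times" behaviour described for `--skip`.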

---

## Output

- A directory with the following structure will be created in your `working directory` with the name of the `TAXON` provided as input.

  ```
  📦 [TAXON_NAME]
  ├─ data
  │  ├─ assembly_report.tsv
  │  ├─ metadata_target.tsv
  │  └─ ...
  ├─ input
  │  ├─ [GENUS_01]
  │  ├─ [GENUS_02]
  │  └─ ...
  ├─ output
  │  ├─ genus
  │  ├─ img
  │  └─ tsv
  └─ seqlib
     ├─ blast_output.tsv
     ├─ seqlib.tsv
     └─ ...
  ```
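After a run, the layout shown above can be checked programmatically. This is a minimal sketch that assumes every entry without a file extension in the tree is a directory. `missing_output_dirs` is a hypothetical helper, not a pyGCAP function.

```python
from pathlib import Path

# Sub-directories named in the tree above (assumed to be directories).
EXPECTED_DIRS = (
    "data",
    "input",
    "output/genus",
    "output/img",
    "output/tsv",
    "seqlib",
)


def missing_output_dirs(working_dir, taxon_name):
    """Return expected sub-directories absent under working_dir/taxon_name."""
    root = Path(working_dir) / taxon_name
    return [rel for rel in EXPECTED_DIRS if not (root / rel).is_dir()]
```

Calling `missing_output_dirs(".", "Facklamia")` after the test run described in the Usage section should return an empty list if the full tree was created.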

### Example

Profiling _dcw_ genes from pan-genomes of Lactobacillales (LAB)

- The following are some of the result data you can obtain through this pipeline:

  - `working_directory/TAXON/output/img`: A heatmap representing the dcw gene contents of Lactobacillales at the genus level.

    <p align="center">
      <img width="1000" alt="example1" src="https://github.com/user-attachments/assets/745e0d71-cf7c-4796-8601-5793dba42960">
    </p>

  - `working_directory/TAXON/output/genus`: A plot visualizing the dcw gene order of Lactobacillales grouped by genus.

    <p align="center">
      <img width="1000" alt="example2" src="https://github.com/user-attachments/assets/f9a3fd2f-b636-4171-a5cb-257062ebd1f0">
    </p>

            
