sphae

Name	sphae
Version	1.4.6
home_page	https://github.com/linsalrob/sphae
Summary	Assembling pure culture phages from both Illumina and Nanopore sequencing technology
upload_time	2025-02-12 08:16:12
maintainer	None
docs_url	None
author	Bhavya Papudeshi
requires_python	None
license	The MIT License (MIT)
keywords	phage 'phage therapy' bioinformatics microbiology bacteria genome genomics
requirements	Bio argparse attrmap attrmap click igraph metasnek numpy pandas pytz snaketool_utils tkinter types

[![Edwards Lab](https://img.shields.io/badge/Bioinformatics-EdwardsLab-03A9F4)](https://edwards.flinders.edu.au)
[![DOI](https://zenodo.org/badge/403889262.svg)](https://zenodo.org/doi/10.5281/zenodo.8365088)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

![GitHub language count](https://img.shields.io/github/languages/count/linsalrob/spae)
[![](https://img.shields.io/static/v1?label=CLI&message=Snaketool&color=blueviolet)](https://github.com/beardymcjohnface/Snaketool)
![GitHub last commit (branch)](https://img.shields.io/github/last-commit/linsalrob/spae/main)
[![CI](https://github.com/linsalrob/spae/actions/workflows/testing.yml/badge.svg)](https://github.com/linsalrob/spae/actions/workflows/testing.yml)

[![install with pip](https://img.shields.io/static/v1?label=Install%20with&message=PIP&color=success)](https://pypi.org/project/sphae/)
[![Pip Downloads](https://static.pepy.tech/badge/sphae)](https://www.pepy.tech/projects/sphae)
[![install with bioconda](https://img.shields.io/badge/install%20with-bioconda-brightgreen.svg?style=flat)](http://bioconda.github.io/recipes/sphae/README.html)
[![Bioconda Downloads](https://img.shields.io/conda/dn/bioconda/sphae)](https://img.shields.io/conda/dn/bioconda/sphae)
![Docker Pulls](https://img.shields.io/docker/pulls/npbhavya/sphae.svg)


# Sphae 
## Phage toolkit to detect phage candidates for phage therapy
<p align="center">
  <img src="img/sphae.png#gh-light-mode-only" width="300">
  <img src="img/sphaedark.png#gh-dark-mode-only" width="300">
</p>


**Overview**

The steps that sphae takes are shown here:
<p align="center">
  <img src="img/sphae_steps.png#gh-light-mode-only" width="300">
</p>

This Snakemake workflow was built with Snaketool (https://doi.org/10.1371/journal.pcbi.1010705) to assemble and annotate phage sequences. The tool is currently being developed for phage genomes. The steps are:

- Quality control: removes adaptor sequences, low-quality reads and, optionally, host contamination.
- Assembly
- Contig quality checks: read coverage, viral classification, completeness, and assembly graph components.
- Phage genome annotation

**If you are new to bioinformatics or running command line tools, here is a great tutorial to follow: https://github.com/AnitaTarasenko/sphae/wiki/Sphae-tutorial**

**Cite Sphae: https://doi.org/10.1093/bioadv/vbaf004**

### Install 

**Pip install**

```bash
#creating a new environment
conda create -y -n sphae python=3.12
conda activate sphae
#install sphae 
pip install sphae
```

**Conda install** 
```bash
#creating a new environment
conda create -y -n sphae 
conda activate sphae
#install sphae
mamba install sphae
```
**Source install: environment setup**

Before installing from source, set up a new conda environment:

```bash
conda create -n sphae python=3.12
conda activate sphae
conda install -n base -c conda-forge mamba #if you don't already have mamba installed
```

**Container Install**

Two containers are available:
1. [Sphae v1.4.5 with databases](https://hub.docker.com/repository/docker/npbhavya/sphae)
   This is a very large container, about 17.5 GB, so it may take a while to download and install.

   Use the following commands to download the sphae container with databases:
    ```
    TMPDIR=<where your tmpdir lives>
    IMAGEDIR=<where you want the image to live>
    
    singularity pull --tmpdir=$TMPDIR --dir $IMAGEDIR docker://npbhavya/sphae:latest
    singularity exec sphae_latest.sif sphae --help
    singularity exec sphae_latest.sif sphae run --help
    singularity exec sphae_latest.sif sphae install --help

    singularity exec -B <path/to/inputfiles>:/input,<path/to/output>:/output sphae_latest.sif sphae run --input /input --output /output
    ```
    
2. [Sphae v1.4.5 **without** databases](https://hub.docker.com/repository/docker/npbhavya/sphae)
   This version of the sphae container does not include the databases. The advantage is that the container is smaller, so it is quicker to download.

   You will still need to install the databases with `sphae install` as outlined below.

   ```
   TMPDIR=<where your tmpdir lives>
   IMAGEDIR=<where you want the image to live>

   singularity pull --tmpdir=$TMPDIR --dir $IMAGEDIR docker://npbhavya/sphae:latest
   #test if sphae is installed
   singularity exec sphae_latest.sif sphae --help
   singularity exec sphae_latest.sif sphae run --help
   #mount the databases and input files to the image and run with a dataset
   #the -B bind list must not contain spaces after the commas
   singularity exec -B </path/to/databases>:/databases,<path/to/inputfiles>:/input,<path/to/output>:/output sphae_latest.sif sphae run --input /input --db_dir /databases --output /output
   ```
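
The `-B` option takes a comma-separated list of `source:destination` pairs with no spaces between them, which is easy to get wrong when typing long paths. Below is a minimal Python sketch for assembling that argument. The helper name `bind_arg` and the host paths are hypothetical placeholders, not part of sphae.

```python
from pathlib import PurePosixPath


def bind_arg(mounts: dict[str, str]) -> str:
    """Join host:container pairs into one comma-separated -B value (no spaces)."""
    pairs = []
    for host, container in mounts.items():
        # Normalise both sides so trailing slashes do not end up in the bind string.
        pairs.append(f"{PurePosixPath(host)}:{PurePosixPath(container)}")
    return ",".join(pairs)


# Placeholder host paths (assumptions); replace with your own directories.
mounts = {
    "/data/sphae_db": "/databases",
    "/data/reads": "/input",
    "/data/results": "/output",
}
command = [
    "singularity", "exec", "-B", bind_arg(mounts),
    "sphae_latest.sif", "sphae", "run",
    "--input", "/input", "--db_dir", "/databases", "--output", "/output",
]
print(" ".join(command))
```

The printed line can be pasted into a shell, or the `command` list can be handed to `subprocess.run` unchanged.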
   
**Source install**

```bash
#clone sphae repository
git clone https://github.com/linsalrob/sphae.git

#move to sphae folder
cd sphae

#install sphae
pip install -e .

#confirm the workflow is installed by running the below command 
sphae --help
```
You will still need to install the databases with `sphae install` as outlined below.


## Installing databases
Run one of the commands below:

```bash
#Installs the database to default directory, `sphae/workflow/databases`
sphae install

#Install database to specific directory
sphae install --db_dir <directory> 
```

  By default, the databases are installed to `sphae/workflow/databases`.

  This workflow requires the following databases:
  - Pfam 35.0 database, used by viral_verify for contig classification
  - CheckV database, to test for phage completeness
  - Pharokka databases
  - Phynteny models
  - Phold databases

This step requires ~17 GB of storage.
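
Because the download is large, it can help to confirm there is enough free space before running `sphae install`. The following is a small sketch using only the Python standard library. The function name `has_space` is hypothetical, and the 17 GB figure is taken from the estimate above rather than measured.

```python
import shutil
from pathlib import Path

REQUIRED_BYTES = 17 * 1024**3  # roughly 17 GB, the estimate quoted in this README


def has_space(directory: str, required: int = REQUIRED_BYTES) -> bool:
    """Return True if the filesystem holding `directory` has `required` bytes free."""
    path = Path(directory).resolve()
    # Walk up to the nearest existing parent so a not-yet-created directory still works.
    while not path.exists():
        path = path.parent
    return shutil.disk_usage(path).free >= required


target = "."  # placeholder: the directory you plan to pass to --db_dir
print("enough space" if has_space(target) else "not enough space")
```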

## Running the workflow

Sphae is developed to be modular: 
- `sphae run` will run QC, assembly and annotation
- `sphae annotate` will run only annotation steps
  
**Commands to run**

Only one command needs to be submitted to run all the above steps: QC, assembly and assembly stats

```bash
#For Illumina reads, place both the forward and reverse reads in one directory
#Make sure the fastq reads are saved as {sample_name}_R1.fastq and {sample_name}_R2.fastq, or gzipped as {sample_name}_R1.fastq.gz and {sample_name}_R2.fastq.gz
sphae run --input tests/data/illumina-subset --output example -k --use-conda --conda-frontend mamba

#For nanopore reads, place the reads, one file per sample in a directory
sphae run --input tests/data/nanopore-subset --sequencing longread --output example -k --use-conda --conda-frontend mamba

#For newer ONT sequencing data where polishing is not required, run the command
sphae run --input tests/data/nanopore-subset --sequencing longread --output example -k --no_medaka --use-conda --conda-frontend mamba

#To run any of these commands on a cluster, add --executor slurm to the command. This needs a little setup first.
#Set up a ~/.config/snakemake/slurm/config.yaml file - https://snakemake.github.io/snakemake-plugin-catalog/plugins/executor/slurm.html#advanced-resource-specifications
#The workflow may currently be set up to run only with slurm; this will be made more generic soon.
sphae run --input tests/data/nanopore-subset --sequencing longread --output example --profile slurm -k --threads 16 --use-conda --conda-frontend mamba

```
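
The paired-end naming convention described in the comments above (`{sample_name}_R1.fastq` and `{sample_name}_R2.fastq`, optionally gzipped) can be checked before starting a run. This is a minimal sketch, not part of sphae: the function name `find_unpaired` is hypothetical, and it only looks at file names, not file contents.

```python
import re

# Matches {sample_name}_R1.fastq, {sample_name}_R2.fastq and their .gz forms.
READ_PATTERN = re.compile(r"^(?P<sample>.+)_R(?P<mate>[12])\.fastq(\.gz)?$")


def find_unpaired(filenames):
    """Map each sample that lacks a mate to the set of mates ("1"/"2") that were found."""
    mates = {}
    for name in filenames:
        match = READ_PATTERN.match(name)
        if match:  # files that do not follow the convention are ignored
            mates.setdefault(match["sample"], set()).add(match["mate"])
    return {sample: found for sample, found in mates.items() if found != {"1", "2"}}


# Illustrative file names only.
names = ["phageA_R1.fastq.gz", "phageA_R2.fastq.gz", "phageB_R1.fastq", "notes.txt"]
print(find_unpaired(names))
```

To check a real directory, pass its listing, for example `find_unpaired(p.name for p in Path("tests/data/illumina-subset").iterdir())` after importing `Path` from `pathlib`.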

**Command to run only annotation steps and phylogenetic trees**

This step reruns:
   - Pharokka, Phold, Phynteny
   - Phylogenetic tree with terminase large subunit, portal protein
   
```bash
#the genomes directory has the already assembled complete genomes
sphae annotate --genome <genomes directory> --output example -k --use-conda --conda-frontend mamba
```

**Output**

Output is saved to the `example/RESULTS` directory, which contains four files:
  - Genome annotations in GenBank format (Phynteny output)
  - Genome in fasta format (either the reoriented to terminase output from Pharokka, or assembled viral contigs)
  - Circular visualization in `png` format (Pharokka output)
  - Genome summary file

The genome summary file includes the following information:
  - Sample name
  - Length of the genome
  - Coding density
  - Whether the assembled contig is circular (from CheckV)
  - Completeness (calculated from CheckV)
  - Contamination (calculated from CheckV)
  - Taxonomy accession ID (Pharokka output, which searches the genome against the INPHARED database using mash)
  - Taxa mash: the number of hashes of the assembled genome that match the accession ID/taxa name. The higher the number of matching hashes, the more likely the genome is related to the predicted taxa
  - Gene searches:
    - Whether an integrase is found (search for integrase gene in annotations)
    - Whether antimicrobial resistance (AMR) genes were found (Phold and Pharokka search against AMR database)
    - Whether any virulence factors were found (Pharokka search against virulence gene database)
    - Whether any CRISPR spacers were found (Pharokka search using MinCED)
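
To compare matching-hash values across samples, it can be easier to turn them into a fraction. The sketch below assumes the value is written in mash's usual `shared/total` form (for example `950/1000`); how exactly the field appears in the summary file is an assumption, and the function name `hash_fraction` is hypothetical.

```python
def hash_fraction(matching_hashes: str) -> float:
    """Convert a mash-style 'shared/total' string into a fraction between 0 and 1."""
    shared, total = (int(part) for part in matching_hashes.strip().split("/"))
    if total <= 0 or not 0 <= shared <= total:
        raise ValueError(f"unexpected matching-hashes value: {matching_hashes!r}")
    return shared / total


# Illustrative values only, not real sphae output.
for value in ("950/1000", "12/1000"):
    print(value, "->", hash_fraction(value))
```

A value close to 1 means most hashes are shared with the reference, while a value close to 0 suggests only a distant relationship.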

### FAQ
1. **"Failed during assembly":**
   - This message indicates that the assembly process was unsuccessful. It suggests that the assembler could not generate contigs, which are contiguous sequences of DNA, typically representing segments of a genome. 
   - To confirm this, you can check the logs located at `sphae.out/PROCESSING/assembly/flye/<sample name>/assembly_info.txt` or `sphae.out/PROCESSING/assembly/megahit/<sample name>/log`. These logs should provide details about the error or the step at which the assembly failed.
   - One possible reason for this failure could be insufficient genome coverage, meaning that there were not enough sequencing reads to accurately assemble the genome.

2. **"Genome includes multiple contigs, fragmented":**
   - This message indicates that the assembly generated numerous short fragments (contigs) instead of a single, contiguous sequence representing a nearly complete phage genome. 
   - You can verify this by examining the file `sphae.out/PROCESSING/assembly/flye/<sample name>-assembly-stats_flye.csv` or `sphae.out/PROCESSING/assembly/megahit/<sample name>-assembly-stats_megahit.csv`.
   - Each row in these tables represents a contig along with its characteristics. If no contig is identified as viral and meets a completeness threshold (e.g., greater than 70% completeness), the assembly consists of fragmented contigs.
   - Fragmented contigs make it challenging to accurately identify genes. To address this issue, you may need to resequence the phages for better coverage or try using different assembly algorithms.

3. **"Good genome coverage but still encountering assembly issues":**
   - If you have adequate genome coverage but still face assembly problems, you may consider adjusting the subsampling step in sphae. This step involves randomly selecting a subset of reads to reduce the computational burden.
   - To modify the subsampling parameters, open the `config/config.yaml` file and update the line under the `subsample` section, for example:
     ```
     subsample:
         --bases 1000M
     ```
   - Increase or decrease the number of bases (e.g., `1000M` for 1000 megabases) based on your requirements.
   - After making the changes, rerun sphae and ensure that the updated subsampling parameters are reflected in the `sphae.out/sphae.config.yaml` file.

4. **"What does 'No integrases found ...but Phynteny predicted a few unknown function genes to have some similarity with integrase genes but with low confidence. Maybe a false positive or a novel integrase gene' mean?"**
   This message indicates that while no integrase genes were explicitly identified, the analysis detected certain genes that exhibited similarities to integrase genes. However, these genes were associated with low confidence scores, suggesting a possibility of being false positives or potentially representing novel integrase genes.
   
   [Phynteny](https://github.com/susiegriggo/Phynteny), the tool used for this prediction, assigns a confidence score to each gene prediction. If this score falls below a certain threshold (typically 90%), the gene remains classified as having an unknown function. To further investigate these genes, advanced techniques such as folding using tools like [ColabFold](https://github.com/sokrypton/ColabFold) and [Foldseek](https://github.com/steineggerlab/foldseek) can be employed. Analyzing the structure of these genes may provide additional insights into their functionality and potential role in biological processes.

5. **How do I visualize the phages and gene annotations?**
   To visualize the phages and gene annotations, I recommend using [Clinker](https://github.com/gamcil/clinker). First, gather all the sample genbank files from `sphae.out/RESULTS` and place them in a new directory. Then, execute the clinker command to generate clinker plots, which compare the genes in each genome to each other.
   
   Additionally, for enhanced visualization, consider running [dnaapler](https://github.com/gbouras13/dnaapler) on the genomes in fasta format obtained from 
   `sphae.out/RESULTS`. This step generates reoriented phages that start with terminase genes. Pharokka -> Phold -> Phynteny has to be rerun, and the resulting genbank files can be used for visualization. To perform the annotation steps, run the command 
   `sphae annotate --input <reoriented genomes from dnaapler in fasta format directory>`
   
   Please note that dnaapler may fail if terminase genes are not found, particularly when working with novel phages, which is why these steps haven't been added to sphae. If you encounter any challenges during this process, please feel free to leave an issue, and I'll provide improved documentation on how to install and run the different commands.

6. **Where are the intermediate files being saved?**
   These files are saved in `sphae.out/PROCESSING`. If you need more information on the file structure here, or have ideas for better organization, leave an issue and I will make a note to add more documentation.

7. **How do I run only annotation on already assembled genomes?**
   
    `sphae annotate --input <input genomes>`
    This command runs only Pharokka, Phold and Phynteny to annotate the assembled genomes. The results are saved to a new directory labeled `sphae.out/annotation`. 

    Note: Currently, Sphae runs Phold in CPU mode, but efforts are underway to support Phold GPU mode for faster processing of this step.

8. **How do I change the number of base pairs to subsample for a sample?**
    Run the command `sphae config`.
    This copies the config file within the workflow to the current directory. Open this file and update the line `bases: 10000000` to, for instance, `bases: 300000`.
    Then run sphae with the command `sphae run --input tests/data/illumina-subset --output example -k --config <path to the config file with the change>`
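
If the same change has to be applied to several config copies, the `bases:` edit can be scripted. This is a minimal sketch that assumes the copied config contains a line of the form `bases: <integer>`, as quoted in the answer above. The function name `set_subsample_bases` is hypothetical, and it works on the file's text rather than parsing the YAML.

```python
import re


def set_subsample_bases(config_text: str, bases: int) -> str:
    """Return config_text with the value on the first `bases:` line replaced."""
    pattern = re.compile(r"^([ \t]*bases:[ \t]*)\d+", flags=re.MULTILINE)
    # A function replacement avoids ambiguity between the group reference and the new digits.
    updated, count = pattern.subn(lambda m: f"{m.group(1)}{bases}", config_text, count=1)
    if count == 0:
        raise ValueError("no 'bases: <number>' line found in the config text")
    return updated


# Illustrative fragment; the real config file contains other keys as well.
example = "subsample:\n  bases: 10000000\n"
print(set_subsample_bases(example, 300000))
```

Write the returned text back to the config copy and pass that file to `--config` as in the command above.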
   
   
## Citation
To cite sphae, doi: https://doi.org/10.1101/2024.11.18.624194

## Issues and Questions

If you come across any issues or errors, report them under [Issues](https://github.com/linsalrob/sphae/issues).

