:Name: orthoflow
:Version: 0.3.4
:Home page: https://github.com/rbturnbull/orthoflow
:Summary: A phylogenomic workflow
:Upload time: 2024-03-14 11:41:06
:Author: Robert Turnbull
:Requires Python: >=3.8,<3.12
:License: Apache-2.0
:Keywords: phylogenomics

======================
Orthoflow
======================

.. image:: https://raw.githubusercontent.com/rbturnbull/orthoflow/master/docs/source/_static/images/orthoflow-banner.svg

.. start-badges

|pipeline badge| |docs badge| |black badge| |snakemake badge| |git3moji badge| |contributor covenant badge| |pypi badge|

.. |pipeline badge| image:: https://github.com/rbturnbull/orthoflow/actions/workflows/testing.yml/badge.svg
    :target: https://github.com/rbturnbull/orthoflow/actions/workflows/testing.yml

.. |docs badge| image:: https://github.com/rbturnbull/orthoflow/actions/workflows/docs.yml/badge.svg
    :target: https://rbturnbull.github.io/orthoflow/
    
.. |black badge| image:: https://img.shields.io/badge/code%20style-black-000000.svg
    :target: https://github.com/psf/black

.. |snakemake badge| image:: https://img.shields.io/badge/snakemake-≥7.0.0-brightgreen.svg?style=flat
    :target: https://snakemake.readthedocs.io

.. |git3moji badge| image:: https://img.shields.io/badge/git3moji-%E2%9A%A1%EF%B8%8F%F0%9F%90%9B%F0%9F%93%BA%F0%9F%91%AE%F0%9F%94%A4-fffad8.svg
    :target: https://robinpokorny.github.io/git3moji/

.. |contributor covenant badge| image:: https://img.shields.io/badge/Contributor%20Covenant-2.1-4baaaa.svg
    :target: CONTRIBUTING.html#code-of-conduct

.. |pypi badge| image:: https://badge.fury.io/py/orthoflow.svg
    :target: https://pypi.org/project/orthoflow/

.. end-badges

Orthoflow is a workflow for phylogenetic inference of genome-scale datasets of protein-coding genes. 
Our goal was to make it straightforward to work from a combination of input sources, including annotated contigs in GenBank format and FASTA files containing CDSs.
It uses several state-of-the-art methods for orthology inference, based either on HMM profiles or on de novo inference of orthogroups.
Through the use of OrthoSNAP, many additional ortholog alignments can be generated from multi-copy gene families.
For phylogenetic inference, users can choose a supermatrix approach and/or gene tree inference followed by supertree reconstruction.
Users can specify a range of alignment filtering settings to retain high-quality alignments for phylogenetic inference.
The workflow produces a detailed report that, in addition to the phylogenetic results, includes a range of diagnostics to verify the quality of the results.


.. image:: docs/source/_static/images/orthoflow-workflow-diagram.svg

Documentation
=============

Detailed documentation can be found at https://rbturnbull.github.io/orthoflow/

=================
Quick start guide
=================

Installation
============

You can install orthoflow with pip:

.. code-block::

    pip install orthoflow

More information about installation is available here: https://rbturnbull.github.io/orthoflow/main/installation.html

.. start-beginner-tutorial

Input data
==========

Orthoflow works from an input CSV file with information about the data sources to be used. Preparing this file is central to setting up your run. The default filename is ``input_sources.csv``.

It needs the columns ``file``, ``taxon_string``, ``data_type`` and ``translation_table``.

- The ``file`` column is the path to the file relative to the working directory.
- The ``taxon_string`` is the name of the taxon from which the data was obtained.
- The ``data_type`` column should be ``GenBank`` when providing a GenBank-formatted file with CDS annotations, or ``CDS`` or ``Protein`` when providing a FASTA file with coding sequences consisting of nucleotides or amino acids respectively.
- The ``translation_table`` column should have the translation table (genetic code) number for the data as given `here <https://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi?mode=c>`_.
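
Before starting a run, it can help to check the input file against the column rules listed above. The following is a minimal sketch using only the Python standard library. The function name ``check_input_sources`` and the exact checks are illustrative assumptions, not part of Orthoflow, which may validate its input differently (for example, it may be more lenient about capitalisation).

.. code-block:: python

    import csv
    import io

    REQUIRED_COLUMNS = ("file", "taxon_string", "data_type", "translation_table")
    ALLOWED_DATA_TYPES = {"GenBank", "CDS", "Protein"}


    def check_input_sources(handle):
        """Return a list of human-readable problems found in an input sources CSV.

        Hypothetical helper for illustration; not an Orthoflow function.
        """
        reader = csv.DictReader(handle)
        missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
        if missing:
            return [f"missing column(s): {', '.join(missing)}"]

        problems = []
        # Row 1 is the header, so data rows are numbered from 2.
        for row_number, row in enumerate(reader, start=2):
            if row["data_type"] not in ALLOWED_DATA_TYPES:
                problems.append(
                    f"row {row_number}: unexpected data_type {row['data_type']!r}"
                )
            if not (row["translation_table"] or "").strip().isdigit():
                problems.append(f"row {row_number}: translation_table is not a number")
        return problems


    example = io.StringIO(
        "file,taxon_string,data_type,translation_table\n"
        "KY509313.gb,Avrainvillea_mazei_HV02664,GenBank,11\n"
        "KX808497.fa,Derbesia_sp_WEST4838,CDS,11\n"
    )
    print(check_input_sources(example))

To check a real file, pass an open file handle instead of the in-memory example, e.g. ``check_input_sources(open("input_sources.csv", newline=""))``.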

Let's look at the demonstration dataset distributed with the code: ``tests/test-data/input_sources.csv``.

=================== ================================== ========== =================
file                taxon_string                       data_type  translation_table
=================== ================================== ========== =================
KY509313.gb         Avrainvillea_mazei_HV02664         GenBank    11
NC_026795.txt       Bryopsis_plumosa_WEST4718          GenBank    11
KX808498.gb         Caulerpa_cliftonii_HV03798         GenBank    11
KY819064.cds.fasta  Chlorodesmis_fastigiata_HV03865    CDS        11
KX808497.fa         Derbesia_sp_WEST4838               CDS        11
MH591079.gb         Dichotomosiphon_tuberosus_HV03781  GenBank    11
MH591080.gbk        Dichotomosiphon_tuberosus_HV03781  GenBank    11
MH591081.gbk        Dichotomosiphon_tuberosus_HV03781  GenBank    11
MH591083.gb         Flabellia_petiolata_HV01202        GenBank    11
MH591084.gb         Flabellia_petiolata_HV01202        GenBank    11
MH591085.gb         Flabellia_petiolata_HV01202        GenBank    11
MH591086.gb         Flabellia_petiolata_HV01202        GenBank    11
=================== ================================== ========== =================

We are using a dataset of algal chloroplast genomes, some as annotated GenBank files (``data_type: GenBank``), some as FASTA files of the coding sequences (``data_type: CDS``). They all use the bacterial genetic code (``translation_table: 11``). Some of the genomes are in a single GenBank file (e.g. ``KY509313.gb`` at the top), while others are fragmented across multiple files (e.g. the last four files, which all belong to the same taxon).

The ``taxon_string`` column is perhaps the most important one, because its values become the names shown in the output tree and determine how input data are grouped (e.g. all CDSs in the final four GenBank files will be grouped into a single taxon). In this case, we have included specimen numbers as part of the taxon string, but that is optional.
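
The grouping by ``taxon_string`` can be previewed with a short script. This is only a sketch of the idea using the standard library. The helper name ``files_by_taxon`` is hypothetical, and the code does not reproduce how Orthoflow groups data internally.

.. code-block:: python

    import csv
    import io
    from collections import defaultdict


    def files_by_taxon(handle):
        """Map each taxon_string to the list of input files that share it."""
        groups = defaultdict(list)
        for row in csv.DictReader(handle):
            groups[row["taxon_string"]].append(row["file"])
        return dict(groups)


    # A few rows copied from the demonstration dataset shown above.
    rows = io.StringIO(
        "file,taxon_string,data_type,translation_table\n"
        "KY509313.gb,Avrainvillea_mazei_HV02664,GenBank,11\n"
        "MH591083.gb,Flabellia_petiolata_HV01202,GenBank,11\n"
        "MH591084.gb,Flabellia_petiolata_HV01202,GenBank,11\n"
        "MH591085.gb,Flabellia_petiolata_HV01202,GenBank,11\n"
        "MH591086.gb,Flabellia_petiolata_HV01202,GenBank,11\n"
    )
    for taxon, files in files_by_taxon(rows).items():
        print(f"{taxon}: {len(files)} file(s)")

A taxon that unexpectedly lists only one file, or two near-identical taxon names, usually points to a typo in the ``taxon_string`` column.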



Simple run
==========

We are using the small demonstration dataset distributed with Orthoflow in the ``tests/test-data`` subdirectory.

Go into the directory containing the ``input_sources.csv`` file and run orthoflow with default settings with these commands:

.. code-block::

    cd tests/test-data
    orthoflow

By default, Orthoflow will extract the CDSs from the input files, run OrthoFinder followed by OrthoSNAP to determine orthologous genes, align them and infer a concatenated tree from the protein sequences. You can follow progress on the screen as the workflow executes and outputs are produced.

Note that the first time you run the workflow, it will be slow because it needs to download and install the software it depends on. This is a one-time cost, and later runs should start much faster.


Examining the output
====================

Inferred tree and intermediate files
------------------------------------
All output files are saved in the ``results`` directory. Output files are subdivided into the workflow modules, which each have their own subdirectory. For the demonstration analysis that we ran above, the inferred phylogeny will be in the ``supermatrix`` subdirectory and be called ``supermatrix.protein.treefile``. Open this with a tree browser (e.g. `FigTree <https://github.com/rambaut/figtree>`_). Also take some time to browse the intermediary files, including the orthogroups, gene alignments and the supermatrix constructed from them.
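
To confirm that the demonstration run produced its tree, the expected location can be checked from Python. This is a minimal sketch that assumes the default ``results`` directory and that the tree file is plain text. The helper name ``demo_tree_path`` is hypothetical and not part of Orthoflow.

.. code-block:: python

    from pathlib import Path


    def demo_tree_path(results_dir="results"):
        """Path where the demonstration run is expected to write its protein tree."""
        return Path(results_dir) / "supermatrix" / "supermatrix.protein.treefile"


    tree_path = demo_tree_path()
    if tree_path.is_file():
        # Show the start of the file as a quick sanity check.
        print(tree_path.read_text().strip()[:200])
    else:
        print(f"{tree_path} not found; has the workflow finished?")
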

Report and diagnostics
----------------------
The report provides an overview of the results, the analysis settings used and citations of the software used to produce the results. The report is saved as ``results/report.cds.html`` and/or ``results/report.protein.html``, depending on the method used to infer the phylogeny.

Output logs
-----------
The output logs of all software used as part of the workflow can be found in the ``logs`` directory.

.. warning::
    Orthoflow creates log files for most of the steps of the workflow. 
    When there are many orthologs, this can generate hundreds of thousands of log and result files.
    On systems that limit the number of files, the workflow may fail.
    You can delete directories of log files after the steps have completed if you no longer need them.
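
To see which steps account for most of the files before deleting anything, the files can be counted per subdirectory. The sketch below uses only the standard library. The helper name ``count_files`` is hypothetical, and the layout of the ``logs`` directory is not assumed beyond it containing subdirectories.

.. code-block:: python

    from pathlib import Path


    def count_files(directory):
        """Count regular files below a directory, grouped by first-level subdirectory."""
        root = Path(directory)
        counts = {}
        if not root.is_dir():
            return counts
        for path in root.rglob("*"):
            if path.is_file():
                relative = path.relative_to(root)
                # Files directly inside the root are reported under ".".
                key = relative.parts[0] if len(relative.parts) > 1 else "."
                counts[key] = counts.get(key, 0) + 1
        return counts


    # Print the largest groups first.
    for name, total in sorted(count_files("logs").items(), key=lambda item: -item[1]):
        print(f"{total:>8}  {name}")

The same function can be pointed at ``results`` to check the number of result files.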

.. end-beginner-tutorial


Credits and Attribution
========================

.. start-credits

Orthoflow was created by Robert Turnbull, Jacob Steenwyk, Simon Mutch, Vinícius Salazar, Pelle Scholten, Joanne L. Birch and Heroen Verbruggen.

The preprint for Orthoflow is here:

    Robert Turnbull, Jacob L. Steenwyk, Simon J. Mutch, Pelle Scholten, Vinícius W. Salazar, Joanne L. Birch, and Heroen Verbruggen. Orthoflow: phylogenomic analysis and diagnostics with one command, 04 December 2023, PREPRINT available at Research Square [https://doi.org/10.21203/rs.3.rs-3699210/]

More details to come.

.. end-credits
            
