diverse-seq


- Name: diverse-seq
- Version: 2025.9.11
- Summary: diverse_seq: a tool for sampling diverse biological sequences
- Upload time: 2025-09-10 22:52:40
- Requires Python: <3.14,>=3.10
- Keywords: biology, genomics, statistics, phylogeny, evolution, bioinformatics

![PyPI - Python Version](https://img.shields.io/pypi/pyversions/diverse-seq)
[![CI](https://github.com/HuttleyLab/DiverseSeq/actions/workflows/ci.yml/badge.svg)](https://github.com/HuttleyLab/DiverseSeq/actions/workflows/ci.yml)
[![Coverage Status](https://coveralls.io/repos/github/HuttleyLab/DiverseSeq/badge.svg?branch=main)](https://coveralls.io/github/HuttleyLab/DiverseSeq?branch=main)
[![Codacy Badge](https://app.codacy.com/project/badge/Grade/ef3010ea162f47a2a5a44e0f3f6ed1f0)](https://app.codacy.com/gh/HuttleyLab/DiverseSeq/dashboard?utm_source=gh&utm_medium=referral&utm_content=&utm_campaign=Badge_grade)
[![CodeQL](https://github.com/HuttleyLab/DiverseSeq/actions/workflows/codeql.yml/badge.svg)](https://github.com/HuttleyLab/DiverseSeq/actions/workflows/codeql.yml)
[![Ruff](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/astral-sh/ruff/main/assets/badge/v2.json)](https://github.com/astral-sh/ruff)
[![DOI](https://joss.theoj.org/papers/10.21105/joss.07765/status.svg)](https://doi.org/10.21105/joss.07765)

# `diverse-seq` provides alignment-free algorithms to facilitate phylogenetic workflows

`diverse-seq` implements computationally efficient alignment-free algorithms that enable rapid prototyping of phylogenetic workflows. It can accelerate parameter selection searches for sequence alignment and phylogeny estimation by identifying a subset of sequences that is representative of the diversity in a collection. We show that selecting representative sequences with an entropy measure of *k*-mer frequencies corresponds well to sampling via conventional genetic distances. The computational cost scales linearly with the number of sequences, and the computation can be run in parallel. Applied to a collection of 10.5k whole microbial genomes on a laptop, it took ~8 minutes to prepare the data and 4 minutes to select 100 representatives. `diverse-seq` can further boost the performance of phylogenetic estimation by providing a seed phylogeny that can be refined by a more sophisticated algorithm. For ~1k whole microbial genomes on a laptop, it takes ~1.8 minutes to estimate a bifurcating tree from mash distances.
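To make the idea of an entropy measure on *k*-mer frequencies concrete, here is a minimal pure-Python sketch of the Jensen-Shannon divergence (JSD) among a set of sequences. It is an illustration only and not the `diverse-seq` implementation: the function names `kmer_freqs`, `shannon_entropy` and `jsd` are hypothetical, and equal weighting of the sequences is an assumption.

```python
import math
from collections import Counter


def kmer_freqs(seq: str, k: int) -> dict[str, float]:
    """Relative frequency of each overlapping k-mer in seq (hypothetical helper)."""
    counts = Counter(seq[i : i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()}


def shannon_entropy(freqs: dict[str, float]) -> float:
    """Shannon entropy, in bits, of a frequency distribution."""
    return -sum(p * math.log2(p) for p in freqs.values() if p > 0)


def jsd(freq_list: list[dict[str, float]]) -> float:
    """JSD among distributions, assuming every sequence gets equal weight.

    JSD = entropy of the mean distribution - mean of the individual entropies.
    """
    n = len(freq_list)
    mean: dict[str, float] = {}
    for freqs in freq_list:
        for kmer, p in freqs.items():
            mean[kmer] = mean.get(kmer, 0.0) + p / n
    mean_entropy = sum(shannon_entropy(f) for f in freq_list) / n
    return shannon_entropy(mean) - mean_entropy
```

For example, `jsd([kmer_freqs(s, 2) for s in ("AAAA", "CCCC")])` compares two sequences that share no 2-mers. A set of sequences with a larger JSD is more diverse under this measure.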

You can read more about the methods implemented in `diverse-seq` in the preprint [here](https://biorxiv.org/cgi/content/short/2024.11.10.622877v1).

The user documentation [is here](https://diverse-seq.readthedocs.io).

### Installation

We recommend installing `diverse-seq` from PyPI with the `extra` option, which provides the full Jupyter experience:

```
pip install "diverse-seq[extra]"
```

For command-line-only usage, install with:

```
pip install diverse-seq
```

> **NOTE**
> If you experience any errors during installation, we recommend using [uv pip](https://docs.astral.sh/uv/). This command provides much better error messages than the standard `pip` command. If you cannot resolve the installation problem, please open an issue on the [GitHub repository](https://github.com/HuttleyLab/DiverseSeq/issues).

#### Using `uv`

`uv` also provides a simpler way to install `dvs` as a command-line-only tool:

```
uv tool install diverse-seq
```

In this case, run the tool with:

```
uvx --from diverse-seq dvs
```

#### Dependencies

For a full listing of dependencies, see the [pyproject.toml](./pyproject.toml) file.

### The command line interface

`dvs` is the command line interface for `diverse-seq`.

<details>
    <summary>The `dvs` subcommands</summary>

<!-- [[[cog
import cog
from diverse_seq.cli import main
from click.testing import CliRunner
runner = CliRunner()
result = runner.invoke(main, [])
help = result.output.replace("Usage: main", "Usage: dvs")
cog.out(
    "```\n{}\n```".format(help)
)
]]] -->
```
Usage: dvs [OPTIONS] COMMAND [ARGS]...

  dvs -- alignment free detection of the most diverse sequences using JSD

Options:
  --version  Show the version and exit.
  --help     Show this message and exit.

Commands:
  demo-data  Export a demo sequence file
  prep       Writes processed sequences to a <HDF5 file>.dvseqs.
  max        Identify the seqs that maximise average delta JSD
  nmost      Identify n seqs that maximise average delta JSD
  ctree      Quickly compute a cluster tree based on kmers for a collection...

```
<!-- [[[end]]] -->

</details>
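The introduction mentions estimating a tree from mash distances, and the `ctree` subcommand builds a cluster tree from *k*-mers. As a rough illustration of the distance involved, the sketch below computes a mash-style distance from the exact Jaccard index of two *k*-mer sets. It is not the `diverse-seq` implementation: the function names are hypothetical, exact *k*-mer sets stand in for the MinHash sketches that Mash uses to estimate the Jaccard index, and capping the result at 1 is a choice made for this example.

```python
import math


def kmer_set(seq: str, k: int) -> set[str]:
    """All distinct overlapping k-mers in seq (hypothetical helper)."""
    return {seq[i : i + k] for i in range(len(seq) - k + 1)}


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard index: shared k-mers divided by all k-mers in either set."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def mash_distance(seq1: str, seq2: str, k: int) -> float:
    """Mash-style distance D = -ln(2j / (1 + j)) / k, where j is the Jaccard index."""
    j = jaccard(kmer_set(seq1, k), kmer_set(seq2, k))
    if j == 0:
        return 1.0  # no shared k-mers: use the maximum distance (assumption)
    d = -math.log(2 * j / (1 + j)) / k
    # Clamp to [0, 1]; this also turns -0.0 into 0.0 for identical sets.
    return min(1.0, max(0.0, d))
```

A matrix of such pairwise distances is the kind of input a clustering method could turn into a bifurcating tree.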

### The Python API

We make comparable capabilities available as [cogent3 apps](https://cogent3.org/doc/app/index.html). The main difference is that the app instances operate directly on, and return, `cogent3` sequence collections. See [the docs](https://diverse-seq.readthedocs.io/en/latest/apps/) for demonstrations of how to use the apps.

## Project Information

`diverse-seq` is released under the BSD-3 license. If you want to contribute to the `diverse-seq` project (and we hope you do! :innocent:), the code of conduct and other useful developer information are available on the [wiki](https://github.com/HuttleyLab/DiverseSeq/wiki).

            

Raw data

            {
    "_id": null,
    "home_page": null,
    "name": "diverse-seq",
    "maintainer": null,
    "docs_url": null,
    "requires_python": "<3.14,>=3.10",
    "maintainer_email": null,
    "keywords": "biology, genomics, statistics, phylogeny, evolution, bioinformatics",
    "author": null,
    "author_email": "Gavin Huttley <Gavin.Huttley@anu.edu.au>",
    "download_url": "https://files.pythonhosted.org/packages/77/2a/f0a1868ae512b67b184281d60ff6e000135bf2cb39c39921cad0ce62baf8/diverse_seq-2025.9.11.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": null,
    "summary": "diverse_seq: a tool for sampling diverse biological sequences",
    "version": "2025.9.11",
    "project_urls": {
        "Bug Tracker": "https://github.com/HuttleyLab/DiverseSeq/issues",
        "Documentation": "https://diverse-seq.readthedocs.io/",
        "Source Code": "https://github.com/HuttleyLab/DiverseSeq/"
    },
    "split_keywords": [
        "biology",
        " genomics",
        " statistics",
        " phylogeny",
        " evolution",
        " bioinformatics"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "556fdc4bff1563a7cc437885650124b0fb02ee49b604cf3fa8863e5e2a1ccdee",
                "md5": "5f323c63b91a2af912a8b766612914cb",
                "sha256": "f08eef86931fc51fcae2c18fd6f0faf7fc0e0ed5dadd39bc8006ce649fbdeaea"
            },
            "downloads": -1,
            "filename": "diverse_seq-2025.9.11-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "5f323c63b91a2af912a8b766612914cb",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": "<3.14,>=3.10",
            "size": 66106,
            "upload_time": "2025-09-10T22:52:38",
            "upload_time_iso_8601": "2025-09-10T22:52:38.839788Z",
            "url": "https://files.pythonhosted.org/packages/55/6f/dc4bff1563a7cc437885650124b0fb02ee49b604cf3fa8863e5e2a1ccdee/diverse_seq-2025.9.11-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "772af0a1868ae512b67b184281d60ff6e000135bf2cb39c39921cad0ce62baf8",
                "md5": "59e95bbaf0f49f4acf3b4a4dbbc16068",
                "sha256": "56b25e514adeb5e81c05912468059e2c0900a3f7dfd4181171cdd114d0d48cb2"
            },
            "downloads": -1,
            "filename": "diverse_seq-2025.9.11.tar.gz",
            "has_sig": false,
            "md5_digest": "59e95bbaf0f49f4acf3b4a4dbbc16068",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": "<3.14,>=3.10",
            "size": 179849,
            "upload_time": "2025-09-10T22:52:40",
            "upload_time_iso_8601": "2025-09-10T22:52:40.673750Z",
            "url": "https://files.pythonhosted.org/packages/77/2a/f0a1868ae512b67b184281d60ff6e000135bf2cb39c39921cad0ce62baf8/diverse_seq-2025.9.11.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-09-10 22:52:40",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "HuttleyLab",
    "github_project": "DiverseSeq",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "lcname": "diverse-seq"
}
        