
[](https://doi.org/10.21105/joss.07765)
# `diverse-seq` provides alignment-free algorithms to facilitate phylogenetic workflows
`diverse-seq` implements computationally efficient alignment-free algorithms that enable rapid prototyping of phylogenetic workflows. It can accelerate parameter selection searches for sequence alignment and phylogeny estimation by identifying a subset of sequences that is representative of the diversity in a collection. We show that selecting representative sequences with an entropy measure of *k*-mer frequencies corresponds well to sampling via conventional genetic distances. The computational performance is linear with respect to the number of sequences, and the algorithms can be run in parallel. Applied to a collection of 10.5k whole microbial genomes on a laptop, `diverse-seq` took ~8 minutes to prepare the data and 4 minutes to select 100 representatives. `diverse-seq` can also speed up phylogenetic estimation by providing a seed phylogeny that a more sophisticated algorithm can then refine. For ~1k whole microbial genomes on a laptop, it takes ~1.8 minutes to estimate a bifurcating tree from mash distances.
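To make the "entropy measure of *k*-mer frequencies" more concrete, here is a minimal pure-Python sketch of the Jensen-Shannon divergence (JSD) computed over the *k*-mer frequency distributions of a set of sequences. It is an illustration only and not the `diverse-seq` implementation: the function names (`kmer_freqs`, `entropy`, `jsd`), the equal weighting of sequences, and the use of overlapping *k*-mers are all assumptions made for this example.

```python
from collections import Counter
from math import log2


def kmer_freqs(seq: str, k: int) -> dict[str, float]:
    """Relative frequencies of the overlapping k-mers in seq (hypothetical helper)."""
    counts = Counter(seq[i : i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()}


def entropy(freqs: dict[str, float]) -> float:
    """Shannon entropy, in bits, of a frequency distribution."""
    return -sum(p * log2(p) for p in freqs.values() if p > 0)


def jsd(seqs: list[str], k: int = 2) -> float:
    """JSD of equally weighted k-mer distributions:
    entropy of the mean distribution minus the mean of the entropies."""
    dists = [kmer_freqs(s, k) for s in seqs]
    n = len(dists)
    mean: dict[str, float] = {}
    for dist in dists:
        for kmer, p in dist.items():
            mean[kmer] = mean.get(kmer, 0.0) + p / n
    return entropy(mean) - sum(entropy(d) for d in dists) / n


identical = jsd(["ACGT", "ACGT"], k=1)  # same composition, so no divergence
disjoint = jsd(["AAAA", "CCCC"], k=1)  # no shared k-mers, so 1 bit for two sequences
```

Under these assumptions, a set of sequences with more distinct *k*-mer compositions scores a higher JSD, which is the intuition behind choosing a subset that maximises it.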
You can read more about the methods implemented in `diverse-seq` in the preprint [here](https://biorxiv.org/cgi/content/short/2024.11.10.622877v1).
The user documentation [is here](https://diverse-seq.readthedocs.io).
### Installation
We recommend installing `diverse-seq` from PyPI as follows
```
pip install "diverse-seq[extra]"
```
for the full Jupyter experience.
For command-line-only usage, install as follows
```
pip install diverse-seq
```
> **NOTE**
> If you experience any errors during installation, we recommend using [uv pip](https://docs.astral.sh/uv/). This command provides much better error messages than the standard `pip` command. If you cannot resolve the installation problem, please open an issue on the [GitHub repository](https://github.com/HuttleyLab/DiverseSeq/issues).
#### Using `uv`
`uv` also provides a simpler way to install `dvs` as a command-line-only tool:
```
uv tool install diverse-seq
```
Usage in this case is then
```
uvx --from diverse-seq dvs
```
#### Dependencies
For a full listing of dependencies, see the [pyproject.toml](./pyproject.toml) file.
### The command line interface
`dvs` is the command line interface for `diverse-seq`.
<details>
<summary>The `dvs` subcommands</summary>
<!-- [[[cog
import cog
from diverse_seq.cli import main
from click.testing import CliRunner
runner = CliRunner()
result = runner.invoke(main, [])
help = result.output.replace("Usage: main", "Usage: dvs")
cog.out(
"```\n{}\n```".format(help)
)
]]] -->
```
Usage: dvs [OPTIONS] COMMAND [ARGS]...

  dvs -- alignment free detection of the most diverse sequences using JSD

Options:
  --version  Show the version and exit.
  --help     Show this message and exit.

Commands:
  demo-data  Export a demo sequence file
  prep       Writes processed sequences to a <HDF5 file>.dvseqs.
  max        Identify the seqs that maximise average delta JSD
  nmost      Identify n seqs that maximise average delta JSD
  ctree      Quickly compute a cluster tree based on kmers for a collection...
```
<!-- [[[end]]] -->
</details>
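The project summary mentions estimating a tree from mash distances. As a rough illustration of that kind of distance, the sketch below applies the published Mash formula, D = -ln(2j / (1 + j)) / k, to the exact Jaccard index j of two *k*-mer sets. It is not how `dvs ctree` is implemented: Mash estimates j from MinHash sketches, and the names `kmer_set` and `mash_distance` are hypothetical helpers written for this example.

```python
from math import log


def kmer_set(seq: str, k: int) -> set[str]:
    """Set of the overlapping k-mers in seq (hypothetical helper)."""
    return {seq[i : i + k] for i in range(len(seq) - k + 1)}


def mash_distance(a: str, b: str, k: int) -> float:
    """Mash-style distance from the exact Jaccard index of two k-mer sets.

    Assumption: the exact Jaccard index stands in for the MinHash
    estimate that Mash itself would use.
    """
    kmers_a, kmers_b = kmer_set(a, k), kmer_set(b, k)
    union = kmers_a | kmers_b
    if not union:
        raise ValueError("both sequences are shorter than k")
    j = len(kmers_a & kmers_b) / len(union)
    if j == 0:
        return 1.0  # no shared k-mers: distance is capped at 1
    return max(0.0, -log(2 * j / (1 + j)) / k)


same = mash_distance("ACGTACGT", "ACGTACGT", k=3)
partial = mash_distance("ACGT", "ACGA", k=2)  # 2 of 4 distinct k-mers shared
none_shared = mash_distance("AAAA", "CCCC", k=2)
```

A matrix of such pairwise distances could then be passed to any distance-based clustering method to obtain a tree.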
### The Python API
We make comparable capabilities available as [cogent3 apps](https://cogent3.org/doc/app/index.html). The main difference is the app instances directly operate on, and return, `cogent3` sequence collections. See [the docs](https://diverse-seq.readthedocs.io/en/latest/apps/) for demonstrations of how to use the apps.
## Project Information
`diverse-seq` is released under the BSD-3 license. If you want to contribute to the `diverse-seq` project (and we hope you do! :innocent:), the code of conduct and other useful developer information are available on the [wiki](https://github.com/HuttleyLab/DiverseSeq/wiki).
Raw data
{
"_id": null,
"home_page": null,
"name": "diverse-seq",
"maintainer": null,
"docs_url": null,
"requires_python": "<3.14,>=3.10",
"maintainer_email": null,
"keywords": "biology, genomics, statistics, phylogeny, evolution, bioinformatics",
"author": null,
"author_email": "Gavin Huttley <Gavin.Huttley@anu.edu.au>",
"download_url": "https://files.pythonhosted.org/packages/77/2a/f0a1868ae512b67b184281d60ff6e000135bf2cb39c39921cad0ce62baf8/diverse_seq-2025.9.11.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": null,
"summary": "diverse_seq: a tool for sampling diverse biological sequences",
"version": "2025.9.11",
"project_urls": {
"Bug Tracker": "https://github.com/HuttleyLab/DiverseSeq/issues",
"Documentation": "https://diverse-seq.readthedocs.io/",
"Source Code": "https://github.com/HuttleyLab/DiverseSeq/"
},
"split_keywords": [
"biology",
" genomics",
" statistics",
" phylogeny",
" evolution",
" bioinformatics"
],
"urls": [
{
"comment_text": null,
"digests": {
"blake2b_256": "556fdc4bff1563a7cc437885650124b0fb02ee49b604cf3fa8863e5e2a1ccdee",
"md5": "5f323c63b91a2af912a8b766612914cb",
"sha256": "f08eef86931fc51fcae2c18fd6f0faf7fc0e0ed5dadd39bc8006ce649fbdeaea"
},
"downloads": -1,
"filename": "diverse_seq-2025.9.11-py3-none-any.whl",
"has_sig": false,
"md5_digest": "5f323c63b91a2af912a8b766612914cb",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": "<3.14,>=3.10",
"size": 66106,
"upload_time": "2025-09-10T22:52:38",
"upload_time_iso_8601": "2025-09-10T22:52:38.839788Z",
"url": "https://files.pythonhosted.org/packages/55/6f/dc4bff1563a7cc437885650124b0fb02ee49b604cf3fa8863e5e2a1ccdee/diverse_seq-2025.9.11-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "772af0a1868ae512b67b184281d60ff6e000135bf2cb39c39921cad0ce62baf8",
"md5": "59e95bbaf0f49f4acf3b4a4dbbc16068",
"sha256": "56b25e514adeb5e81c05912468059e2c0900a3f7dfd4181171cdd114d0d48cb2"
},
"downloads": -1,
"filename": "diverse_seq-2025.9.11.tar.gz",
"has_sig": false,
"md5_digest": "59e95bbaf0f49f4acf3b4a4dbbc16068",
"packagetype": "sdist",
"python_version": "source",
"requires_python": "<3.14,>=3.10",
"size": 179849,
"upload_time": "2025-09-10T22:52:40",
"upload_time_iso_8601": "2025-09-10T22:52:40.673750Z",
"url": "https://files.pythonhosted.org/packages/77/2a/f0a1868ae512b67b184281d60ff6e000135bf2cb39c39921cad0ce62baf8/diverse_seq-2025.9.11.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2025-09-10 22:52:40",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "HuttleyLab",
"github_project": "DiverseSeq",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"lcname": "diverse-seq"
}