[![CI](https://github.com/HuttleyLab/DiverseSeq/actions/workflows/ci.yml/badge.svg)](https://github.com/HuttleyLab/DiverseSeq/actions/workflows/ci.yml)
[![Coverage Status](https://coveralls.io/repos/github/HuttleyLab/DiverseSeq/badge.svg?branch=main)](https://coveralls.io/github/HuttleyLab/DiverseSeq?branch=main)
[![Codacy Badge](https://app.codacy.com/project/badge/Grade/ef3010ea162f47a2a5a44e0f3f6ed1f0)](https://app.codacy.com/gh/HuttleyLab/DiverseSeq/dashboard?utm_source=gh&utm_medium=referral&utm_content=&utm_campaign=Badge_grade)
[![CodeQL](https://github.com/HuttleyLab/DiverseSeq/actions/workflows/codeql.yml/badge.svg)](https://github.com/HuttleyLab/DiverseSeq/actions/workflows/codeql.yml)
[![Ruff](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/astral-sh/ruff/main/assets/badge/v2.json)](https://github.com/astral-sh/ruff)
# `diverse_seq` provides alignment-free algorithms to facilitate phylogenetic workflows
`diverse-seq` implements computationally efficient alignment-free algorithms that enable rapid prototyping of phylogenetic workflows. It can accelerate parameter selection searches for sequence alignment and phylogeny estimation by identifying a subset of sequences that is representative of the diversity in a collection. We show that selecting representative sequences with an entropy measure of *k*-mer frequencies corresponds well to sampling via conventional genetic distances. The computational performance is linear with respect to the number of sequences, and the analysis can be run in parallel. Applied to a collection of 10.5k whole microbial genomes on a laptop, it took ~8 minutes to prepare the data and 4 minutes to select 100 representatives. `diverse-seq` can further boost the performance of phylogenetic estimation by providing a seed phylogeny that can then be refined by a more sophisticated algorithm. For ~1k whole microbial genomes on a laptop, it takes ~1.8 minutes to estimate a bifurcating tree from mash distances.
You can read more about the methods implemented in `diverse_seq` in the [preprint](https://biorxiv.org/cgi/content/short/2024.11.10.622877v1).
### `dvs prep`: preparing the sequence data
Convert sequence data into a more efficient format for the diversity assessment. This must be done before running the `nmost`, `max` or `ctree` commands, each of which takes a `.dvseqs` file as input.
<details>
<summary>CLI options for dvs prep</summary>
<!-- [[[cog
import cog
from diverse_seq.cli import main
from click.testing import CliRunner
runner = CliRunner()
result = runner.invoke(main, ["prep", "--help"])
help = result.output.replace("Usage: main", "Usage: dvs")
cog.out(
"```\n{}\n```".format(help)
)
]]] -->
```
Usage: dvs prep [OPTIONS]

  Writes processed sequences to a <HDF5 file>.dvseqs.

Options:
  -s, --seqdir PATH        directory containing sequence files  [required]
  -sf, --suffix TEXT       sequence file suffix  [default: fa]
  -o, --outpath PATH       write processed seqs to this filename  [required]
  -np, --numprocs INTEGER  number of processes  [default: 1]
  -F, --force_overwrite    Overwrite existing file if it exists
  -m, --moltype [dna|rna]  Molecular type of sequences  [default: dna]
  -L, --limit INTEGER      number of sequences to process
  -hp, --hide_progress     hide progress bars
  --help                   Show this message and exit.
```
<!-- [[[end]]] -->
</details>
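
As a minimal sketch of how a `dvs prep` call could be assembled from a Python script, the snippet below builds the argument list using only flags that appear in the help output above. The directory name `genomes`, the output name `genomes.dvseqs` and the helper `build_prep_command` are hypothetical and used purely for illustration.

```python
import shlex
from pathlib import Path


def build_prep_command(seqdir, outpath, suffix="fa", numprocs=1, moltype="dna"):
    """Return the argument list for a `dvs prep` call.

    Hypothetical helper: it only uses flags listed by `dvs prep --help`.
    """
    return [
        "dvs", "prep",
        "--seqdir", str(Path(seqdir)),
        "--suffix", suffix,
        "--outpath", str(Path(outpath)),
        "--numprocs", str(numprocs),
        "--moltype", moltype,
    ]


# "genomes" and "genomes.dvseqs" are placeholder names
cmd = build_prep_command("genomes", "genomes.dvseqs", numprocs=4)
print(shlex.join(cmd))
# To execute it (requires diverse-seq to be installed):
#   import subprocess; subprocess.run(cmd, check=True)
```

Building the command as a list, rather than as a single string, avoids shell quoting problems when paths contain spaces.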
### `dvs nmost`: select the n-most diverse sequences
Selects the n sequences that maximise the total Jensen-Shannon divergence (JSD). We recommend using `nmost` for large datasets.
> **Note**
> A fuller explanation is coming soon!
<details>
<summary>Options for command line dvs nmost</summary>
<!-- [[[cog
import cog
from diverse_seq.cli import main
from click.testing import CliRunner
runner = CliRunner()
result = runner.invoke(main, ["nmost", "--help"])
help = result.output.replace("Usage: main", "Usage: dvs")
cog.out(
"```\n{}\n```".format(help)
)
]]] -->
```
Usage: dvs nmost [OPTIONS]

  Identify n seqs that maximise average delta JSD

Options:
  -s, --seqfile PATH       path to .dvseqs file  [required]
  -o, --outpath PATH       the input string will be cast to Path instance
  -n, --number INTEGER     number of seqs in divergent set  [required]
  -k INTEGER               k-mer size  [default: 6]
  -i, --include TEXT       seqnames to include in divergent set
  -np, --numprocs INTEGER  number of processes  [default: 1]
  -L, --limit INTEGER      number of sequences to process
  -v, --verbose            is an integer indicating number of cl occurrences
                           [default: 0]
  -hp, --hide_progress     hide progress bars
  --help                   Show this message and exit.
```
<!-- [[[end]]] -->
</details>
<details>
<summary>Options for cogent3 app dvs_nmost</summary>
The `dvs nmost` command is also available as the [cogent3 app](https://cogent3.org/doc/app/index.html) `dvs_nmost`. The result of using `cogent3.app_help("dvs_nmost")` is shown below.
<!-- [[[cog
import cog
import contextlib
import io
from cogent3 import app_help
buffer = io.StringIO()
with contextlib.redirect_stdout(buffer):
    app_help("dvs_nmost")
cog.out(
"```\n{}\n```".format(buffer.getvalue())
)
]]] -->
```
Overview
--------
select the n-most diverse seqs from a sequence collection
Options for making the app
--------------------------
dvs_nmost_app = get_app(
'dvs_nmost',
n=10,
moltype='dna',
include=None,
k=6,
seed=None,
)
Parameters
----------
n
the number of divergent sequences
moltype
molecular type of the sequences
k
k-mer size
include
sequence names to include in the final result
seed
random number seed
Notes
-----
If called with an alignment, the ungapped sequences are used.
The order of the sequences is randomised. If include is not None, the
named sequences are added to the final result.
Input type
----------
ArrayAlignment, SequenceCollection, Alignment
Output type
-----------
ArrayAlignment, SequenceCollection, Alignment
```
<!-- [[[end]]] -->
</details>
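
To make the JSD measure more concrete, here is a small pure-Python sketch that counts *k*-mers and computes an equal-weight Jensen-Shannon divergence between the resulting frequency distributions. It is illustrative only: the function names and toy sequences are assumptions, and it is not the implementation used by `diverse-seq`.

```python
import math
from collections import Counter


def kmer_freqs(seq, k):
    """Relative frequencies of the overlapping k-mers in seq."""
    counts = Counter(seq[i : i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()}


def entropy(freqs):
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in freqs.values() if p > 0)


def jsd(freq_list):
    """Equal-weight JSD: entropy of the mean minus the mean of the entropies."""
    n = len(freq_list)
    mean = Counter()
    for freqs in freq_list:
        for kmer, p in freqs.items():
            mean[kmer] += p / n
    return entropy(mean) - sum(entropy(f) for f in freq_list) / n


# toy sequences, not real data
seqs = {"a": "ACGTACGTAC", "b": "ACGTACGTAC", "c": "GGGGGCCCCC"}
freqs = {name: kmer_freqs(s, k=2) for name, s in seqs.items()}
print(jsd([freqs["a"], freqs["b"]]))  # identical sequences share all k-mers
print(jsd([freqs["a"], freqs["c"]]))  # these two share no 2-mers
```

Identical sequences give a JSD of zero, while sequences with no *k*-mers in common give the maximum of one bit for a pair, which is why a set with a large JSD is a diverse set.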
### `dvs max`: maximise variance in the selected sequences
The result of the `max` command is typically a set that is modestly more diverse than that from `nmost`.
> **Note**
> A fuller explanation is coming soon!
<details>
<summary>Options for command line dvs max</summary>
<!-- [[[cog
import cog
from diverse_seq.cli import main
from click.testing import CliRunner
runner = CliRunner()
result = runner.invoke(main, ["max", "--help"])
help = result.output.replace("Usage: main", "Usage: dvs")
cog.out(
"```\n{}\n```".format(help)
)
]]] -->
```
Usage: dvs max [OPTIONS]

  Identify the seqs that maximise average delta JSD

Options:
  -s, --seqfile PATH       path to .dvseqs file  [required]
  -o, --outpath PATH       the input string will be cast to Path instance
  -z, --min_size INTEGER   minimum size of divergent set  [default: 7]
  -zp, --max_size INTEGER  maximum size of divergent set
  -k INTEGER               k-mer size  [default: 6]
  -st, --stat [stdev|cov]  statistic to maximise  [default: stdev]
  -i, --include TEXT       seqnames to include in divergent set
  -np, --numprocs INTEGER  number of processes  [default: 1]
  -L, --limit INTEGER      number of sequences to process
  -T, --test_run           reduce number of paths and size of query seqs
  -v, --verbose            is an integer indicating number of cl occurrences
                           [default: 0]
  -hp, --hide_progress     hide progress bars
  --help                   Show this message and exit.
```
<!-- [[[end]]] -->
</details>
<details>
<summary>Options for cogent3 app dvs_max</summary>
The `dvs max` command is also available as the [cogent3 app](https://cogent3.org/doc/app/index.html) `dvs_max`.
<!-- [[[cog
import cog
import contextlib
import io
from cogent3 import app_help
buffer = io.StringIO()
with contextlib.redirect_stdout(buffer):
    app_help("dvs_max")
cog.out(
"```\n{}\n```".format(buffer.getvalue())
)
]]] -->
```
Overview
--------
select the maximally divergent seqs from a sequence collection
Options for making the app
--------------------------
dvs_max_app = get_app(
'dvs_max',
min_size=5,
max_size=30,
stat='stdev',
moltype='dna',
include=None,
k=6,
seed=None,
)
Parameters
----------
min_size
minimum size of the divergent set
max_size
the maximum size of the divergent set
stat
either stdev or cov, which represent the statistics
std(delta_jsd) and cov(delta_jsd) respectively
moltype
molecular type of the sequences
include
sequence names to include in the final result
k
k-mer size
seed
random number seed
Notes
-----
If called with an alignment, the ungapped sequences are used.
The order of the sequences is randomised. If include is not None, the
named sequences are added to the final result.
Input type
----------
ArrayAlignment, SequenceCollection, Alignment
Output type
-----------
ArrayAlignment, SequenceCollection, Alignment
```
<!-- [[[end]]] -->
</details>
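
The `stat` option refers to `std(delta_jsd)` and `cov(delta_jsd)`. The sketch below shows one way such statistics could be computed on toy data. It rests on two assumptions that are not confirmed by this README: that a sequence's delta JSD is the drop in total JSD when that sequence is removed from the set, and that `cov` denotes the coefficient of variation (standard deviation divided by mean). All function names are hypothetical.

```python
import math
import statistics
from collections import Counter


def kmer_freqs(seq, k):
    """Relative frequencies of the overlapping k-mers in seq."""
    counts = Counter(seq[i : i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()}


def entropy(freqs):
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in freqs.values() if p > 0)


def jsd(freq_list):
    """Equal-weight Jensen-Shannon divergence of several distributions."""
    n = len(freq_list)
    mean = Counter()
    for freqs in freq_list:
        for kmer, p in freqs.items():
            mean[kmer] += p / n
    return entropy(mean) - sum(entropy(f) for f in freq_list) / n


def delta_jsds(freq_list):
    """Assumed definition: JSD of the full set minus JSD without each member."""
    total = jsd(freq_list)
    return [
        total - jsd(freq_list[:i] + freq_list[i + 1 :])
        for i in range(len(freq_list))
    ]


def summary_stat(deltas, stat="stdev"):
    """Sample standard deviation, or (assumed) coefficient of variation."""
    sd = statistics.stdev(deltas)
    if stat == "stdev":
        return sd
    if stat == "cov":
        return sd / statistics.mean(deltas)
    raise ValueError(f"unknown stat {stat!r}")


# toy sequences, not real data
freq_list = [kmer_freqs(s, k=2) for s in ("ACGTACGTAC", "ACGTACGTAC", "GGGGGCCCCC")]
deltas = delta_jsds(freq_list)
print(deltas)
print(summary_stat(deltas, "stdev"), summary_stat(deltas, "cov"))
```

In this toy set the third sequence contributes far more to the total JSD than either of the two identical ones, which is the kind of spread the two statistics summarise.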
### `dvs ctree`: build a phylogeny using k-mers
The result of the `ctree` command is a Newick-formatted tree string without distances.
> **Note**
> A fuller explanation is coming soon!
<details>
<summary>Options for command line dvs ctree</summary>
<!-- [[[cog
import cog
from diverse_seq.cli import main
from click.testing import CliRunner
runner = CliRunner()
result = runner.invoke(main, ["ctree", "--help"])
help = result.output.replace("Usage: main", "Usage: dvs")
cog.out(
"```\n{}\n```".format(help)
)
]]] -->
```
Usage: dvs ctree [OPTIONS]

  Quickly compute a cluster tree based on kmers for a collection of sequences.

Options:
  -s, --seqfile PATH       path to .dvseqs file  [required]
  -o, --outpath PATH       the input string will be cast to Path instance
  -m, --moltype [dna|rna]  Molecular type of sequences  [default: dna]
  -k INTEGER               k-mer size  [default: 6]
  --sketch-size INTEGER    sketch size for mash distance
  -d, --distance [mash|euclidean]
                           distance measure for tree construction
                           [default: mash]
  -c, --canonical-kmers    consider kmers identical to their reverse
                           complement
  -L, --limit INTEGER      number of sequences to process
  -np, --numprocs INTEGER  number of processes  [default: 1]
  -hp, --hide_progress     hide progress bars
  --help                   Show this message and exit.
```
<!-- [[[end]]] -->
</details>
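
The `--canonical-kmers` flag treats a *k*-mer and its reverse complement as identical. A minimal sketch of that idea follows, assuming the common convention (used by Mash) of representing each pair by its lexicographically smaller member. Whether `diverse-seq` picks the representative the same way is an assumption, and the function names are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(kmer):
    """Reverse complement of a DNA string."""
    return kmer.translate(COMPLEMENT)[::-1]


def canonical(kmer):
    """Lexicographically smaller of a k-mer and its reverse complement."""
    return min(kmer, reverse_complement(kmer))


def canonical_kmers(seq, k):
    """Set of canonical k-mers in seq."""
    return {canonical(seq[i : i + k]) for i in range(len(seq) - k + 1)}


seq = "ACGGTCA"  # toy sequence
# A sequence and its reverse complement yield the same canonical k-mer set
print(canonical_kmers(seq, 3) == canonical_kmers(reverse_complement(seq), 3))
```

This makes the k-mer content independent of which strand was sequenced, which matters when genomes in a collection are not consistently oriented.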
<details>
<summary>Options for cogent3 app dvs_ctree</summary>
The `dvs ctree` command is also available as the [cogent3 app](https://cogent3.org/doc/app/index.html) `dvs_ctree` or `dvs_par_ctree`. The latter is not composable, but can run the analysis for a single collection in parallel.
<!-- [[[cog
import cog
import contextlib
import io
from cogent3 import app_help
buffer = io.StringIO()
with contextlib.redirect_stdout(buffer):
    app_help("dvs_ctree")
cog.out(
"```\n{}\n```".format(buffer.getvalue())
)
]]] -->
```
Overview
--------
Create a cluster tree from kmer distances.
Options for making the app
--------------------------
dvs_ctree_app = get_app(
'dvs_ctree',
k=12,
sketch_size=3000,
moltype='dna',
distance_mode='mash',
mash_canonical_kmers=None,
show_progress=False,
)
Initialise parameters for generating a kmer cluster tree.
Parameters
----------
k
kmer size
sketch_size
size of sketches, only applies to mash distance
moltype
seq collection molecular type
distance_mode
mash distance or euclidean distance between kmer freqs
mash_canonical_kmers
whether to use mash canonical kmers for mash distance
show_progress
whether to show progress bars
Notes
-----
This app is composable.
If mash_canonical_kmers is enabled when using the mash distance,
kmers are considered identical to their reverse complement.
References
----------
.. [1] Ondov, B. D., Treangen, T. J., Melsted, P., Mallonee, A. B.,
Bergman, N. H., Koren, S., & Phillippy, A. M. (2016).
Mash: fast genome and metagenome distance estimation using MinHash.
Genome biology, 17, 1-14.
Input type
----------
ArrayAlignment, SequenceCollection, Alignment
Output type
-----------
PhyloNode
```
<!-- [[[end]]] -->
<!-- [[[cog
import cog
import contextlib
import io
from cogent3 import app_help
buffer = io.StringIO()
with contextlib.redirect_stdout(buffer):
    app_help("dvs_par_ctree")
cog.out(
"```\n{}\n```".format(buffer.getvalue())
)
]]] -->
```
Overview
--------
Create a cluster tree from kmer distances in parallel.
Options for making the app
--------------------------
dvs_par_ctree_app = get_app(
'dvs_par_ctree',
k=12,
sketch_size=3000,
moltype='dna',
distance_mode='mash',
mash_canonical_kmers=None,
show_progress=False,
max_workers=None,
parallel=True,
)
Initialise parameters for generating a kmer cluster tree.
Parameters
----------
k
kmer size
sketch_size
size of sketches, only applies to mash distance
moltype
seq collection molecular type
distance_mode
mash distance or euclidean distance between kmer freqs
mash_canonical_kmers
whether to use mash canonical kmers for mash distance
show_progress
whether to show progress bars
numprocs
number of workers, defaults to running serial
Notes
-----
This app is not composable but can run in parallel. It is
best suited to a single large sequence collection.
If mash_canonical_kmers is enabled when using the mash distance,
kmers are considered identical to their reverse complement.
References
----------
.. [1] Ondov, B. D., Treangen, T. J., Melsted, P., Mallonee, A. B.,
Bergman, N. H., Koren, S., & Phillippy, A. M. (2016).
Mash: fast genome and metagenome distance estimation using MinHash.
Genome biology, 17, 1-14.
Input type
----------
ArrayAlignment, SequenceCollection, Alignment
Output type
-----------
PhyloNode
```
<!-- [[[end]]] -->
</details>
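
For readers unfamiliar with the mash distance, the following pure-Python sketch illustrates how `k` and `sketch_size` enter the calculation, following the description in Ondov et al. (2016): each sequence is reduced to its `sketch_size` smallest k-mer hashes, the Jaccard index `j` is estimated from the merged sketches, and the distance is `-ln(2j / (1 + j)) / k`. The hash function and all names here are assumptions chosen for illustration, not the `diverse-seq` implementation.

```python
import hashlib
import math


def kmers(seq, k):
    """Set of overlapping k-mers in seq."""
    return {seq[i : i + k] for i in range(len(seq) - k + 1)}


def hash_kmer(kmer):
    """64-bit hash of a k-mer (blake2b is an arbitrary choice for this sketch)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def sketch(seq, k, sketch_size):
    """The sketch_size smallest k-mer hashes (a bottom-s MinHash sketch)."""
    return sorted(hash_kmer(km) for km in kmers(seq, k))[:sketch_size]


def mash_distance(sk1, sk2, k, sketch_size):
    """Mash distance estimated from two sketches."""
    s1, s2 = set(sk1), set(sk2)
    merged = sorted(s1 | s2)[:sketch_size]
    shared = sum(1 for h in merged if h in s1 and h in s2)
    j = shared / len(merged)  # Jaccard index estimate
    if j == 0:
        return 1.0  # no shared hashes: maximum distance
    return max(0.0, -math.log(2 * j / (1 + j)) / k)


k, size = 4, 100
# toy sequences, not real data
s1 = "ACGTACGGTCAGTCAGGCTA"
s2 = "ACGTACGGTCAGTCAGGCTT"  # differs from s1 at the last base
s3 = "TTTTTTTTTTTTTTTTTTTT"
sk = [sketch(s, k, size) for s in (s1, s2, s3)]
print(mash_distance(sk[0], sk[0], k, size))
print(mash_distance(sk[0], sk[1], k, size))
print(mash_distance(sk[0], sk[2], k, size))
```

Because only the smallest hashes are kept, the cost of comparing two genomes depends on the sketch size rather than on genome length, which is what makes this distance practical for large collections.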
Raw data
{
"_id": null,
"home_page": null,
"name": "diverse-seq",
"maintainer": null,
"docs_url": null,
"requires_python": "<3.13,>=3.10",
"maintainer_email": null,
"keywords": "biology, genomics, statistics, phylogeny, evolution, bioinformatics",
"author": null,
"author_email": "Gavin Huttley <Gavin.Huttley@anu.edu.au>",
"download_url": "https://files.pythonhosted.org/packages/20/33/f9afe7804502cdbecc8586e62cb67b50d72f6171f49a5d4974991f32b6ea/diverse_seq-2024.11.22a1.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": null,
"summary": "diverse_seq: a tool for sampling diverse biological sequences",
"version": "2024.11.22a1",
"project_urls": {
"Bug Tracker": "https://github.com/HuttleyLab/DiverseSeq/issues",
"Documentation": "https://github.com/HuttleyLab/DiverseSeq",
"Source Code": "https://github.com/HuttleyLab/DiverseSeq/"
},
"split_keywords": [
"biology",
" genomics",
" statistics",
" phylogeny",
" evolution",
" bioinformatics"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "f3f004e129fd046546dc8f70c9b7d54f5d82af492f8e6d5fe43f98adf6b7fd58",
"md5": "2efea1ffa50dc6568db495e1e77e8cf3",
"sha256": "c780ea677bea640ab2a77ff2e00b7a289575b3097a438aa5db71aa209e3b1bcc"
},
"downloads": -1,
"filename": "diverse_seq-2024.11.22a1-py3-none-any.whl",
"has_sig": false,
"md5_digest": "2efea1ffa50dc6568db495e1e77e8cf3",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": "<3.13,>=3.10",
"size": 33542,
"upload_time": "2024-11-22T08:40:36",
"upload_time_iso_8601": "2024-11-22T08:40:36.826234Z",
"url": "https://files.pythonhosted.org/packages/f3/f0/04e129fd046546dc8f70c9b7d54f5d82af492f8e6d5fe43f98adf6b7fd58/diverse_seq-2024.11.22a1-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "2033f9afe7804502cdbecc8586e62cb67b50d72f6171f49a5d4974991f32b6ea",
"md5": "36c37a8709b261a18fa0b2ee2178264c",
"sha256": "960d6b3640c192b5f99972b3a29a877eafc86d5013fe5a9a90bc97111c6396f9"
},
"downloads": -1,
"filename": "diverse_seq-2024.11.22a1.tar.gz",
"has_sig": false,
"md5_digest": "36c37a8709b261a18fa0b2ee2178264c",
"packagetype": "sdist",
"python_version": "source",
"requires_python": "<3.13,>=3.10",
"size": 147811,
"upload_time": "2024-11-22T08:40:38",
"upload_time_iso_8601": "2024-11-22T08:40:38.335278Z",
"url": "https://files.pythonhosted.org/packages/20/33/f9afe7804502cdbecc8586e62cb67b50d72f6171f49a5d4974991f32b6ea/diverse_seq-2024.11.22a1.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-11-22 08:40:38",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "HuttleyLab",
"github_project": "DiverseSeq",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"lcname": "diverse-seq"
}