Name | micov |
Version | 2025.2 |
home_page | None |
Summary | None |
upload_time | 2025-02-14 18:39:19 |
maintainer | None |
docs_url | None |
author | None |
requires_python | >=3.9 |
license | ==========================
The micov licensing terms
==========================
The micov project is licensed under the terms of the Modified BSD License
(also known as New or Revised BSD), as follows:
Copyright (c) 2024-, The micov Development Team <damcdonald@ucsd.edu>
All rights reserved.
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:
* Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.
* Redistributions in binary form must reproduce the above copyright
notice, this list of conditions and the following disclaimer in the
documentation and/or other materials provided with the distribution.
* Neither the name of the micov development team nor the names of its
contributors may be used to endorse or promote products derived from this
software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL THE MICOV DEVELOPMENT TEAM BE LIABLE FOR
ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
(INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
|
keywords |
microbiome
bioinformatics
|
VCS |
|
bugtrack_url |
|
requirements |
No requirements were recorded.
|
## micov: aggregate MIcrobiome COVerage
We introduce aggregate MIcrobiome COVerage (micov), a bioinformatic tool that efficiently computes precise, optionally-aggregated, genomic coverage positions across numerous metagenomes and arbitrary sample types. Micov offers three key advantages over conventional tools: rapid sample type-specific cumulative coverage calculations, identification of mobile or polymorphic genetic elements, and detection of strain heterogeneity through coverage variations.
## Design
The primary input mapping structure for micov is per-sample SAM/BAM or BED
(3-column). These data are then consolidated into Parquet files to utilize
pushdown filters.
## Installation
We recommend creating a separate conda environment, and installing
into that.
```bash
$ pip install micov
```
## Installation From Source
To install the most up-to-date version of micov:
```bash
$ conda create -n micov python=3.12
# clone first so that ci/conda_requirements.txt is available locally
$ git clone https://github.com/biocore/micov.git
$ cd micov
$ conda install -q --yes -n micov -c conda-forge --file ci/conda_requirements.txt
$ conda activate micov
$ pip install -e .
```
## Example Usages
See below for examples of running `micov` on SAM files.
### 1. Set Up Environment
First, activate the **Conda environment** where `micov` is installed:
```bash
conda activate micov
```
### 2. Process SAM Files to Extract Covered Positions
Next, we will process SAM files to extract covered positions. Note: if you have
`coverages.tgz` coverage files from Qiita, skip to step 4. `micov` accepts
**headerless** SAM/BAM files and writes BED-like files that describe the
observed start and stop positions on the references in the SAM data.
If your input files contain headers, remove them using `samtools` before running micov:
```bash
samtools view -S input.sam > output.sam
```
Similarly, if your input files are in BAM format, convert them to SAM format using `samtools`:
```bash
samtools view input.bam > output.sam
```
Next, compress the SAM data into BED-like coverage files. The `samtools` command
above can be piped directly into `micov compress` if desired, but for
simplicity we demonstrate use from SAM files. When writing output, we assume
that the name of each SAM file corresponds to a sample name. The subsequent
steps expect the coverage files to have either a `.cov` or `.cov.gz` extension.
```bash
mkdir -p "./example/coverages"
for file in ./example/samfiles/*.sam.xz; do
    sample_id=$(basename "$file" .sam.xz)
    echo "Processing $file..."
    # Run micov compress; "$file" is quoted so paths containing spaces still work
    xzcat "$file" | micov compress | gzip > "./example/coverages/${sample_id}.cov.gz"
done
```
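To make the idea of "covered positions" concrete, the following is a minimal sketch of how headerless SAM records could be collapsed into merged start/stop intervals per reference. It is not micov's implementation: the function names are hypothetical, and the choice to merge adjacent intervals as well as overlapping ones is an assumption. It relies only on the SAM convention that `POS` is 1-based and that the CIGAR operations `M`, `D`, `N`, `=` and `X` consume reference bases.

```python
import re
from collections import defaultdict

# CIGAR operations that consume reference bases (per the SAM specification).
_REF_CONSUMING = set("MDN=X")
_CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def reference_span(pos, cigar):
    """Return the 0-based, half-open (start, stop) span of one alignment."""
    length = sum(int(n) for n, op in _CIGAR_RE.findall(cigar) if op in _REF_CONSUMING)
    start = pos - 1  # SAM POS is 1-based
    return start, start + length


def covered_regions(sam_lines):
    """Collapse headerless SAM records into merged intervals per reference."""
    spans = defaultdict(list)
    for line in sam_lines:
        fields = line.rstrip("\n").split("\t")
        rname, pos, cigar = fields[2], int(fields[3]), fields[5]
        if rname == "*" or cigar == "*":
            continue  # unmapped read, nothing is covered
        spans[rname].append(reference_span(pos, cigar))

    merged = {}
    for rname, intervals in spans.items():
        intervals.sort()
        out = [list(intervals[0])]
        for start, stop in intervals[1:]:
            if start <= out[-1][1]:  # overlapping or adjacent (assumption)
                out[-1][1] = max(out[-1][1], stop)
            else:
                out.append([start, stop])
        merged[rname] = [tuple(iv) for iv in out]
    return merged


# Three made-up records on a made-up reference "G1".
records = [
    "r1\t0\tG1\t1\t60\t10M\t*\t0\t0\t*\t*",
    "r2\t0\tG1\t6\t60\t5M2D5M\t*\t0\t0\t*\t*",
    "r3\t0\tG1\t101\t60\t10M\t*\t0\t0\t*\t*",
]
print(covered_regions(records))
```

The first two reads overlap and collapse into a single interval, while the third stays separate, which is why the BED-like output is typically much smaller than the SAM input.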
### 3. Consolidate Coverage Files
After extracting coverage data, consolidate the `.cov` files into Parquet
representations. This requires a **length mapping file (`length.tsv`)**, which
maps genome IDs to their corresponding genome lengths. An example length file
can be found in `./example/metadata/length.tsv`. If this file is not available,
it can be generated, for example, with `seqkit`:
```bash
seqkit fx2tab --length --name --header-line foo.fasta > length.tsv
```
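If `seqkit` is not at hand, the same information can be derived with a few lines of Python. This is only a sketch: `fasta_lengths` is a hypothetical helper, it keeps just the first whitespace-delimited token of each FASTA header as the genome ID (an assumption), and it prints plain `id<TAB>length` rows without a header line, since the exact header micov expects is not stated here.

```python
def fasta_lengths(lines):
    """Map each FASTA record ID to the total length of its sequence."""
    lengths = {}
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Assumption: the genome ID is the first token of the header.
            current = line[1:].split()[0]
            lengths[current] = 0
        elif current is not None:
            lengths[current] += len(line)
    return lengths


# Made-up FASTA content; a real file could be passed as an open file handle.
fasta = [">G1 some description", "ACGTACGT", "ACGT", ">G2", "AC"]
for genome_id, length in fasta_lengths(fasta).items():
    print(f"{genome_id}\t{length}")
```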
Now, consolidate the coverage files. On read, `micov` will interpret the non-extension
portion of a filename as the sample ID. For example, given `foo/bar/baz.cov.gz`, the
sample ID will be `baz`.
```bash
micov nonqiita-to-parquet \
    --pattern "example/coverages/*.cov.gz" \
    --output example/parquet/example \
    --lengths example/metadata/length.tsv
```
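The filename-to-sample-ID rule described above can be sketched as follows. The helper name is hypothetical and this is not micov's code; in particular, how micov treats filenames containing additional dots is not stated here, so the sketch only strips the two documented extensions.

```python
from pathlib import PurePosixPath


def sample_id_from_path(path):
    """Drop the directory and a trailing .cov.gz or .cov extension."""
    name = PurePosixPath(path).name
    for suffix in (".cov.gz", ".cov"):
        if name.endswith(suffix):
            return name[: -len(suffix)]
    return name  # unknown extension: leave the name untouched


print(sample_id_from_path("foo/bar/baz.cov.gz"))
```

Checking a handful of paths this way before consolidation is a cheap way to confirm that the derived IDs match the sample IDs used in the sample metadata.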
### 4. Convert Coverage Data to Parquet Format
`micov` provides functionality to convert **Qiita-formatted coverage data** into **Parquet format** as well.
```bash
mkdir -p "./example/parquet"
# note: multiple coverage files can be specified by repeating the --qiita-coverages argument
micov qiita-to-parquet \
    --qiita-coverages "./example/consolidate/consolidated.tgz" \
    --output "./example/parquet/example" \
    --lengths "./example/metadata/length.tsv"
```
### 5. Generate Per-Sample-Group Plots
A series of plots can be constructed guided by metadata. Specifically, `micov` produces the following:
* **Non-cumulative coverage curves** for each genome in the feature metadata.
* **Cumulative coverage curves** for each genome in the feature metadata. These accumulation data are supported by K-S tests written to the output directory.
* **Scaled and unscaled position plots** for each genome in the feature metadata.
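As an illustration of what a cumulative coverage curve measures, the sketch below tracks the percent of a genome covered as samples are added one at a time. It is a simplified stand-in rather than micov's algorithm: the function name, the per-position mask, and the order in which samples are added are all assumptions.

```python
def cumulative_coverage(sample_intervals, genome_length):
    """Percent of the genome covered after adding each sample in order."""
    covered = bytearray(genome_length)  # one 0/1 flag per genome position
    curve = []
    for intervals in sample_intervals:
        for start, stop in intervals:
            # Clamp to the genome so the mask never changes size.
            start, stop = max(start, 0), min(stop, genome_length)
            if start < stop:
                covered[start:stop] = b"\x01" * (stop - start)
        curve.append(100.0 * sum(covered) / genome_length)
    return curve


# Made-up half-open intervals for three samples on a 100 bp genome.
samples = [[(0, 20)], [(10, 40)], [(90, 100)]]
print(cumulative_coverage(samples, 100))
```

A non-cumulative curve would instead report each sample's own coverage, so comparing the two shows whether samples in a group cover the same regions or complementary ones.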
Categorical metadata can be used to group samples; `sample-metadata` is
required. The genomes to examine can optionally be constrained using
`features-to-keep`. Specific start and stop regions of genomes can also be
specified within `features-to-keep`, but this is currently limited to a single
region per genome.
`micov` expects the first column of a sample metadata file to be the sample ID
under the header `sample_id`. Similarly, the first column of a feature metadata
file should be the feature ID under the header `genome_id`.
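A quick pre-flight check of these header expectations might look like the sketch below. `check_first_column` is a hypothetical helper, not part of micov, and the tab delimiter is an assumption about the metadata files.

```python
import csv
import io


def check_first_column(handle, expected, delimiter="\t"):
    """Raise ValueError unless the first header field equals `expected`."""
    header = next(csv.reader(handle, delimiter=delimiter), None)
    if not header or header[0] != expected:
        found = header[0] if header else None
        raise ValueError(f"expected first column {expected!r}, found {found!r}")
    return header


# Made-up metadata content; real files would be opened with open(path, newline="").
sample_md = io.StringIO("sample_id\tdog\nS1\tyes\nS2\tno\n")
feature_md = io.StringIO("genome_id\nG1\nG2\n")

print(check_first_column(sample_md, "sample_id"))
print(check_first_column(feature_md, "genome_id"))
```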
The `--output` parameter specifies a prefix for the output files.
Optionally, Monte Carlo curves can be produced for the cumulative plots by
specifying `--monte`. There are two Monte Carlo options: `unfocused` and
`focused`. The `unfocused` option selects samples at random from those with
_any_ coverage data, while the `focused` option randomly selects samples with
nonzero coverage of the current genome. Both options select independently of
sample metadata, and both select the maximum number of samples observed in a
sample group.
```bash
mkdir -p "./example/plots/per_sample_groups"
micov per-sample-group \
    --parquet-coverage "./example/parquet/example" \
    --sample-metadata "./example/metadata/sample_metadata.txt" \
    --sample-metadata-column "dog" \
    --features-to-keep "./example/metadata/feature_metadata.txt" \
    --output "./example/plots/per_sample_groups/example" \
    --plot
```
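The difference between the `unfocused` and `focused` Monte Carlo options described above can be sketched as a choice of sampling pool. This is an interpretation of the description, not micov's code: the function name, the data layout (a mapping from sample ID to the set of genomes it covers), and the use of sampling without replacement are assumptions.

```python
import random


def monte_carlo_pick(coverage, groups, genome_id, mode, rng):
    """Pick samples at random, ignoring sample metadata groupings."""
    # Both modes draw as many samples as the largest sample group contains.
    n = max(len(members) for members in groups.values())
    if mode == "unfocused":
        pool = [s for s, genomes in coverage.items() if genomes]
    elif mode == "focused":
        pool = [s for s, genomes in coverage.items() if genome_id in genomes]
    else:
        raise ValueError(f"unknown mode: {mode!r}")
    pool.sort()  # stable order so a seeded generator is reproducible
    return rng.sample(pool, min(n, len(pool)))


# Made-up data: which genomes each sample covers, and the metadata groups.
coverage = {"S1": {"G1"}, "S2": {"G2"}, "S3": {"G1", "G2"}, "S4": set()}
groups = {"yes": ["S1", "S2"], "no": ["S3", "S4"]}

rng = random.Random(0)
print(monte_carlo_pick(coverage, groups, "G1", "unfocused", rng))
print(monte_carlo_pick(coverage, groups, "G1", "focused", rng))
```

In this toy data, `S4` has no coverage at all and is never drawn, while `S2` can appear in an unfocused draw but not in a focused draw for `G1`.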
### 6. Additional usage (optional)
Existing SAM/BAM data can be converted into coverage percentages by specifying length data at compression:
```bash
$ xzcat some_data.sam.xz | micov compress --length length.tsv > coverages.tsv
```
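Conceptually, a coverage percentage is the number of distinct covered positions divided by the genome length. The sketch below shows that arithmetic for half-open intervals that may overlap; `percent_covered` is a hypothetical helper, and the layout of the `coverages.tsv` file that micov writes is not described here.

```python
def percent_covered(intervals, genome_length):
    """Percent of a genome covered by possibly overlapping half-open intervals."""
    covered = 0
    last_stop = 0
    for start, stop in sorted(intervals):
        start = max(start, last_stop)  # skip positions already counted
        if stop > start:
            covered += stop - start
            last_stop = stop
    return 100.0 * covered / genome_length


# Made-up intervals on a 200 bp genome: 17 + 10 + 10 distinct positions.
print(percent_covered([(0, 17), (100, 110), (105, 120)], 200))
```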
Multiple coverage files for the same sample can be aggregated into a single file:
```bash
$ zcat run1/sample1.cov.gz run2/sample1.cov.gz | micov compress | gzip > combined/sample1.cov.gz
```
Raw data
{
"_id": null,
"home_page": null,
"name": "micov",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.9",
"maintainer_email": null,
"keywords": "microbiome, bioinformatics",
"author": null,
"author_email": "Daniel McDonald <damcdonald@ucsd.edu>, Sherlyn Weng <y1weng@ucsd.edu>, Caitlin Guccione <cguccion@ucsd.edu>",
"download_url": "https://files.pythonhosted.org/packages/4d/01/758c56e3de862fa0636f75d958e7523edd9239e887fc4bb8430695610392/micov-2025.2.tar.gz",
"platform": null,
"bugtrack_url": null,
"summary": null,
"version": "2025.2",
"project_urls": null,
"split_keywords": [
"microbiome",
" bioinformatics"
],
"urls": [
{
"comment_text": null,
"digests": {
"blake2b_256": "4d01758c56e3de862fa0636f75d958e7523edd9239e887fc4bb8430695610392",
"md5": "86b17f92349b032206ca9bd77a6fd829",
"sha256": "ea80ddb260f073bed201331aab91b28e289202f5b3b960cbb63373d19660c749"
},
"downloads": -1,
"filename": "micov-2025.2.tar.gz",
"has_sig": false,
"md5_digest": "86b17f92349b032206ca9bd77a6fd829",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.9",
"size": 88581,
"upload_time": "2025-02-14T18:39:19",
"upload_time_iso_8601": "2025-02-14T18:39:19.232024Z",
"url": "https://files.pythonhosted.org/packages/4d/01/758c56e3de862fa0636f75d958e7523edd9239e887fc4bb8430695610392/micov-2025.2.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2025-02-14 18:39:19",
"github": false,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"lcname": "micov"
}