moose-classifier


* Name: moose-classifier
* Version: 0.15
* Summary: Alignment based taxonomic classifier
* Upload time: 2025-09-06 04:14:16
* Author: Chris Rosenthal
* Requires Python: >=3.7
* Keywords: bioinformatics, blast, classifier, dna, genetics, genomics, ncbi, rna
* Requirements: sphinx, ghp-import, twine, tox, flake8, taxtastic

# moose

A tool for taxonomically selecting, grouping and summarizing pairwise
alignment classifications into something more concise and readable.

## authors

* [Noah Hoffman](https://github.com/nhoffman)
* [Tim Holland](https://github.com/tholland)
* [Daniel Hoogestraat](https://github.com/dhoogest)
* [Tyler Land](https://github.com/tyleraland)
* [Steve Salipante](mailto:stevesal@uw.edu)
* [Chris Rosenthal](mailto:crosenth@gmail.com)

## about

Moose groups pairwise alignments by taxonomy and alignment scores.  It is
designed to work with large data sets and uses pandas (the Python Data
Analysis Library) for data handling.

## dependencies

* Python >= 3.7
* [Pandas](https://pandas.pydata.org/) >= 2.0.2

## installation

Moose can be installed in a few ways:

From PyPI:

```
% pip install moose_classifier
```

Or cloned from GitHub:

```
% git clone https://github.com/crosenth/moose.git
% python moose/setup.py install
```

## examples

The following examples use results from 16S sequences aligned to a local
NCBI nt database.  For instructions on creating a local blast nt database see
the NCBI walkthroughs
[here](https://www.ncbi.nlm.nih.gov/sites/books/NBK537770/) and
[here](https://www.ncbi.nlm.nih.gov/sites/books/NBK279688/).

The simplest example pipes `blastn` output in the format
"10 qaccver saccver pident staxid" into the classifier and outputs a table of
species-level taxonomy results:

```
% blastn -db nt -outfmt "10 qaccver saccver pident staxid" -query sequences.fasta | classify --columns qaccver,saccver,pident,staxid
specimen,assignment_id,assignment,best_rank,max_percent,min_percent,min_threshold,reads,clusters,pct_reads
query1,0,Homo sapiens,species,99.67,99.67,0.00,1,1,100.00
query10,0,Actinobacteria*;uncultured bacterium*,species,100.00,93.21,0.00,1,1,100.00
query11,0,Bacteroidetes*;uncultured bacterium*/organism,species,100.00,91.26,0.00,1,1,100.00
query12,0,Apteryx australis*;Bacteria*;Firmicutes*,species,100.00,98.26,0.00,1,1,100.00
query13,0,Dikarya*;uncultured bacterium/eukaryote,species,100.00,82.88,0.00,1,1,100.00
query14,0,Saccharomyces cerevisiae*;uncultured eukaryote,species,100.00,99.00,0.00,1,1,100.00
query2,0,Homo sapiens;Pan troglodytes,species,97.07,95.40,0.00,1,1,100.00
query6,0,Bacteria*;Escherichia coli;Staphylococcus,species,100.00,98.62,0.00,1,1,100.00
query7,0,Bacteroidetes*;uncultured bacterium*/organism*,species,100.00,91.61,0.00,1,1,100.00
query8,0,Bacteria*;Escherichia coli;Staphylococcus,species,100.00,98.62,0.00,1,1,100.00
query9,0,Bacteria*;uncultured organism*,species,100.00,98.62,0.00,1,1,100.00
```

This example shows the bare minimum information required to simplify and group
alignment results: a query sequence (qseqid), a subject sequence (sseqid), a
percent identity (pident) and a subject taxonomy id (staxid).  If the staxid
column is unavailable, an accession to taxonomy id map file can be supplied
with the `--seq-info` argument.  Results are output in CSV format.

Sending the `blastn` results to a standalone file lets us look a bit closer
at what happened.  For the purposes of this walkthrough the CSV output will
be displayed as a nicely formatted table:

```
% blastn -db nt -outfmt "10 qaccver saccver pident staxid" -query sequences.fasta -out blast.csv
% wc --lines blast.csv
1084 blast.csv
% classify --columns qaccver,saccver,pident,staxid blast.csv
|----------+---------------+------------------------------------------------+-----------+-------------+-------------+---------------+-------+----------+-----------|
| specimen | assignment_id | assignment                                     | best_rank | max_percent | min_percent | min_threshold | reads | clusters | pct_reads |
|----------+---------------+------------------------------------------------+-----------+-------------+-------------+---------------+-------+----------+-----------|
| query1   | 0             | Homo sapiens                                   | species   | 99.67       | 99.67       | 0.00          | 1     | 1        | 100.00    |
| query10  | 0             | Actinobacteria*;uncultured bacterium*          | species   | 100.00      | 93.21       | 0.00          | 1     | 1        | 100.00    |
| query11  | 0             | Bacteroidetes*;uncultured bacterium*/organism  | species   | 100.00      | 91.26       | 0.00          | 1     | 1        | 100.00    |
| query12  | 0             | Apteryx australis*;Bacteria*;Firmicutes*       | species   | 100.00      | 98.26       | 0.00          | 1     | 1        | 100.00    |
| query13  | 0             | Dikarya*;uncultured bacterium/eukaryote        | species   | 100.00      | 82.88       | 0.00          | 1     | 1        | 100.00    |
| query14  | 0             | Saccharomyces cerevisiae*;uncultured eukaryote | species   | 100.00      | 99.00       | 0.00          | 1     | 1        | 100.00    |
| query2   | 0             | Homo sapiens;Pan troglodytes                   | species   | 97.07       | 95.40       | 0.00          | 1     | 1        | 100.00    |
| query6   | 0             | Bacteria*;Escherichia coli;Staphylococcus      | species   | 100.00      | 98.62       | 0.00          | 1     | 1        | 100.00    |
| query7   | 0             | Bacteroidetes*;uncultured bacterium*/organism* | species   | 100.00      | 91.61       | 0.00          | 1     | 1        | 100.00    |
| query8   | 0             | Bacteria*;Escherichia coli;Staphylococcus      | species   | 100.00      | 98.62       | 0.00          | 1     | 1        | 100.00    |
| query9   | 0             | Bacteria*;uncultured organism*                 | species   | 100.00      | 98.62       | 0.00          | 1     | 1        | 100.00    |
|----------+---------------+------------------------------------------------+-----------+-------------+-------------+---------------+-------+----------+-----------|
```

The 1,084 lines of blast results are grouped taxonomically and reduced to a
single row per query sequence.
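
To make the reduction concrete, here is a minimal sketch (not Moose's actual implementation) that collapses blast rows into one summary per query, which is where values like `max_percent` and `min_percent` come from.  The accessions, identities and tax ids in `BLAST_CSV` are made-up stand-ins for `blast.csv`, and `summarize` is a hypothetical helper name:

```python
import csv
import io
from collections import defaultdict

# Toy stand-in for blast.csv; every value below is made up for illustration.
BLAST_CSV = """\
query1,ACC0001.1,99.67,9606
query2,ACC0001.1,97.07,9606
query2,ACC0002.1,95.40,9598
"""

COLUMNS = ["qaccver", "saccver", "pident", "staxid"]


def summarize(handle):
    """Return {query: (hit_count, max_pident, min_pident)}."""
    identities = defaultdict(list)
    for row in csv.DictReader(handle, fieldnames=COLUMNS):
        identities[row["qaccver"]].append(float(row["pident"]))
    return {q: (len(p), max(p), min(p)) for q, p in identities.items()}


summary = summarize(io.StringIO(BLAST_CSV))
for query, (hits, max_percent, min_percent) in sorted(summary.items()):
    print(f"{query},{hits},{max_percent:.2f},{min_percent:.2f}")
```

However many hits a query has, it ends up as one row; the taxonomic naming of that row is a separate step.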

Taxonomy grouping is accomplished with a lineages table that can be specified
using the `--lineages` argument.  If a lineages file is not supplied, one is
generated automatically from NCBI taxonomy data.  A lineages table built by
Moose classify can be saved to a file using the `--lineages-out` argument,
which will speed up subsequent classify runs:

```
classify --columns qaccver,saccver,pident,staxid --lineages-out lineages.csv --specimen one blast.csv
```

### Taxonomy grouping

Classifications are taxonomically grouped according to `--max-group-size`,
which defaults to 3.  Classification names start at the species level and are
recursively regrouped at a higher taxonomic rank until `--max-group-size` is
satisfied.

Increasing `--max-group-size` to 5:

```
classify --columns qaccver,saccver,pident,staxid --lineages lineages.csv --max-group-size 5 blast.csv
|----------+---------------+---------------------------------------------------------------------------------------------------------------------+-----------+-------------+-------------+---------------+-------+----------+-----------|
| specimen | assignment_id | assignment                                                                                                          | best_rank | max_percent | min_percent | min_threshold | reads | clusters | pct_reads |
|----------+---------------+---------------------------------------------------------------------------------------------------------------------+-----------+-------------+-------------+---------------+-------+----------+-----------|
| query1   | 0             | Homo sapiens                                                                                                        | species   | 99.67       | 99.67       | 0.00          | 1     | 1        | 100.00    |
| query10  | 0             | Actinomycetales bacterium 'ARUP UnID 260'*;Corynebacterium*;uncultured actinobacterium/bacterium*                   | species   | 100.00      | 93.21       | 0.00          | 1     | 1        | 100.00    |
| query11  | 0             | Prevotella*;uncultured Bacteroidales bacterium*;uncultured Bacteroidetes bacterium;uncultured bacterium*/organism   | species   | 100.00      | 91.26       | 0.00          | 1     | 1        | 100.00    |
| query12  | 0             | Apteryx australis*;Bacilli*;Staphylococcus*;bacterium*;uncultured Firmicutes bacterium*;uncultured bacterium*       | species   | 100.00      | 98.26       | 0.00          | 1     | 1        | 100.00    |
| query13  | 0             | Saccharomycetales*;Xanthophyllomyces dendrorhous;uncultured bacterium/eukaryote                                     | species   | 100.00      | 82.88       | 0.00          | 1     | 1        | 100.00    |
| query14  | 0             | Saccharomyces cerevisiae*;uncultured eukaryote                                                                      | species   | 100.00      | 99.00       | 0.00          | 1     | 1        | 100.00    |
| query2   | 0             | Homo sapiens;Pan troglodytes                                                                                        | species   | 97.07       | 95.40       | 0.00          | 1     | 1        | 100.00    |
| query6   | 0             | Escherichia coli;Staphylococcus;bacterium CulaenoE10F;human oral bacterium C20;uncultured bacterium*                | species   | 100.00      | 98.62       | 0.00          | 1     | 1        | 100.00    |
| query7   | 0             | Prevotella*;uncultured Bacteroidales bacterium*;uncultured Bacteroidetes bacterium*;uncultured bacterium*/organism* | species   | 100.00      | 91.61       | 0.00          | 1     | 1        | 100.00    |
| query8   | 0             | Escherichia coli;Staphylococcus;bacterium CulaenoE10F;human oral bacterium C20;uncultured bacterium*                | species   | 100.00      | 98.62       | 0.00          | 1     | 1        | 100.00    |
| query9   | 0             | Bacteria*;Escherichia coli;Staphylococcus;uncultured organism*                                                      | species   | 100.00      | 98.62       | 0.00          | 1     | 1        | 100.00    |
|----------+---------------+---------------------------------------------------------------------------------------------------------------------+-----------+-------------+-------------+---------------+-------+----------+-----------|
```

And using `--max-group-size 9`:

```
classify --columns qaccver,saccver,pident,staxid --lineages lineages.csv --max-group-size 9 blast.csv
|----------+---------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------+-------------+-------------+---------------+-------+----------+-----------|
| specimen | assignment_id | assignment                                                                                                                                                                    | best_rank | max_percent | min_percent | min_threshold | reads | clusters | pct_reads |
|----------+---------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------+-------------+-------------+---------------+-------+----------+-----------|
| query1   | 0             | Homo sapiens                                                                                                                                                                  | species   | 99.67       | 99.67       | 0.00          | 1     | 1        | 100.00    |
| query10  | 0             | Actinomycetales bacterium 'ARUP UnID 260'*;Corynebacterium*;uncultured actinobacterium/bacterium*                                                                             | species   | 100.00      | 93.21       | 0.00          | 1     | 1        | 100.00    |
| query11  | 0             | Prevotella amnii/bivia;Prevotella sp. 3-5;uncultured Bacteroidales bacterium*;uncultured Bacteroidetes bacterium;uncultured Prevotella sp.*;uncultured bacterium*/organism    | species   | 100.00      | 91.26       | 0.00          | 1     | 1        | 100.00    |
| query12  | 0             | Apteryx australis*;Bacilli*;Staphylococcus*;bacterium*;uncultured Firmicutes bacterium*;uncultured bacterium*                                                                 | species   | 100.00      | 98.26       | 0.00          | 1     | 1        | 100.00    |
| query13  | 0             | Saccharomycetales*;Xanthophyllomyces dendrorhous;uncultured bacterium/eukaryote                                                                                               | species   | 100.00      | 82.88       | 0.00          | 1     | 1        | 100.00    |
| query14  | 0             | Saccharomyces cerevisiae*;uncultured eukaryote                                                                                                                                | species   | 100.00      | 99.00       | 0.00          | 1     | 1        | 100.00    |
| query2   | 0             | Homo sapiens;Pan troglodytes                                                                                                                                                  | species   | 97.07       | 95.40       | 0.00          | 1     | 1        | 100.00    |
| query6   | 0             | Escherichia coli;Staphylococcus;bacterium CulaenoE10F;human oral bacterium C20;uncultured bacterium*                                                                          | species   | 100.00      | 98.62       | 0.00          | 1     | 1        | 100.00    |
| query7   | 0             | Prevotella amnii/bivia*;Prevotella sp. 3-5;uncultured Bacteroidales bacterium*;uncultured Bacteroidetes bacterium*;uncultured Prevotella sp.*;uncultured bacterium*/organism* | species   | 100.00      | 91.61       | 0.00          | 1     | 1        | 100.00    |
| query8   | 0             | Escherichia coli;Staphylococcus;bacterium CulaenoE10F;human oral bacterium C20;uncultured bacterium*                                                                          | species   | 100.00      | 98.62       | 0.00          | 1     | 1        | 100.00    |
| query9   | 0             | Escherichia coli;Staphylococcus;bacterium CulaenoE10F;human oral bacterium C20;uncultured bacterium*/organism*                                                                | species   | 100.00      | 98.62       | 0.00          | 1     | 1        | 100.00    |
|----------+---------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------+-------------+-------------+---------------+-------+----------+-----------|
```

And using `--max-group-size 1`:

```
classify --columns qaccver,saccver,pident,staxid --lineages lineages.csv --max-group-size 1 blast.csv
|----------+---------------+--------------+--------------+-------------+-------------+---------------+-------+----------+-----------|
| specimen | assignment_id | assignment   | best_rank    | max_percent | min_percent | min_threshold | reads | clusters | pct_reads |
|----------+---------------+--------------+--------------+-------------+-------------+---------------+-------+----------+-----------|
| query1   | 0             | Homo sapiens | species      | 99.67       | 99.67       | 0.00          | 1     | 1        | 100.00    |
| query10  | 0             | Bacteria*    | superkingdom | 100.00      | 93.21       | 0.00          | 1     | 1        | 100.00    |
| query11  | 0             | root*        | root         | 100.00      | 91.26       | 0.00          | 1     | 1        | 100.00    |
| query12  | 0             | root*        | root         | 100.00      | 98.26       | 0.00          | 1     | 1        | 100.00    |
| query13  | 0             | root*        | root         | 100.00      | 82.88       | 0.00          | 1     | 1        | 100.00    |
| query14  | 0             | Eukaryota*   | superkingdom | 100.00      | 99.00       | 0.00          | 1     | 1        | 100.00    |
| query2   | 0             | Homininae    | subfamily    | 97.07       | 95.40       | 0.00          | 1     | 1        | 100.00    |
| query6   | 0             | Bacteria*    | superkingdom | 100.00      | 98.62       | 0.00          | 1     | 1        | 100.00    |
| query7   | 0             | root*        | root         | 100.00      | 91.61       | 0.00          | 1     | 1        | 100.00    |
| query8   | 0             | Bacteria*    | superkingdom | 100.00      | 98.62       | 0.00          | 1     | 1        | 100.00    |
| query9   | 0             | root*        | root         | 100.00      | 98.62       | 0.00          | 1     | 1        | 100.00    |
|----------+---------------+--------------+--------------+-------------+-------------+---------------+-------+----------+-----------|
```
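
The regrouping behaviour can be sketched roughly as follows.  This is a simplified illustration, not Moose's implementation: it regroups all names at a single rank, whereas the tables above show that Moose can mix ranks within one assignment.  `RANKS`, `LINEAGES` and `regroup` are hypothetical names, and the rank list is abbreviated:

```python
# Ranks ordered from most specific to most general (abbreviated for the sketch).
RANKS = ["species", "genus", "subfamily"]

# Toy lineages table: one entry per hit taxon, one name per rank.
LINEAGES = {
    "Homo sapiens": {
        "species": "Homo sapiens", "genus": "Homo", "subfamily": "Homininae"},
    "Pan troglodytes": {
        "species": "Pan troglodytes", "genus": "Pan", "subfamily": "Homininae"},
}


def regroup(taxa, max_group_size):
    """Climb ranks until the number of distinct names fits max_group_size."""
    for rank in RANKS:
        names = sorted({LINEAGES[t][rank] for t in taxa})
        if len(names) <= max_group_size:
            return rank, names
    # Nothing fits within the ranks of this sketch: fall back to root.
    return "root", ["root"]


hits = ["Homo sapiens", "Pan troglodytes"]
print(regroup(hits, 3))  # ('species', ['Homo sapiens', 'Pan troglodytes'])
print(regroup(hits, 1))  # ('subfamily', ['Homininae'])
```

This mirrors query2 in the tables above: two species fit within a group size of 3, but a group size of 1 forces the name up to the first rank the two lineages share.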

The `--specimen` argument assigns every query sequence to a single named specimen, which groups identical assignments together and simplifies the results further:

```
classify --columns qaccver,saccver,pident,staxid --lineages lineages.csv --max-group-size 1 --specimen one blast.csv
|----------+---------------+--------------+--------------+-------------+-------------+---------------+-------+----------+-----------|
| specimen | assignment_id | assignment   | best_rank    | max_percent | min_percent | min_threshold | reads | clusters | pct_reads |
|----------+---------------+--------------+--------------+-------------+-------------+---------------+-------+----------+-----------|
| one      | 0             | root*        | root         | 100.00      | 82.88       | 0.00          | 5     | 5        | 45.45     |
| one      | 1             | Bacteria*    | superkingdom | 100.00      | 93.21       | 0.00          | 3     | 3        | 27.27     |
| one      | 2             | Homo sapiens | species      | 99.67       | 99.67       | 0.00          | 1     | 1        | 9.09      |
| one      | 3             | Homininae    | subfamily    | 97.07       | 95.40       | 0.00          | 1     | 1        | 9.09      |
| one      | 4             | Eukaryota*   | superkingdom | 100.00      | 99.00       | 0.00          | 1     | 1        | 9.09      |
|----------+---------------+--------------+--------------+-------------+-------------+---------------+-------+----------+-----------|
```
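
The `reads` and `pct_reads` columns in this table can be reproduced with a short sketch.  The per-query assignments are copied from the `--max-group-size 1` table above; the counting logic is an assumption about how the columns are derived, not code taken from Moose:

```python
from collections import Counter

# Per-query assignments copied from the --max-group-size 1 table.
assignments = {
    "query1": "Homo sapiens", "query10": "Bacteria*", "query11": "root*",
    "query12": "root*", "query13": "root*", "query14": "Eukaryota*",
    "query2": "Homininae", "query6": "Bacteria*", "query7": "root*",
    "query8": "Bacteria*", "query9": "root*",
}

# With a single specimen, each query counts as one read.
reads = Counter(assignments.values())
total = sum(reads.values())
pct_reads = {name: round(100 * n / total, 2) for name, n in reads.items()}

for name, n in reads.most_common():
    print(f"one,{name},{n},{pct_reads[name]:.2f}")
```

Five of the eleven queries share the `root*` assignment, which gives the 45.45 shown in the first row.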

If `--columns` is not specified the classifier will check for a header
containing at least the qseqid, sseqid and pident columns.  If there is no
header then blast outfmt 6 columns are assumed.
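
The column resolution described above could look something like the sketch below.  `resolve_columns` is a hypothetical helper, not part of Moose; the `OUTFMT6` list is the default column order of BLAST+ tabular output:

```python
import csv
import io

REQUIRED = {"qseqid", "sseqid", "pident"}

# Default column order of BLAST+ tabular output (-outfmt 6).
OUTFMT6 = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def resolve_columns(first_row, columns=None):
    """Return (column_names, first_row_is_header)."""
    if columns:
        # An explicit --columns style value always wins.
        return columns.split(","), False
    if REQUIRED.issubset(first_row):
        # The first row names the required columns, so treat it as a header.
        return list(first_row), True
    return OUTFMT6, False


header = next(csv.reader(io.StringIO("qseqid,sseqid,pident,staxid\n")))
data = next(csv.reader(io.StringIO("query1\tACC0001.1\t99.67\n"), delimiter="\t"))

print(resolve_columns(header))
print(resolve_columns(data)[0][:3])
print(resolve_columns(data, "qaccver,saccver,pident,staxid"))
```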

### Alignment selection
TODO

### Rank thresholds

The Moose classifier accepts dynamic thresholds at any taxonomic rank, supplied
with the `--rank-thresholds` argument, so that it can provide the best possible
classification.  An example input looks like this:

```
% cl rank_thresholds.csv
|--------+------+---------+--------+-------+-------+--------+-------+---------+------------|
| tax_id | root | kingdom | phylum | class | order | family | genus | species | subspecies |
|--------+------+---------+--------+-------+-------+--------+-------+---------+------------|
| 1      | 75.0 | 75.0    | 80.0   | 90.0  | 93.0  | 95.0   | 97.0  | 99.0    |            |
|--------+------+---------+--------+-------+-------+--------+-------+---------+------------|
```

Any tax_id can be specified in a rank thresholds file.  If a tax_id is not
present in the rank thresholds file the classifier will work its way up the
reference sequence's taxonomy lineage until it finds a tax_id that has rank
thresholds assigned.
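
A minimal sketch of that lookup, assuming a simple child-to-parent map: `THRESHOLDS`, `PARENTS` and `thresholds_for` are hypothetical names, and the lineage is abbreviated (9606 is Homo sapiens, 9605 is Homo, 1 is root, with intermediate nodes omitted):

```python
# Rank thresholds keyed by tax_id; only the root row (tax_id 1) is defined.
THRESHOLDS = {"1": {"genus": 97.0, "species": 99.0}}

# Abbreviated child -> parent map; intermediate nodes are left out on purpose.
PARENTS = {"9606": "9605", "9605": "1"}


def thresholds_for(tax_id):
    """Return the thresholds of tax_id or of its nearest ancestor that has some."""
    while tax_id not in THRESHOLDS:
        if tax_id not in PARENTS:
            raise KeyError(f"no thresholds found at or above tax_id {tax_id}")
        tax_id = PARENTS[tax_id]
    return THRESHOLDS[tax_id]


print(thresholds_for("9606"))  # {'genus': 97.0, 'species': 99.0}
```

With a single row for tax_id 1, every reference sequence ends up with the same thresholds; adding a row for a more specific tax_id would override them for that part of the taxonomy only.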

The classifier uses the rank thresholds table to select the best available
hits at the lowest (most specific) rank possible.  Example usage:

```
classify --columns qaccver,saccver,pident,staxid --lineages lineages.csv --rank-thresholds rank_thresholds.csv --specimen one blast.csv
|----------+---------------+-----------------------------------------------------------------------------------+-----------+-------------+-------------+---------------+-------+----------+-----------|
| specimen | assignment_id | assignment                                                                        | best_rank | max_percent | min_percent | min_threshold | reads | clusters | pct_reads |
|----------+---------------+-----------------------------------------------------------------------------------+-----------+-------------+-------------+---------------+-------+----------+-----------|
| one      | 0             | Bacteroidetes*;uncultured bacterium*/organism*                                    | species   | 100.00      | 99.27       | 99.00         | 2     | 2        | 18.18     |
| one      | 1             | Bacteria*;Escherichia coli;Staphylococcus                                         | species   | 100.00      | 99.31       | 99.00         | 2     | 2        | 18.18     |
| one      | 2             | Saccharomyces cerevisiae*;uncultured eukaryote                                    | species   | 100.00      | 99.33       | 99.00         | 1     | 1        | 9.09      |
| one      | 3             | Homo sapiens                                                                      | species   | 99.67       | 99.67       | 99.00         | 1     | 1        | 9.09      |
| one      | 4             | Homo                                                                              | genus     | 97.07       | 97.07       | 97.00         | 1     | 1        | 9.09      |
| one      | 5             | Clavispora lusitaniae*                                                            | species   | 100.00      | 100.00      | 99.00         | 1     | 1        | 9.09      |
| one      | 6             | Bacteria*;uncultured organism*                                                    | species   | 100.00      | 99.31       | 99.00         | 1     | 1        | 9.09      |
| one      | 7             | Apteryx australis*;Bacteria*;Firmicutes*                                          | species   | 100.00      | 99.31       | 99.00         | 1     | 1        | 9.09      |
| one      | 8             | Actinomycetales bacterium 'ARUP UnID 260'*;Corynebacterium*;uncultured bacterium* | species   | 100.00      | 99.64       | 99.00         | 1     | 1        | 9.09      |
|----------+---------------+-----------------------------------------------------------------------------------+-----------+-------------+-------------+---------------+-------+----------+-----------|
```

There are a few things to notice when using a rank thresholds table.
The first is that the min_threshold column shows the lowest rank threshold
used for hit selection, and min_percent never falls below it.  The second is
the difference in classifications after dropping hits below the min_threshold.

Lastly, the genus level Homo classification was not rolled into the
Homo sapiens classification because the rank thresholds table determined that
the best hits available for that query sequence could only be classified at
the genus level.  This is despite the fact that the genus level Homo
classification was derived from Homo sapiens reference sequences.  What the
rank thresholds table defines is classification uncertainty.  So, despite the
hits to Homo sapiens reference sequences, the classifier could not determine
that the query sequence was in fact Homo sapiens, only that it was of genus
level Homo origin.
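
The Homo versus Homo sapiens outcome follows from comparing each query's best percent identity against the thresholds.  The sketch below uses the values from the tax_id 1 row of the example rank thresholds table; `best_rank` is a hypothetical helper that only illustrates the idea, not Moose's own selection code:

```python
# Thresholds from the tax_id 1 row of the example rank thresholds table,
# ordered from most specific to most general rank.
THRESHOLDS = [
    ("species", 99.0), ("genus", 97.0), ("family", 95.0), ("order", 93.0),
    ("class", 90.0), ("phylum", 80.0), ("kingdom", 75.0), ("root", 75.0),
]


def best_rank(pident):
    """Return (rank, threshold) for the most specific rank that pident satisfies."""
    for rank, threshold in THRESHOLDS:
        if pident >= threshold:
            return rank, threshold
    return None  # below every threshold


print(best_rank(99.67))  # ('species', 99.0)
print(best_rank(97.07))  # ('genus', 97.0)
```

A best hit of 99.67 clears the species threshold, while 97.07 only clears the genus threshold, so the second query is reported as Homo even though its hits are to Homo sapiens sequences.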

### Specimen map

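A three-column specimen,qseqid,weight file can be included using the
`--specimen-map` argument.  In the output below the weights of the query
sequences in each assignment are summed in the reads column.  An example might
look like this: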

```
cat specimen_map.csv
one,query1,100
one,query2,95
one,query6,75
one,query7,70
one,query8,65
one,query9,60
one,query10,55
one,query11,50
one,query12,45
one,query13,40
one,query14,35
classify --columns qaccver,saccver,pident,staxid --lineages lineages.csv --rank-thresholds rank_thresholds.csv --specimen-map specimen_map.csv blast.csv | cl
|----------+---------------+-----------------------------------------------------------------------------------+-----------+-------------+-------------+---------------+-------+----------+-----------|
| specimen | assignment_id | assignment                                                                        | best_rank | max_percent | min_percent | min_threshold | reads | clusters | pct_reads |
|----------+---------------+-----------------------------------------------------------------------------------+-----------+-------------+-------------+---------------+-------+----------+-----------|
| one      | 0             | Bacteria*;Escherichia coli;Staphylococcus                                         | species   | 100.00      | 99.31       | 99.00         | 140   | 2        | 20.29     |
| one      | 1             | Bacteroidetes*;uncultured bacterium*/organism*                                    | species   | 100.00      | 99.27       | 99.00         | 120   | 2        | 17.39     |
| one      | 2             | Homo sapiens                                                                      | species   | 99.67       | 99.67       | 99.00         | 100   | 1        | 14.49     |
| one      | 3             | Homo                                                                              | genus     | 97.07       | 97.07       | 97.00         | 95    | 1        | 13.77     |
| one      | 4             | Bacteria*;uncultured organism*                                                    | species   | 100.00      | 99.31       | 99.00         | 60    | 1        | 8.70      |
| one      | 5             | Actinomycetales bacterium 'ARUP UnID 260'*;Corynebacterium*;uncultured bacterium* | species   | 100.00      | 99.64       | 99.00         | 55    | 1        | 7.97      |
| one      | 6             | Apteryx australis*;Bacteria*;Firmicutes*                                          | species   | 100.00      | 99.31       | 99.00         | 45    | 1        | 6.52      |
| one      | 7             | Clavispora lusitaniae*                                                            | species   | 100.00      | 100.00      | 99.00         | 40    | 1        | 5.80      |
| one      | 8             | Saccharomyces cerevisiae*;uncultured eukaryote                                    | species   | 100.00      | 99.33       | 99.00         | 35    | 1        | 5.07      |
|----------+---------------+-----------------------------------------------------------------------------------+-----------+-------------+-------------+---------------+-------+----------+-----------|
```
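
The weighted `reads` and `pct_reads` values in this table can be reproduced with the sketch below.  The weights come from the specimen map shown above; the per-query assignments in `ASSIGNMENTS` are inferred from the `reads` totals in the table and are abbreviated, so treat them as an assumption rather than classifier output:

```python
import csv
import io
from collections import defaultdict

SPECIMEN_MAP = """\
one,query1,100
one,query2,95
one,query6,75
one,query7,70
one,query8,65
one,query9,60
one,query10,55
one,query11,50
one,query12,45
one,query13,40
one,query14,35
"""

# Abbreviated assignment names, inferred from the reads column of the table.
ASSIGNMENTS = {
    "query1": "Homo sapiens", "query2": "Homo",
    "query6": "Escherichia coli group", "query8": "Escherichia coli group",
    "query7": "Bacteroidetes group", "query11": "Bacteroidetes group",
    "query9": "uncultured organism group", "query10": "Corynebacterium group",
    "query12": "Firmicutes group", "query13": "Clavispora lusitaniae",
    "query14": "Saccharomyces cerevisiae group",
}

reads = defaultdict(int)
totals = defaultdict(int)
for specimen, qseqid, weight in csv.reader(io.StringIO(SPECIMEN_MAP)):
    reads[(specimen, ASSIGNMENTS[qseqid])] += int(weight)
    totals[specimen] += int(weight)

for (specimen, name), n in sorted(reads.items(), key=lambda kv: -kv[1]):
    print(f"{specimen},{name},{n},{100 * n / totals[specimen]:.2f}")
```

query6 and query8 contribute 75 + 65 = 140 reads out of a specimen total of 690, which gives the 20.29 in the first row.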

The classifier will interpret each qseqid in the specimen map file as part of
the specimen.  If a qseqid is not included in the blast.csv results then a
classification of `[no blast result]` will be assigned:

```
cat specimen_map.csv
one,query1,100
one,query2,95
one,query3,90
one,query4,85
one,query5,80
one,query6,75
one,query7,70
one,query8,65
one,query9,60
one,query10,55
one,query11,50
one,query12,45
one,query13,40
one,query14,35
classify --columns qaccver,saccver,pident,staxid --lineages lineages.csv --rank-thresholds rank_thresholds.csv --specimen-map specimen_map.csv blast.csv
|----------+---------------+-----------------------------------------------------------------------------------+-----------+-------------+-------------+---------------+-------+----------+-----------|
| specimen | assignment_id | assignment                                                                        | best_rank | max_percent | min_percent | min_threshold | reads | clusters | pct_reads |
|----------+---------------+-----------------------------------------------------------------------------------+-----------+-------------+-------------+---------------+-------+----------+-----------|
| one      | 0             | [no blast result]                                                                 |           |             |             |               | 255   | 3        | 26.98     |
| one      | 1             | Bacteria*;Escherichia coli;Staphylococcus                                         | species   | 100.00      | 99.31       | 99.00         | 140   | 2        | 14.81     |
| one      | 2             | Bacteroidetes*;uncultured bacterium*/organism*                                    | species   | 100.00      | 99.27       | 99.00         | 120   | 2        | 12.70     |
| one      | 3             | Homo sapiens                                                                      | species   | 99.67       | 99.67       | 99.00         | 100   | 1        | 10.58     |
| one      | 4             | Homo                                                                              | genus     | 97.07       | 97.07       | 97.00         | 95    | 1        | 10.05     |
| one      | 5             | Bacteria*;uncultured organism*                                                    | species   | 100.00      | 99.31       | 99.00         | 60    | 1        | 6.35      |
| one      | 6             | Actinomycetales bacterium 'ARUP UnID 260'*;Corynebacterium*;uncultured bacterium* | species   | 100.00      | 99.64       | 99.00         | 55    | 1        | 5.82      |
| one      | 7             | Apteryx australis*;Bacteria*;Firmicutes*                                          | species   | 100.00      | 99.31       | 99.00         | 45    | 1        | 4.76      |
| one      | 8             | Clavispora lusitaniae*                                                            | species   | 100.00      | 100.00      | 99.00         | 40    | 1        | 4.23      |
| one      | 9             | Saccharomyces cerevisiae*;uncultured eukaryote                                    | species   | 100.00      | 99.33       | 99.00         | 35    | 1        | 3.70      |
|----------+---------------+-----------------------------------------------------------------------------------+-----------+-------------+-------------+---------------+-------+----------+-----------|
```
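
The `[no blast result]` row can be checked with a short sketch.  The weights follow the specimen map above, where they step down by 5 from query1 to query14, and `blast_queries` lists the queries that appear in the earlier result tables; the set arithmetic is an assumption about the behaviour, not Moose's code:

```python
# Weights from the specimen map: query1 = 100 down to query14 = 35.
specimen_map = {f"query{n}": 105 - 5 * n for n in range(1, 15)}

# Queries that actually have rows in blast.csv (from the earlier tables).
blast_queries = {f"query{n}" for n in (1, 2, 6, 7, 8, 9, 10, 11, 12, 13, 14)}

missing = sorted(q for q in specimen_map if q not in blast_queries)
missing_reads = sum(specimen_map[q] for q in missing)
total_reads = sum(specimen_map.values())

print(missing)                                      # ['query3', 'query4', 'query5']
print(missing_reads, total_reads)                   # 255 945
print(f"{100 * missing_reads / total_reads:.2f}")   # 26.98
```

The three unmatched queries account for 3 clusters and 255 of 945 reads, matching the first row of the table.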

Multiple specimens can be specified, and the same qseqid can appear in more than one specimen:

```
cat specimen_map.csv
one,query1,100
one,query2,95
one,query3,90
one,query4,85
one,query5,80
one,query6,75
one,query7,70
one,query8,65
one,query9,60
one,query10,55
one,query11,50
one,query12,45
one,query13,40
one,query14,35
two,query3,1000
two,query6,500
two,query1,25
two,query8,900
classify --columns qaccver,saccver,pident,staxid --lineages lineages.csv --rank-thresholds rank_thresholds.csv --specimen-map specimen_map.csv blast.csv
|----------+---------------+-----------------------------------------------------------------------------------+-----------+-------------+-------------+---------------+-------+----------+-----------|
| specimen | assignment_id | assignment                                                                        | best_rank | max_percent | min_percent | min_threshold | reads | clusters | pct_reads |
|----------+---------------+-----------------------------------------------------------------------------------+-----------+-------------+-------------+---------------+-------+----------+-----------|
| one      | 0             | [no blast result]                                                                 |           |             |             |               | 255   | 3        | 26.98     |
| one      | 1             | Bacteria*;Escherichia coli;Staphylococcus                                         | species   | 100.00      | 99.31       | 99.00         | 140   | 2        | 14.81     |
| one      | 2             | Bacteroidetes*;uncultured bacterium*/organism*                                    | species   | 100.00      | 99.27       | 99.00         | 120   | 2        | 12.70     |
| one      | 3             | Homo sapiens                                                                      | species   | 99.67       | 99.67       | 99.00         | 100   | 1        | 10.58     |
| one      | 4             | Homo                                                                              | genus     | 97.07       | 97.07       | 97.00         | 95    | 1        | 10.05     |
| one      | 5             | Bacteria*;uncultured organism*                                                    | species   | 100.00      | 99.31       | 99.00         | 60    | 1        | 6.35      |
| one      | 6             | Actinomycetales bacterium 'ARUP UnID 260'*;Corynebacterium*;uncultured bacterium* | species   | 100.00      | 99.64       | 99.00         | 55    | 1        | 5.82      |
| one      | 7             | Apteryx australis*;Bacteria*;Firmicutes*                                          | species   | 100.00      | 99.31       | 99.00         | 45    | 1        | 4.76      |
| one      | 8             | Clavispora lusitaniae*                                                            | species   | 100.00      | 100.00      | 99.00         | 40    | 1        | 4.23      |
| one      | 9             | Saccharomyces cerevisiae*;uncultured eukaryote                                    | species   | 100.00      | 99.33       | 99.00         | 35    | 1        | 3.70      |
| two      | 0             | Bacteria*;Escherichia coli;Staphylococcus                                         | species   | 100.00      | 99.31       | 99.00         | 1400  | 2        | 57.73     |
| two      | 1             | [no blast result]                                                                 |           |             |             |               | 1000  | 1        | 41.24     |
| two      | 2             | Homo sapiens                                                                      | species   | 99.67       | 99.67       | 99.00         | 25    | 1        | 1.03      |
|----------+---------------+-----------------------------------------------------------------------------------+-----------+-------------+-------------+---------------+-------+----------+-----------|
```
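
The `reads` and `pct_reads` values for specimen `two` can be reproduced by hand from the map weights. The sketch below is not Moose's implementation. It assumes that `reads` is the sum of the map weights per assignment and that `pct_reads` is that sum divided by the specimen total. The `assignment_of` lookup is a hypothetical stand-in for the classifier's per-query result, copied from the tables in this walkthrough.

```python
from collections import defaultdict

# Rows for specimen "two" from specimen_map.csv: (specimen, qseqid, weight).
specimen_map = [
    ("two", "query3", 1000),
    ("two", "query6", 500),
    ("two", "query1", 25),
    ("two", "query8", 900),
]

# Hypothetical lookup standing in for the classifier's per-query assignment.
# query3 is absent because it has no rows in blast.csv.
assignment_of = {
    "query1": "Homo sapiens",
    "query6": "Bacteria*;Escherichia coli;Staphylococcus",
    "query8": "Bacteria*;Escherichia coli;Staphylococcus",
}

# Assumption: reads = sum of map weights for each assignment.
reads = defaultdict(int)
for specimen, qseqid, weight in specimen_map:
    assignment = assignment_of.get(qseqid, "[no blast result]")
    reads[assignment] += weight

# Assumption: pct_reads = share of the specimen's total weight.
total = sum(reads.values())
pct_reads = {name: round(100 * n / total, 2) for name, n in reads.items()}

for name in sorted(reads, key=reads.get, reverse=True):
    print(f"{name}\t{reads[name]}\t{pct_reads[name]:.2f}")
```

Under these assumptions the three rows come out as 1400 (57.73), 1000 (41.24) and 25 (1.03), matching the `two` rows in the table above.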

If a query sequence is returned in the blast.csv results but is not included in
the specimen map, it is added as its own specimen:

```
cat specimen_map.csv
one,query1,100
one,query4,85
one,query5,80
one,query6,75
one,query7,70
classify --columns qaccver,saccver,pident,staxid --lineages lineages.csv --rank-thresholds rank_thresholds.csv --specimen-map specimen_map.csv blast.csv
|----------+---------------+-----------------------------------------------------------------------------------+-----------+-------------+-------------+---------------+-------+----------+-----------|
| specimen | assignment_id | assignment                                                                        | best_rank | max_percent | min_percent | min_threshold | reads | clusters | pct_reads |
|----------+---------------+-----------------------------------------------------------------------------------+-----------+-------------+-------------+---------------+-------+----------+-----------|
| one      | 0             | [no blast result]                                                                 |           |             |             |               | 165   | 2        | 40.24     |
| one      | 1             | Homo sapiens                                                                      | species   | 99.67       | 99.67       | 99.00         | 100   | 1        | 24.39     |
| one      | 2             | Bacteria*;Escherichia coli;Staphylococcus                                         | species   | 100.00      | 99.31       | 99.00         | 75    | 1        | 18.29     |
| one      | 3             | Bacteroidetes*;uncultured bacterium*/organism*                                    | species   | 100.00      | 99.27       | 99.00         | 70    | 1        | 17.07     |
| query10  | 0             | Actinomycetales bacterium 'ARUP UnID 260'*;Corynebacterium*;uncultured bacterium* | species   | 100.00      | 99.64       | 99.00         | 1     | 1        | 100.00    |
| query11  | 0             | Bacteroidetes*;uncultured bacterium*/organism                                     | species   | 100.00      | 99.27       | 99.00         | 1     | 1        | 100.00    |
| query12  | 0             | Apteryx australis*;Bacteria*;Firmicutes*                                          | species   | 100.00      | 99.31       | 99.00         | 1     | 1        | 100.00    |
| query13  | 0             | Clavispora lusitaniae*                                                            | species   | 100.00      | 100.00      | 99.00         | 1     | 1        | 100.00    |
| query14  | 0             | Saccharomyces cerevisiae*;uncultured eukaryote                                    | species   | 100.00      | 99.33       | 99.00         | 1     | 1        | 100.00    |
| query2   | 0             | Homo                                                                              | genus     | 97.07       | 97.07       | 97.00         | 1     | 1        | 100.00    |
| query8   | 0             | Bacteria*;Escherichia coli;Staphylococcus                                         | species   | 100.00      | 99.31       | 99.00         | 1     | 1        | 100.00    |
| query9   | 0             | Bacteria*;uncultured organism*                                                    | species   | 100.00      | 99.31       | 99.00         | 1     | 1        | 100.00    |
|----------+---------------+-----------------------------------------------------------------------------------+-----------+-------------+-------------+---------------+-------+----------+-----------|
```
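
The fallback shown above can be sketched in a few lines. This is an illustration only, not Moose's code: it assumes an unmapped qseqid becomes a specimen named after itself with a weight of 1, which is what the `reads` column in the table suggests. The names `blast_qseqids` and `full_map` are hypothetical.

```python
# Rows from specimen_map.csv: (specimen, qseqid, weight).
specimen_map = [
    ("one", "query1", 100),
    ("one", "query4", 85),
    ("one", "query5", 80),
    ("one", "query6", 75),
    ("one", "query7", 70),
]

# Query ids that appear in blast.csv in this walkthrough.
blast_qseqids = [
    "query1", "query10", "query11", "query12", "query13", "query14",
    "query2", "query6", "query7", "query8", "query9",
]

mapped = {qseqid for _, qseqid, _ in specimen_map}

# Assumption: each unmapped query becomes its own specimen with weight 1.
unmapped = [q for q in blast_qseqids if q not in mapped]
full_map = specimen_map + [(q, q, 1) for q in unmapped]

for specimen, qseqid, weight in full_map:
    print(specimen, qseqid, weight)
```

The eight unmapped queries line up with the eight single-read specimens in the table, while `query4` and `query5` stay under specimen `one` as `[no blast result]` because they are mapped but have no blast rows.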

### Copy numbers

Multiple copies of a gene may be present in a species' genome, which can
distort the relative abundance of a classification.  The Moose classifier
accepts a two-column csv file, passed with the `--copy-numbers` argument, with these columns:

```
|--------+-------|
| tax_id | count |
|--------+-------|
```

The classifier divides the abundance of each final classification by the count
listed for its tax_id in this file and reports the result under the `corrected`
column of the output.  This is useful for adjusting the relative abundance of
species when, for example, doing 16S classifications.
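
The arithmetic behind the correction can be sketched as follows. This is a minimal illustration and not Moose's implementation: the tax_ids, read weights and counts are made up, and the sketch assumes the corrected value is simply the read weight divided by the copy count.

```python
# Hypothetical read weights per final classification tax_id (made-up values).
reads = {"tax_a": 1400, "tax_b": 120}

# Hypothetical rows of a --copy-numbers file: tax_id -> count (made-up values).
copy_numbers = {"tax_a": 7, "tax_b": 2}

# Assumption: corrected = reads / count for the classification's tax_id.
corrected = {tax_id: n / copy_numbers[tax_id] for tax_id, n in reads.items()}

for tax_id in reads:
    print(tax_id, reads[tax_id], corrected[tax_id])
```

In this made-up case the taxon with seven gene copies drops from 1400 to 200, so its lead over the two-copy taxon shrinks once copy number is accounted for.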



            

Raw data

            {
    "_id": null,
    "home_page": null,
    "name": "moose-classifier",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.7",
    "maintainer_email": "Dan Hoogestraat <dhoogest@uw.edu>",
    "keywords": "bioinformatics, blast, classifier, dna, genetics, genomics, ncbi, rna",
    "author": "Chris Rosenthal",
    "author_email": "crosenth@gmail.com",
    "download_url": "https://files.pythonhosted.org/packages/d0/88/d22fb1bbd64aac8dc738e889bbaaa976b508a012e93969f654745bb424c9/moose_classifier-0.15.tar.gz",
    "platform": null,
    "description": "# moose\n\nA tool for taxonomically selecting, grouping and summarizing pairwise\nalignment classifications into something more concise and readable.\n\n## authors\n\n* [Noah Hoffman](https://github.com/nhoffman)\n* [Tim Holland](https://github.com/tholland)\n* [Daniel Hoogestraat](https://github.com/dhoogest)\n* [Tyler Land](https://github.com/tyleraland)\n* [Steve Salipante](mailto:stevesal@uw.edu)\n* [Chris Rosenthal](mailto:crosenth@gmail.com)\n\n## about\n\nMoose groups pairwise alignments by taxonomy and alignment scores.  It works \nsafely with large data sets utilizing the Python Data Analysis Library.\n\n## dependencies\n\n* Python >= 3.7\n* [Pandas](https://pandas.pydata.org/) >= 2.0.2\n\n## installation\n\nMoose can be installed in a few ways:\n\nFrom PyPI:\n\n```\n% pip install moose_classifier\n```\n\nOr cloned from Github:\n\n```\n% git clone https://github.com/crosenth/moose.git\n% python moose/setup.py install\n```\n\n## examples\n\nThe following examples will use results using 16s sequences aligned to a local\nNCBI nt database.  
For instructions on creating a local blast nt database see\nthe NCBI walkthrough\n[here](https://www.ncbi.nlm.nih.gov/sites/books/NBK537770/) and\nand [here](https://www.ncbi.nlm.nih.gov/sites/books/NBK279688/).\n\nThe simplest example pipes blast \"10 qaccver saccver pident staxid\" into\nthe classifier and outputs a table of species level taxonomy results:\n\n```\n% blastn -db nt -outfmt \"10 qaccver saccver pident staxid\" -query sequences.fasta | classify --columns qaccver,saccver,pident,staxid\nspecimen,assignment_id,assignment,best_rank,max_percent,min_percent,min_threshold,reads,clusters,pct_reads\nquery1,0,Homo sapiens,species,99.67,99.67,0.00,1,1,100.00\nquery10,0,Actinobacteria*;uncultured bacterium*,species,100.00,93.21,0.00,1,1,100.00\nquery11,0,Bacteroidetes*;uncultured bacterium*/organism,species,100.00,91.26,0.00,1,1,100.00\nquery12,0,Apteryx australis*;Bacteria*;Firmicutes*,species,100.00,98.26,0.00,1,1,100.00\nquery13,0,Dikarya*;uncultured bacterium/eukaryote,species,100.00,82.88,0.00,1,1,100.00\nquery14,0,Saccharomyces cerevisiae*;uncultured eukaryote,species,100.00,99.00,0.00,1,1,100.00\nquery2,0,Homo sapiens;Pan troglodytes,species,97.07,95.40,0.00,1,1,100.00\nquery6,0,Bacteria*;Escherichia coli;Staphylococcus,species,100.00,98.62,0.00,1,1,100.00\nquery7,0,Bacteroidetes*;uncultured bacterium*/organism*,species,100.00,91.61,0.00,1,1,100.00\nquery8,0,Bacteria*;Escherichia coli;Staphylococcus,species,100.00,98.62,0.00,1,1,100.00\nquery9,0,Bacteria*;uncultured organism*,species,100.00,98.62,0.00,1,1,100.00\n```\n\nThis example shows the bare minimum information required to simplify and group\nalignment results: a query sequence (qseqid), subject sequence (sseqid), a\npercent identiy (pident) and a subject taxonomy id (staxid). If the staxid\ncolumn is unavailable an accession to taxonomy id map file can be used with\nthe `--seq-info` argument. 
Results are output in csv format.\n\nSending the `blastn` results to a standalone file we can look a bit closer \nat what happened.  And for purposes of this walkthrough the csv output will\nbe displated in as a nicely formatted table:\n\n```\n% blastn -outfmt \"10 qaccver saccver pident staxid\" -query sequences.fasta -out blast.csv\n% wc --lines blast.csv\n1084 blast.csv\n% classify --columns qaccver,saccver,pident,staxid blast.csv\nspecimen,assignment_id,assignment,best_rank,max_percent,min_percent,min_threshold,reads,clusters,pct_reads\n|----------+---------------+------------------------------------------------+-----------+-------------+-------------+---------------+-------+----------+-----------|\n| specimen | assignment_id | assignment                                     | best_rank | max_percent | min_percent | min_threshold | reads | clusters | pct_reads |\n|----------+---------------+------------------------------------------------+-----------+-------------+-------------+---------------+-------+----------+-----------|\n| query1   | 0             | Homo sapiens                                   | species   | 99.67       | 99.67       | 0.00          | 1     | 1        | 100.00    |\n| query10  | 0             | Actinobacteria*;uncultured bacterium*          | species   | 100.00      | 93.21       | 0.00          | 1     | 1        | 100.00    |\n| query11  | 0             | Bacteroidetes*;uncultured bacterium*/organism  | species   | 100.00      | 91.26       | 0.00          | 1     | 1        | 100.00    |\n| query12  | 0             | Apteryx australis*;Bacteria*;Firmicutes*       | species   | 100.00      | 98.26       | 0.00          | 1     | 1        | 100.00    |\n| query13  | 0             | Dikarya*;uncultured bacterium/eukaryote        | species   | 100.00      | 82.88       | 0.00          | 1     | 1        | 100.00    |\n| query14  | 0             | Saccharomyces cerevisiae*;uncultured eukaryote | species   | 100.00      | 99.00       | 0.00   
       | 1     | 1        | 100.00    |\n| query2   | 0             | Homo sapiens;Pan troglodytes                   | species   | 97.07       | 95.40       | 0.00          | 1     | 1        | 100.00    |\n| query6   | 0             | Bacteria*;Escherichia coli;Staphylococcus      | species   | 100.00      | 98.62       | 0.00          | 1     | 1        | 100.00    |\n| query7   | 0             | Bacteroidetes*;uncultured bacterium*/organism* | species   | 100.00      | 91.61       | 0.00          | 1     | 1        | 100.00    |\n| query8   | 0             | Bacteria*;Escherichia coli;Staphylococcus      | species   | 100.00      | 98.62       | 0.00          | 1     | 1        | 100.00    |\n| query9   | 0             | Bacteria*;uncultured organism*                 | species   | 100.00      | 98.62       | 0.00          | 1     | 1        | 100.00    |\n|----------+---------------+------------------------------------------------+-----------+-------------+-------------+---------------+-------+----------+-----------|\n```\n\n1,084 lines of blast results are conveniently grouped taxonomically and \nwith a single row per specimen query sequence.\n\nTaxonomy grouping is accomplished with a lineages table that can be specified\nusing the `--lineages` argument.  If a lineages file is not supplied it will be\ngenerated automatically using NCBI taxonomy data by default.  A Moose classify \nbuilt lineages table can be saved to a file using the `lineages-out` command \nwhich will speed up subsequent classify runs:\n\n```\nclassify --columns qaccver,saccver,pident,staxid --lineages-out lineages.csv --specimen one blast.csv\n```\n\n### Taxonomony grouping\n\nBy default, classifications are taxonomically grouped according to\n`--max-group-size` with 3 being the default.  Classification names will start\nat the species level by default and recursively regroup at a higher taxonomony\nuntil `--max-group-size` is satisfied.  
\n\nBy increasing the `--max-group-size 5`:\n```\nclassify --columns qaccver,saccver,pident,staxid --lineages lineages.csv --max-group-size 5 blast.csv\n|----------+---------------+---------------------------------------------------------------------------------------------------------------------+-----------+-------------+-------------+---------------+-------+----------+-----------|\n| specimen | assignment_id | assignment                                                                                                          | best_rank | max_percent | min_percent | min_threshold | reads | clusters | pct_reads |\n|----------+---------------+---------------------------------------------------------------------------------------------------------------------+-----------+-------------+-------------+---------------+-------+----------+-----------|\n| query1   | 0             | Homo sapiens                                                                                                        | species   | 99.67       | 99.67       | 0.00          | 1     | 1        | 100.00    |\n| query10  | 0             | Actinomycetales bacterium 'ARUP UnID 260'*;Corynebacterium*;uncultured actinobacterium/bacterium*                   | species   | 100.00      | 93.21       | 0.00          | 1     | 1        | 100.00    |\n| query11  | 0             | Prevotella*;uncultured Bacteroidales bacterium*;uncultured Bacteroidetes bacterium;uncultured bacterium*/organism   | species   | 100.00      | 91.26       | 0.00          | 1     | 1        | 100.00    |\n| query12  | 0             | Apteryx australis*;Bacilli*;Staphylococcus*;bacterium*;uncultured Firmicutes bacterium*;uncultured bacterium*       | species   | 100.00      | 98.26       | 0.00          | 1     | 1        | 100.00    |\n| query13  | 0             | Saccharomycetales*;Xanthophyllomyces dendrorhous;uncultured bacterium/eukaryote                                     | species   | 100.00      | 82.88       | 0.00          
| 1     | 1        | 100.00    |\n| query14  | 0             | Saccharomyces cerevisiae*;uncultured eukaryote                                                                      | species   | 100.00      | 99.00       | 0.00          | 1     | 1        | 100.00    |\n| query2   | 0             | Homo sapiens;Pan troglodytes                                                                                        | species   | 97.07       | 95.40       | 0.00          | 1     | 1        | 100.00    |\n| query6   | 0             | Escherichia coli;Staphylococcus;bacterium CulaenoE10F;human oral bacterium C20;uncultured bacterium*                | species   | 100.00      | 98.62       | 0.00          | 1     | 1        | 100.00    |\n| query7   | 0             | Prevotella*;uncultured Bacteroidales bacterium*;uncultured Bacteroidetes bacterium*;uncultured bacterium*/organism* | species   | 100.00      | 91.61       | 0.00          | 1     | 1        | 100.00    |\n| query8   | 0             | Escherichia coli;Staphylococcus;bacterium CulaenoE10F;human oral bacterium C20;uncultured bacterium*                | species   | 100.00      | 98.62       | 0.00          | 1     | 1        | 100.00    |\n| query9   | 0             | Bacteria*;Escherichia coli;Staphylococcus;uncultured organism*                                                      | species   | 100.00      | 98.62       | 0.00          | 1     | 1        | 100.00    |\n|----------+---------------+---------------------------------------------------------------------------------------------------------------------+-----------+-------------+-------------+---------------+-------+----------+-----------|\n```\n\nAnd\n\n```\nclassify --columns qaccver,saccver,pident,staxid --lineages lineages.csv --max-group-size 9 
blast.csv\n|----------+---------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------+-------------+-------------+---------------+-------+----------+-----------|\n| specimen | assignment_id | assignment                                                                                                                                                                    | best_rank | max_percent | min_percent | min_threshold | reads | clusters | pct_reads |\n|----------+---------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------+-------------+-------------+---------------+-------+----------+-----------|\n| query1   | 0             | Homo sapiens                                                                                                                                                                  | species   | 99.67       | 99.67       | 0.00          | 1     | 1        | 100.00    |\n| query10  | 0             | Actinomycetales bacterium 'ARUP UnID 260'*;Corynebacterium*;uncultured actinobacterium/bacterium*                                                                             | species   | 100.00      | 93.21       | 0.00          | 1     | 1        | 100.00    |\n| query11  | 0             | Prevotella amnii/bivia;Prevotella sp. 
3-5;uncultured Bacteroidales bacterium*;uncultured Bacteroidetes bacterium;uncultured Prevotella sp.*;uncultured bacterium*/organism    | species   | 100.00      | 91.26       | 0.00          | 1     | 1        | 100.00    |\n| query12  | 0             | Apteryx australis*;Bacilli*;Staphylococcus*;bacterium*;uncultured Firmicutes bacterium*;uncultured bacterium*                                                                 | species   | 100.00      | 98.26       | 0.00          | 1     | 1        | 100.00    |\n| query13  | 0             | Saccharomycetales*;Xanthophyllomyces dendrorhous;uncultured bacterium/eukaryote                                                                                               | species   | 100.00      | 82.88       | 0.00          | 1     | 1        | 100.00    |\n| query14  | 0             | Saccharomyces cerevisiae*;uncultured eukaryote                                                                                                                                | species   | 100.00      | 99.00       | 0.00          | 1     | 1        | 100.00    |\n| query2   | 0             | Homo sapiens;Pan troglodytes                                                                                                                                                  | species   | 97.07       | 95.40       | 0.00          | 1     | 1        | 100.00    |\n| query6   | 0             | Escherichia coli;Staphylococcus;bacterium CulaenoE10F;human oral bacterium C20;uncultured bacterium*                                                                          | species   | 100.00      | 98.62       | 0.00          | 1     | 1        | 100.00    |\n| query7   | 0             | Prevotella amnii/bivia*;Prevotella sp. 
3-5;uncultured Bacteroidales bacterium*;uncultured Bacteroidetes bacterium*;uncultured Prevotella sp.*;uncultured bacterium*/organism* | species   | 100.00      | 91.61       | 0.00          | 1     | 1        | 100.00    |\n| query8   | 0             | Escherichia coli;Staphylococcus;bacterium CulaenoE10F;human oral bacterium C20;uncultured bacterium*                                                                          | species   | 100.00      | 98.62       | 0.00          | 1     | 1        | 100.00    |\n| query9   | 0             | Escherichia coli;Staphylococcus;bacterium CulaenoE10F;human oral bacterium C20;uncultured bacterium*/organism*                                                                | species   | 100.00      | 98.62       | 0.00          | 1     | 1        | 100.00    |\n|----------+---------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------+-------------+-------------+---------------+-------+----------+-----------|\n```\n\nAnd using `--max-group-size 1`:\n\n```\nclassify --columns qaccver,saccver,pident,staxid --lineages lineages.csv --max-group-size 1 blast.csv\n|----------+---------------+--------------+--------------+-------------+-------------+---------------+-------+----------+-----------|\n| specimen | assignment_id | assignment   | best_rank    | max_percent | min_percent | min_threshold | reads | clusters | pct_reads |\n|----------+---------------+--------------+--------------+-------------+-------------+---------------+-------+----------+-----------|\n| query1   | 0             | Homo sapiens | species      | 99.67       | 99.67       | 0.00          | 1     | 1        | 100.00    |\n| query10  | 0             | Bacteria*    | superkingdom | 100.00      | 93.21       | 0.00          | 1     | 1        | 100.00    |\n| query11  | 0             | root*        | root         | 100.00      
| 91.26       | 0.00          | 1     | 1        | 100.00    |\n| query12  | 0             | root*        | root         | 100.00      | 98.26       | 0.00          | 1     | 1        | 100.00    |\n| query13  | 0             | root*        | root         | 100.00      | 82.88       | 0.00          | 1     | 1        | 100.00    |\n| query14  | 0             | Eukaryota*   | superkingdom | 100.00      | 99.00       | 0.00          | 1     | 1        | 100.00    |\n| query2   | 0             | Homininae    | subfamily    | 97.07       | 95.40       | 0.00          | 1     | 1        | 100.00    |\n| query6   | 0             | Bacteria*    | superkingdom | 100.00      | 98.62       | 0.00          | 1     | 1        | 100.00    |\n| query7   | 0             | root*        | root         | 100.00      | 91.61       | 0.00          | 1     | 1        | 100.00    |\n| query8   | 0             | Bacteria*    | superkingdom | 100.00      | 98.62       | 0.00          | 1     | 1        | 100.00    |\n| query9   | 0             | root*        | root         | 100.00      | 98.62       | 0.00          | 1     | 1        | 100.00    |\n|----------+---------------+--------------+--------------+-------------+-------------+---------------+-------+----------+-----------|\n```\n\nUsing the `--specimen` argument the results can be grouped together to further simplify the results:\n\n```\nclassify --columns qaccver,saccver,pident,staxid --lineages lineages.csv --max-group-size 1 --specimen one blast.csv\n|----------+---------------+--------------+--------------+-------------+-------------+---------------+-------+----------+-----------|\n| specimen | assignment_id | assignment   | best_rank    | max_percent | min_percent | min_threshold | reads | clusters | pct_reads |\n|----------+---------------+--------------+--------------+-------------+-------------+---------------+-------+----------+-----------|\n| one      | 0             | root*        | root         | 100.00      | 82.88    
   | 0.00          | 5     | 5        | 45.45     |\n| one      | 1             | Bacteria*    | superkingdom | 100.00      | 93.21       | 0.00          | 3     | 3        | 27.27     |\n| one      | 2             | Homo sapiens | species      | 99.67       | 99.67       | 0.00          | 1     | 1        | 9.09      |\n| one      | 3             | Homininae    | subfamily    | 97.07       | 95.40       | 0.00          | 1     | 1        | 9.09      |\n| one      | 4             | Eukaryota*   | superkingdom | 100.00      | 99.00       | 0.00          | 1     | 1        | 9.09      |\n|----------+---------------+--------------+--------------+-------------+-------------+---------------+-------+----------+-----------|\n```\n\nIf `--columns` is not specified the classifier will check for a header with \nminimum qseqid,sseqid,pident columns.  If no header than blast outfmt 6 columns\nare assumed.\n\n### Alignment selection\nTODO\n\n### Rank thresholds\n\nThe Moose classifier is built to accept dynamic thresholds for any taxonomic\nto provide the best possible classification using the `--rank-thresholds`\nargument. An example input looks like this:\n\n```\n% cl rank_thresholds.csv\n|--------+------+---------+--------+-------+-------+--------+-------+---------|\n| tax_id | root | kingdom | phylum | class | order | family | genus | species | subspecies |\n|--------+------+---------+--------+-------+-------+--------+-------+---------|\n| 1      | 75.0 | 75.0    | 80.0   | 90.0  | 93.0  | 95.0   | 97.0  |  99.0   |\n|--------+------+---------+--------+-------+-------+--------+-------+---------|\n```\n\nAny tax_id can be specified in a rank thresholds file.  If a tax_id is not\npresent the rank thresholds file the classifier will work its way up the \nreference sequence's taxonomy lineage in order to assign rank thresholds.\n\nThe classifier will use the rank threshold table to select the lowest possible\nbest hits available for classification.  
Example usage:\n\n```\nclassify --columns qaccver,saccver,pident,staxid --lineages lineages.csv --rank-thresholds rank_thresholds.csv --specimen one blast.csv\n|----------+---------------+-----------------------------------------------------------------------------------+-----------+-------------+-------------+---------------+-------+----------+-----------|\n| specimen | assignment_id | assignment                                                                        | best_rank | max_percent | min_percent | min_threshold | reads | clusters | pct_reads |\n|----------+---------------+-----------------------------------------------------------------------------------+-----------+-------------+-------------+---------------+-------+----------+-----------|\n| one      | 0             | Bacteroidetes*;uncultured bacterium*/organism*                                    | species   | 100.00      | 99.27       | 99.00         | 2     | 2        | 18.18     |\n| one      | 1             | Bacteria*;Escherichia coli;Staphylococcus                                         | species   | 100.00      | 99.31       | 99.00         | 2     | 2        | 18.18     |\n| one      | 2             | Saccharomyces cerevisiae*;uncultured eukaryote                                    | species   | 100.00      | 99.33       | 99.00         | 1     | 1        | 9.09      |\n| one      | 3             | Homo sapiens                                                                      | species   | 99.67       | 99.67       | 99.00         | 1     | 1        | 9.09      |\n| one      | 4             | Homo                                                                              | genus     | 97.07       | 97.07       | 97.00         | 1     | 1        | 9.09      |\n| one      | 5             | Clavispora lusitaniae*                                                            | species   | 100.00      | 100.00      | 99.00         | 1     | 1        | 9.09      |\n| one      | 6             | 
Bacteria*;uncultured organism*                                                    | species   | 100.00      | 99.31       | 99.00         | 1     | 1        | 9.09      |\n| one      | 7             | Apteryx australis*;Bacteria*;Firmicutes*                                          | species   | 100.00      | 99.31       | 99.00         | 1     | 1        | 9.09      |\n| one      | 8             | Actinomycetales bacterium 'ARUP UnID 260'*;Corynebacterium*;uncultured bacterium* | species   | 100.00      | 99.64       | 99.00         | 1     | 1        | 9.09      |\n|----------+---------------+-----------------------------------------------------------------------------------+-----------+-------------+-------------+---------------+-------+----------+-----------|\n```\n\nThere are a few things to notice when using a rank thresholds table.\nThe first is the min_percent column will correspond to the lowest rank \nthreshold used for hit selection.  The second is the difference in \nclassifications after dropping hits below the min_threshold.\n\nLastly, the genus level Homo classification was not rolled into the \nHomo sapiens classification because the rank thresholds table determined that\nthe best hits available for that query sequence could only be classified at the genus\nlevel.  This is despite the fact that the genus level Homo classification was\nderived from Homo sapien reference sequences.  What the rank thresholds table\ndefines is classification uncertainty.  So, despite the Homo sapien\nreference sequences hits the classifier could not determine the query sequence \nwas in fact Homo sapien but only of genus level Homo origin.\n\n### Specimen map\n\nA three column specimen,qseqid,weight file included using the `--specimen-map`\nargument.  
An example might look like this:\n\n```\ncat specimen_map.csv\none,query1,100\none,query2,95\none,query6,75\none,query7,70\none,query8,65\none,query9,60\none,query10,55\none,query11,50\none,query12,45\none,query13,40\none,query14,35\nclassify --columns qaccver,saccver,pident,staxid --lineages lineages.csv --rank-thresholds rank_thresholds.csv --specimen-map specimen_map.csv blast.csv | cl\n|----------+---------------+-----------------------------------------------------------------------------------+-----------+-------------+-------------+---------------+-------+----------+-----------|\n| specimen | assignment_id | assignment                                                                        | best_rank | max_percent | min_percent | min_threshold | reads | clusters | pct_reads |\n|----------+---------------+-----------------------------------------------------------------------------------+-----------+-------------+-------------+---------------+-------+----------+-----------|\n| one      | 0             | Bacteria*;Escherichia coli;Staphylococcus                                         | species   | 100.00      | 99.31       | 99.00         | 140   | 2        | 20.29     |\n| one      | 1             | Bacteroidetes*;uncultured bacterium*/organism*                                    | species   | 100.00      | 99.27       | 99.00         | 120   | 2        | 17.39     |\n| one      | 2             | Homo sapiens                                                                      | species   | 99.67       | 99.67       | 99.00         | 100   | 1        | 14.49     |\n| one      | 3             | Homo                                                                              | genus     | 97.07       | 97.07       | 97.00         | 95    | 1        | 13.77     |\n| one      | 4             | Bacteria*;uncultured organism*                                                    | species   | 100.00      | 99.31       | 99.00         | 60    | 1        | 8.70      
|\n| one      | 5             | Actinomycetales bacterium 'ARUP UnID 260'*;Corynebacterium*;uncultured bacterium* | species   | 100.00      | 99.64       | 99.00         | 55    | 1        | 7.97      |\n| one      | 6             | Apteryx australis*;Bacteria*;Firmicutes*                                          | species   | 100.00      | 99.31       | 99.00         | 45    | 1        | 6.52      |\n| one      | 7             | Clavispora lusitaniae*                                                            | species   | 100.00      | 100.00      | 99.00         | 40    | 1        | 5.80      |\n| one      | 8             | Saccharomyces cerevisiae*;uncultured eukaryote                                    | species   | 100.00      | 99.33       | 99.00         | 35    | 1        | 5.07      |\n|----------+---------------+-----------------------------------------------------------------------------------+-----------+-------------+-------------+---------------+-------+----------+-----------|\n```\n\nThe classifier will interpret each qseqid in the specimen map file as part of\nthe specimen.  
If a qseqid is not included in the blast.csv results then a \nclassification of `[no blast result]` will be assigned:\n\n```\ncat specimen_map.csv\none,query1,100\none,query2,95\none,query3,90\none,query4,85\none,query5,80\none,query6,75\none,query7,70\none,query8,65\none,query9,60\none,query10,55\none,query11,50\none,query12,45\none,query13,40\none,query14,35\nclassify --columns qaccver,saccver,pident,staxid --lineages lineages.csv --rank-thresholds rank_thresholds.csv --specimen-map specimen_map.csv blast.csv\n|----------+---------------+-----------------------------------------------------------------------------------+-----------+-------------+-------------+---------------+-------+----------+-----------|\n| specimen | assignment_id | assignment                                                                        | best_rank | max_percent | min_percent | min_threshold | reads | clusters | pct_reads |\n|----------+---------------+-----------------------------------------------------------------------------------+-----------+-------------+-------------+---------------+-------+----------+-----------|\n| one      | 0             | [no blast result]                                                                 |           |             |             |               | 255   | 3        | 26.98     |\n| one      | 1             | Bacteria*;Escherichia coli;Staphylococcus                                         | species   | 100.00      | 99.31       | 99.00         | 140   | 2        | 14.81     |\n| one      | 2             | Bacteroidetes*;uncultured bacterium*/organism*                                    | species   | 100.00      | 99.27       | 99.00         | 120   | 2        | 12.70     |\n| one      | 3             | Homo sapiens                                                                      | species   | 99.67       | 99.67       | 99.00         | 100   | 1        | 10.58     |\n| one      | 4             | Homo                                          
                                    | genus     | 97.07       | 97.07       | 97.00         | 95    | 1        | 10.05     |\n| one      | 5             | Bacteria*;uncultured organism*                                                    | species   | 100.00      | 99.31       | 99.00         | 60    | 1        | 6.35      |\n| one      | 6             | Actinomycetales bacterium 'ARUP UnID 260'*;Corynebacterium*;uncultured bacterium* | species   | 100.00      | 99.64       | 99.00         | 55    | 1        | 5.82      |\n| one      | 7             | Apteryx australis*;Bacteria*;Firmicutes*                                          | species   | 100.00      | 99.31       | 99.00         | 45    | 1        | 4.76      |\n| one      | 8             | Clavispora lusitaniae*                                                            | species   | 100.00      | 100.00      | 99.00         | 40    | 1        | 4.23      |\n| one      | 9             | Saccharomyces cerevisiae*;uncultured eukaryote                                    | species   | 100.00      | 99.33       | 99.00         | 35    | 1        | 3.70      |\n|----------+---------------+-----------------------------------------------------------------------------------+-----------+-------------+-------------+---------------+-------+----------+-----------|\n```\n\nMultiple specimens can be specified with qseqids of the same name:\n\n```\ncat specimen_map.csv\none,query1,100\none,query2,95\none,query3,90\none,query4,85\none,query5,80\none,query6,75\none,query7,70\none,query8,65\none,query9,60\none,query10,55\none,query11,50\none,query12,45\none,query13,40\none,query14,35\ntwo,query3,1000\ntwo,query6,500\ntwo,query1,25\ntwo,query8,900\nclassify --columns qaccver,saccver,pident,staxid --lineages lineages.csv --rank-thresholds rank_thresholds.csv --specimen-map specimen_map.csv 
blast.csv\n|----------+---------------+-----------------------------------------------------------------------------------+-----------+-------------+-------------+---------------+-------+----------+-----------|\n| specimen | assignment_id | assignment                                                                        | best_rank | max_percent | min_percent | min_threshold | reads | clusters | pct_reads |\n|----------+---------------+-----------------------------------------------------------------------------------+-----------+-------------+-------------+---------------+-------+----------+-----------|\n| one      | 0             | [no blast result]                                                                 |           |             |             |               | 255   | 3        | 26.98     |\n| one      | 1             | Bacteria*;Escherichia coli;Staphylococcus                                         | species   | 100.00      | 99.31       | 99.00         | 140   | 2        | 14.81     |\n| one      | 2             | Bacteroidetes*;uncultured bacterium*/organism*                                    | species   | 100.00      | 99.27       | 99.00         | 120   | 2        | 12.70     |\n| one      | 3             | Homo sapiens                                                                      | species   | 99.67       | 99.67       | 99.00         | 100   | 1        | 10.58     |\n| one      | 4             | Homo                                                                              | genus     | 97.07       | 97.07       | 97.00         | 95    | 1        | 10.05     |\n| one      | 5             | Bacteria*;uncultured organism*                                                    | species   | 100.00      | 99.31       | 99.00         | 60    | 1        | 6.35      |\n| one      | 6             | Actinomycetales bacterium 'ARUP UnID 260'*;Corynebacterium*;uncultured bacterium* | species   | 100.00      | 99.64       | 99.00         | 55    | 1  
      | 5.82      |\n| one      | 7             | Apteryx australis*;Bacteria*;Firmicutes*                                          | species   | 100.00      | 99.31       | 99.00         | 45    | 1        | 4.76      |\n| one      | 8             | Clavispora lusitaniae*                                                            | species   | 100.00      | 100.00      | 99.00         | 40    | 1        | 4.23      |\n| one      | 9             | Saccharomyces cerevisiae*;uncultured eukaryote                                    | species   | 100.00      | 99.33       | 99.00         | 35    | 1        | 3.70      |\n| two      | 0             | Bacteria*;Escherichia coli;Staphylococcus                                         | species   | 100.00      | 99.31       | 99.00         | 1400  | 2        | 57.73     |\n| two      | 1             | [no blast result]                                                                 |           |             |             |               | 1000  | 1        | 41.24     |\n| two      | 2             | Homo sapiens                                                                      | species   | 99.67       | 99.67       | 99.00         | 25    | 1        | 1.03      |\n|----------+---------------+-----------------------------------------------------------------------------------+-----------+-------------+-------------+---------------+-------+----------+-----------|\n```\n\nIf a query sequence is not included in the specimen map but returned as part\nof the blast.csv results then it will be added as its own specimen:\n\n```\ncat specimen_map.csv\none,query1,100\none,query4,85\none,query5,80\none,query6,75\none,query7,70\nclassify --columns qaccver,saccver,pident,staxid --lineages lineages.csv --rank-thresholds rank_thresholds.csv --specimen-map specimen_map.csv 
blast.csv\n|----------+---------------+-----------------------------------------------------------------------------------+-----------+-------------+-------------+---------------+-------+----------+-----------|\n| specimen | assignment_id | assignment                                                                        | best_rank | max_percent | min_percent | min_threshold | reads | clusters | pct_reads |\n|----------+---------------+-----------------------------------------------------------------------------------+-----------+-------------+-------------+---------------+-------+----------+-----------|\n| one      | 0             | [no blast result]                                                                 |           |             |             |               | 165   | 2        | 40.24     |\n| one      | 1             | Homo sapiens                                                                      | species   | 99.67       | 99.67       | 99.00         | 100   | 1        | 24.39     |\n| one      | 2             | Bacteria*;Escherichia coli;Staphylococcus                                         | species   | 100.00      | 99.31       | 99.00         | 75    | 1        | 18.29     |\n| one      | 3             | Bacteroidetes*;uncultured bacterium*/organism*                                    | species   | 100.00      | 99.27       | 99.00         | 70    | 1        | 17.07     |\n| query10  | 0             | Actinomycetales bacterium 'ARUP UnID 260'*;Corynebacterium*;uncultured bacterium* | species   | 100.00      | 99.64       | 99.00         | 1     | 1        | 100.00    |\n| query11  | 0             | Bacteroidetes*;uncultured bacterium*/organism                                     | species   | 100.00      | 99.27       | 99.00         | 1     | 1        | 100.00    |\n| query12  | 0             | Apteryx australis*;Bacteria*;Firmicutes*                                          | species   | 100.00      | 99.31       | 99.00         | 1     | 1  
      | 100.00    |\n| query13  | 0             | Clavispora lusitaniae*                                                            | species   | 100.00      | 100.00      | 99.00         | 1     | 1        | 100.00    |\n| query14  | 0             | Saccharomyces cerevisiae*;uncultured eukaryote                                    | species   | 100.00      | 99.33       | 99.00         | 1     | 1        | 100.00    |\n| query2   | 0             | Homo                                                                              | genus     | 97.07       | 97.07       | 97.00         | 1     | 1        | 100.00    |\n| query8   | 0             | Bacteria*;Escherichia coli;Staphylococcus                                         | species   | 100.00      | 99.31       | 99.00         | 1     | 1        | 100.00    |\n| query9   | 0             | Bacteria*;uncultured organism*                                                    | species   | 100.00      | 99.31       | 99.00         | 1     | 1        | 100.00    |\n|----------+---------------+-----------------------------------------------------------------------------------+-----------+-------------+-------------+---------------+-------+----------+-----------|\n```\n\n### copy numbers\n\nMultiple copies of the gene may be present in a species genome, which may\ndistort the relative abundance of a classification.  The Moose\nClassifier accepts a two column csv file, passed with `--copy-numbers`, with columns:\n\n```\n|--------+-------|\n| tax_id | count |\n|--------+-------|\n```\n\nand will divide the abundance of each final classification by the count listed for its\ntax_id in this file, reporting the result under the `corrected` column in the output file.  This is\nuseful for adjusting relative abundance of species when, for example, \ndoing 16s classifications.\n\n\n",
    "bugtrack_url": null,
    "license": null,
    "summary": "Alignment based taxonomic classifier",
    "version": "0.15",
    "project_urls": {
        "repository": "https://github.com/crosenth/moose"
    },
    "split_keywords": [
        "bioinformatics",
        " blast",
        " classifier",
        " dna",
        " genetics",
        " genomics",
        " ncbi",
        " rna"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "7e99f1aaaca84f4d1eecd48e1bebe983446637bfb5b7259bb20189f8c5ea4174",
                "md5": "5a8d4344211310e32f0e5bf9225c43ca",
                "sha256": "8076f9b33768bd3af501a748dd6a2570fa1d136ed06cf9be3f9ce727cee68091"
            },
            "downloads": -1,
            "filename": "moose_classifier-0.15-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "5a8d4344211310e32f0e5bf9225c43ca",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.7",
            "size": 20311,
            "upload_time": "2025-09-06T04:14:15",
            "upload_time_iso_8601": "2025-09-06T04:14:15.025006Z",
            "url": "https://files.pythonhosted.org/packages/7e/99/f1aaaca84f4d1eecd48e1bebe983446637bfb5b7259bb20189f8c5ea4174/moose_classifier-0.15-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "d088d22fb1bbd64aac8dc738e889bbaaa976b508a012e93969f654745bb424c9",
                "md5": "bfabf4171dde240b8acd6f03a767469a",
                "sha256": "a159fa12a7adcedd796d1cd99f17efdf920f33ff49272e9cde272a147957296a"
            },
            "downloads": -1,
            "filename": "moose_classifier-0.15.tar.gz",
            "has_sig": false,
            "md5_digest": "bfabf4171dde240b8acd6f03a767469a",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.7",
            "size": 692677,
            "upload_time": "2025-09-06T04:14:16",
            "upload_time_iso_8601": "2025-09-06T04:14:16.647348Z",
            "url": "https://files.pythonhosted.org/packages/d0/88/d22fb1bbd64aac8dc738e889bbaaa976b508a012e93969f654745bb424c9/moose_classifier-0.15.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-09-06 04:14:16",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "crosenth",
    "github_project": "moose",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "requirements": [
        {
            "name": "sphinx",
            "specs": []
        },
        {
            "name": "ghp-import",
            "specs": []
        },
        {
            "name": "twine",
            "specs": []
        },
        {
            "name": "tox",
            "specs": []
        },
        {
            "name": "flake8",
            "specs": []
        },
        {
            "name": "taxtastic",
            "specs": []
        }
    ],
    "lcname": "moose-classifier"
}
        