CCMetagen


- Name: CCMetagen
- Version: 1.5.0
- Home page: https://github.com/vrmarcelino/CCMetagen.git
- Summary: Microbiome classification pipeline
- Upload time: 2024-04-08 05:32:11
- Author: Vanessa R. Marcelino, Jan P. Buchmann, Andreas Sjödin, Philip T.L.C. Clausen
- Requires Python: >=3.9
- License: GPL-3.0
- Keywords: bioinformatics, taxonomy, metagenomic, classifier, kma

# CCMetagen

CCMetagen processes sequence alignments produced with [KMA](https://bitbucket.org/genomicepidemiology/kma), which implements the ConClave sorting scheme to achieve highly accurate read mappings. The pipeline is fast enough to use the whole NCBI nt collection as reference, facilitating the inclusion of understudied organisms, such as microbial eukaryotes, in metagenome surveys. CCMetagen produces ranked taxonomic results in user-friendly formats that are ready for publication or downstream statistical analyses.

If you use this tool, please cite CCMetagen and KMA:

  * [Marcelino VR, Clausen PT, Buchmann J, Wille M, Iredell JR, Meyer W, Lund O, Sorrell T, Holmes EC. 2019. CCMetagen: comprehensive and accurate identification of eukaryotes and prokaryotes in metagenomic data. Genome Biology. 2020 Dec;21(1):1-5.](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-020-02014-2)

  * [Clausen PT, Aarestrup FM, Lund O. 2018. Rapid and precise alignment of raw reads against redundant databases with KMA. BMC bioinformatics. 2018 Dec;19(1):307.](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-018-2336-6)

Besides the guidelines below, we also provide a tutorial to reproduce our metagenome classification analyses of the microbiome of wild birds [here](https://github.com/vrmarcelino/CCMetagen/tree/master/docs/tutorial).

The instructions below describe how to use the command-line version of the CCMetagen pipeline.

CCMetagen is also available as a web service at https://cge.food.dtu.dk/services/CCMetagen/.
Note that we recommend using the command-line version to analyze datasets larger than 1.5 Gb.


## Installation

We recommend installing CCMetagen through [conda](https://docs.conda.io/en/latest/). This will install CCMetagen along with all of its required dependencies.

After installing conda, you can create a new environment with CCMetagen through the following command:

```
conda create -n ccmetagen ccmetagen -c bioconda -c conda-forge
```

You can then activate your environment with:

```
conda activate ccmetagen
```

The `-n ccmetagen` flag names the environment `ccmetagen`, but you can choose any other name. The `ccmetagen` that follows specifies the package to install into that environment.
Finally, `-c bioconda -c conda-forge` selects the Bioconda and Conda-Forge channels, which host CCMetagen and its dependencies.

You can also install CCMetagen from source, or using pip, the Python package manager. For that, follow the installation instructions in the deprecated README file in the [docs](https://github.com/vrmarcelino/CCMetagen/tree/master/docs/old_repository_README.md).

Check your CCMetagen installation by running `CCMetagen.py --version` on the command line.


## Databases

After installing CCMetagen, you will need a reference database to perform taxonomic classification. There are two ways to obtain this:

**Option 1:** Download the indexed (ready-to-go) nt database from [here](https://melbourne.figshare.com/projects/CCMetagen_databases/195755).
Download the ncbi_nt_kma file (103 GB zipped) or the RefSeq_bf.zip file (90 GB zipped).
Unzip the database, e.g.: `unzip ncbi_nt_kma`.
The nt database contains the whole NCBI nucleotide collection (from 2019; an updated database is to be released soon) and is therefore suitable for identifying a range of microorganisms, including prokaryotes and eukaryotes.

There are two versions of the nt database: the one mentioned above, and another that does not contain environmental or artificial sequences. The file ncbi_nt_no_env_11jun2019.zip contains all NCBI nt entries excluding the descendants of environmental eukaryotes (taxid 61964), environmental prokaryotes (48479), unclassified sequences (12908) and artificial sequences (28384).

**Option 2:** Build your own reference database (recommended!)
Follow the instructions on the [KMA website](https://bitbucket.org/genomicepidemiology/kma) to index the database.
It is important that taxids are incorporated in sequence headers for processing with CCMetagen. Sequence headers should look like 
`>1234|sequence_description`, where 1234 is the taxid. 
We provide scripts to rename sequences in the nt database [here](https://github.com/vrmarcelino/CCMetagen/tree/master/docs/benchmarking/rename_nt).
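
As a rough illustration of the header format described above, the following Python sketch rewrites a FASTA header into the `>taxid|description` form and checks that a header already has a numeric taxid prefix. It is not the renaming script linked above: the function names are hypothetical, the accession-to-taxid lookup table is something you would need to build yourself (for example from NCBI's accession2taxid files), and keeping the original header text after the `|` is an assumption.

```python
def add_taxid_to_header(header: str, taxid_by_accession: dict) -> str:
    """Return a FASTA header in the '>taxid|description' form.

    `header` is an original header line such as '>AB123456.1 Some organism'.
    `taxid_by_accession` is a hypothetical lookup table (accession -> taxid).
    """
    if not header.startswith(">"):
        raise ValueError("not a FASTA header: " + header)
    # The accession is assumed to be the first whitespace-delimited token.
    accession = header[1:].split()[0]
    taxid = taxid_by_accession[accession]  # raises KeyError if unknown
    return f">{taxid}|{header[1:]}"


def has_taxid_prefix(header: str) -> bool:
    """Check that a header looks like '>1234|sequence_description'."""
    prefix, sep, _ = header[1:].partition("|")
    return header.startswith(">") and sep == "|" and prefix.isdigit()
```

Running `has_taxid_prefix` over every header before indexing with KMA is a cheap way to catch sequences that were missed during renaming.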

If you want to use the RefSeq database, the format is similar to the one required for Kraken. The [Opiniomics blog](http://www.opiniomics.org/building-a-kraken-database-with-new-ftp-structure-and-no-gi-numbers/) describes how to download sequences in an adequate format. Note that you still need to build the index with KMA: `kma_index -i refseq.fna -o refseq_indexed -NI -Sparse -` or `kma_index -i refseq.fna -o refseq_indexed -NI -Sparse TG` for faster analysis.


## Quick Start

  * First map sequence reads (or contigs) to the database with **KMA**.

For paired-end files:

```
kma -ipe $SAMPLE_R1 $SAMPLE_R2 -o sample_out_kma -t_db $db -t $th -1t1 -mem_mode -and -apm f
```

For single-end files:
```
kma -i $SAMPLE -o sample_out_kma -t_db $db -t $th -1t1 -mem_mode -and
```

If you want to calculate abundance in reads per million (RPM) or in number of reads (fragments), or if you want to calculate the proportion of mapped reads, add the flag -ef (extended features):
```
kma -ipe $SAMPLE_R1 $SAMPLE_R2 -o sample_out_kma -t_db $db -t $th -1t1 -mem_mode -and -apm f -ef
```

Where:

- `$db` is the path to the reference database
- `$th` is the number of threads
- `$SAMPLE_R1` is the path to the mate1 of a paired-end metagenome/metatranscriptome sample (fastq or fasta)
- `$SAMPLE_R2` is the path to the mate2 of a paired-end metagenome/metatranscriptome sample (fastq or fasta)
- `$SAMPLE` is the path to a single-end metagenome/metatranscriptome file (reads or contigs)

Then run `CCMetagen.py`:

```
CCMetagen.py -i $sample_out_kma.res -o results
```

Where `$sample_out_kma.res` is the alignment result file produced by KMA.

Note that if you are running CCMetagen from the local folder (instead of adding it to your path), you may need to add 'python' before CCMetagen: `python CCMetagen.py -i $sample_out_kma.res -o results`

Done! This applies an additional quality filter and outputs a text file with ranked taxonomic classifications and a Krona graph file for interactive visualization.

An example of the CCMetagen output can be found [here (.csv file)](https://github.com/vrmarcelino/CCMetagen/blob/master/docs/tutorial/figs_tutorial/Turnstone_Temperate_Flu_Ng.res.csv) and [here (.html file)](https://htmlpreview.github.io/?https://github.com/vrmarcelino/CCMetagen/blob/master/docs/tutorial/figs_tutorial/Turnstone_Temperate_Flu_Ng.res.html).

<img src=docs/tutorial/figs_tutorial/krona_photo.png width="500" height="419.64">

In the .csv file, you will find the depth (abundance) of each match.
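
When processing many samples, the two Quick Start steps can be assembled programmatically. The sketch below only builds the argument lists, using the flags shown in the commands above; the function names and parameters are hypothetical and not part of CCMetagen or KMA.

```python
def kma_command(reads, db, out_prefix, threads=1, extended=False):
    """Build the KMA command as a list of arguments.

    `reads` holds one path (single-end) or two paths (paired-end).
    """
    if len(reads) == 2:
        cmd = ["kma", "-ipe", reads[0], reads[1]]
    elif len(reads) == 1:
        cmd = ["kma", "-i", reads[0]]
    else:
        raise ValueError("expected one or two read files")
    cmd += ["-o", out_prefix, "-t_db", db, "-t", str(threads),
            "-1t1", "-mem_mode", "-and"]
    if len(reads) == 2:
        # Paired-end commands above also pass '-apm f'.
        cmd += ["-apm", "f"]
    if extended:
        # '-ef' makes KMA write the .mapstat file.
        cmd.append("-ef")
    return cmd


def ccmetagen_command(kma_prefix, out_path):
    """Build the CCMetagen command for a finished KMA run."""
    return ["CCMetagen.py", "-i", kma_prefix + ".res", "-o", out_path]
```

Each list can then be passed to `subprocess.run(cmd, check=True)`, which avoids shell quoting problems with unusual file names.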

## Abundance units

**Depth can be estimated in four ways:** 

1. By applying an additional correction for template length (default in KMA and CCMetagen);
2. By counting the number of nucleotides matching the reference sequence (use flag --depth_unit nc);
3. By calculating depth in Reads Per Million (RPM, use flag --depth_unit rpm); or
4. By counting the number of fragments (i.e. number of PE reads matching the reference sequence, use flag --depth_unit fr). If you want RPM or fragment units, you will need to supply the .mapstat file generated with KMA (which you get when running kma with the flag '-ef').
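
The two ratio-based units can be written out as plain arithmetic. This is a sketch rather than CCMetagen's own code: the function and parameter names are hypothetical, and using the total number of fragments in the sample as the RPM denominator is an assumption.

```python
def depth_kma(nucleotides_aligned: int, template_length: int) -> float:
    """Default unit: aligned nucleotides divided by the template length."""
    return nucleotides_aligned / template_length


def depth_rpm(fragments_mapped: int, total_fragments: int) -> float:
    """Reads per million: mapped fragments scaled to one million fragments."""
    return fragments_mapped * 1_000_000 / total_fragments
```

For instance, 200 aligned nucleotides on a 1,000 bp template give a default depth of 0.2, and 5 mapped fragments out of 2,000,000 give 2.5 RPM.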


## Balancing sensitivity and specificity

You can **adjust the stringency of the taxonomic assignments** by adjusting the minimum coverage (--coverage), the minimum abundance (--depth), and the minimum level of sequence similarity (--query_identity). Coverage is the percentage of bases in the reference sequence that is covered by the consensus sequence (your query). It can exceed 100% when the consensus sequence is longer than the reference (due to insertions, for example). You can also adjust the KMA settings to facilitate the identification of more distantly related taxa (see below).
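
Conceptually, these three settings act as minimum cut-offs applied to every match. The sketch below shows that idea on a plain dictionary; the dictionary keys and the function name are hypothetical, the default values are the ones listed in the option list of this README, and treating the cut-offs as inclusive is an assumption.

```python
def passes_filters(match, min_depth=0.2, min_coverage=20.0,
                   min_query_identity=50.0):
    """Return True if a match meets all three minimum cut-offs."""
    return (match["depth"] >= min_depth
            and match["coverage"] >= min_coverage
            and match["query_identity"] >= min_query_identity)


# Hypothetical matches, for illustration only.
matches = [
    {"name": "taxon_a", "depth": 0.5, "coverage": 35.0, "query_identity": 92.0},
    {"name": "taxon_b", "depth": 0.1, "coverage": 80.0, "query_identity": 99.0},
]
kept = [m["name"] for m in matches if passes_filters(m)]
```

Raising any cut-off makes the classification more specific, and lowering it makes it more sensitive.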

If you change the default depth unit, we recommend adjusting the minimum abundance (--depth) to remove taxa found in low abundance accordingly. For example, you can use -d 200 (200 nucleotides) when using --depth_unit nc, which is similar to -d 0.2 when using the default '--depth_unit kma' option. If you choose to calculate abundances in RPM, you may want to adjust the minimum abundance according to your sequence depth.
For example, to calculate abundances in RPM and filter out all matches with fewer than one read per million:

```
CCMetagen.py -i $sample_out_kma.res -o results -map $sample_out_kma.mapstat --depth_unit rpm --depth 1
```

If you would like to know the **proportion of reads mapped** to each template, run kma with the '-ef' flag. This will generate a file with the '.mapstat' extension. Then provide this file to CCMetagen (-map $sample_out_kma.mapstat) and add the flag '-ef y':

```
CCMetagen.py -i $sample_out_kma.res -o results -map $sample_out_kma.mapstat -ef y
```
This will filter the .mapstat file, removing the templates that did not pass CCMetagen's quality control, will add the percentage of mapped reads for each template and will output a file with extension 'stats_csv'. It will also output the overall proportion of reads mapped to these templates in the terminal. For more details about the additional columns of this file, please check [KMA's manual](https://bitbucket.org/genomicepidemiology/kma/src/master/KMAspecification.pdf).

When working with highly complex environments for which reference databases are scarce (e.g. many soil and marine metagenomes), it is common to obtain a low proportion of classified reads, especially if the sequencing depth is low. For a more sensitive analysis, besides relaxing the CCMetagen settings, you can adjust the KMA aligner settings, for example by removing the `-and` and the `-apm f` flags, so that you can get a match even when the reference sequences are not significantly overrepresented or when only one of the PE reads maps to the template. Check the [KMA manual](https://bitbucket.org/genomicepidemiology/kma/src/master/KMAspecification.pdf) for more details. It can also be useful to build a customized reference database with additional genomes of organisms that are closely related to what you expect to find in your samples.

## Understanding the ranked taxonomic output of CCMetagen:
The taxonomic classifications reflect the sequence similarity between query and reference sequences, according to default or user-defined similarity thresholds. For example, if a match is 97% similar to the reference sequence, the match will not get a species-level classification. If the match is 85% similar to the reference sequence, then the species, genus and family-level classifications will be 'none'.
Note that this is different from identifications tagged as unk_x (unknown taxa). These unknowns indicate taxa where higher-rank classifications have not been defined (according to the NCBI taxonomy database), and it is unrelated to sequence similarity.
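
The effect of the similarity thresholds can be sketched as follows. This is an illustration of the rule described above, not CCMetagen's implementation: the function name and the lineage dictionary are hypothetical, the threshold values are the defaults listed in the option list of this README, and treating a match exactly at the threshold as passing is an assumption.

```python
DEFAULT_THRESHOLDS = {
    "Species": 98.41, "Genus": 96.31, "Family": 88.51,
    "Order": 81.21, "Class": 80.91, "Phylum": 0.0,
}


def apply_thresholds(lineage, identity, thresholds=DEFAULT_THRESHOLDS):
    """Replace ranks whose threshold is not met with 'none'."""
    return {
        rank: (name if identity >= thresholds.get(rank, 0.0) else "none")
        for rank, name in lineage.items()
    }


lineage = {
    "Phylum": "Ascomycota", "Class": "Saccharomycetes",
    "Order": "Saccharomycetales", "Family": "Debaryomycetaceae",
    "Genus": "Candida", "Species": "Candida albicans",
}
at_97 = apply_thresholds(lineage, 97.0)  # only the species is dropped
at_85 = apply_thresholds(lineage, 85.0)  # species, genus and family are dropped
```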


For a list of options to customize your analysis, type:
```
CCMetagen.py -h
```

  * **To get the abundance of each taxon, and/or summarize results for multiple samples, use CCMetagen_merge**:
```
CCMetagen_merge.py -i $CCMetagen_out
```

Where `$CCMetagen_out` is the folder containing the CCMetagen taxonomic classifications.
The results must be in .csv format (default or '--mode text' output of CCMetagen), and these files **must end in ".ccm.csv"**.

The flag '-t' defines the taxonomic level at which to merge the results. The default is species-level.

You can also filter out specific taxa, at any taxonomic level:

Use flag -kr to keep (k) or remove (r) taxa.
Use flag -l to set the taxonomic level for the filtering.
Use flag -tlist to list the taxa to keep or remove (separated by comma).

EX1: Filter out bacteria: `CCMetagen_merge.py -i $CCMetagen_out -kr r -l Kingdom -tlist Bacteria`

EX2: Filter out bacteria and Metazoa: `CCMetagen_merge.py -i $CCMetagen_out -kr r -l Kingdom -tlist Bacteria,Metazoa`

EX3: Merge results at family-level, and remove Metazoa and Viridiplantae taxa at Kingdom level:
```
CCMetagen_merge.py -i $CCMetagen_out -t Family -kr r -l Kingdom -tlist Metazoa,Viridiplantae -o family_table
```

For species-level filtering (where there is a space in taxa names), use quotation marks.
EX4: Keep only _Escherichia coli_ and _Candida albicans_:
```
CCMetagen_merge.py -i 05_KMetagen/ -kr k -l Species -tlist "Escherichia coli,Candida albicans"
```
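
The keep/remove logic of the three flags can be pictured with the sketch below. It is a simplified illustration on a list of dictionaries, not the code of CCMetagen_merge: the function name and the row layout are hypothetical, and the taxa list is split on commas without trimming spaces, which is why the names should be given without spaces around the commas.

```python
def filter_taxa(rows, mode, level, taxa_list):
    """Keep (mode 'k') or remove (mode 'r') rows whose taxon at `level`
    appears in the comma-separated `taxa_list`."""
    wanted = set(taxa_list.split(","))
    if mode == "k":
        return [row for row in rows if row[level] in wanted]
    if mode == "r":
        return [row for row in rows if row[level] not in wanted]
    raise ValueError("mode must be 'k' or 'r'")


# Hypothetical rows, for illustration only.
rows = [
    {"Kingdom": "Bacteria", "Species": "Escherichia coli"},
    {"Kingdom": "Fungi", "Species": "Candida albicans"},
    {"Kingdom": "Metazoa", "Species": "Arenaria interpres"},
]
without_bacteria_metazoa = filter_taxa(rows, "r", "Kingdom", "Bacteria,Metazoa")
only_two_species = filter_taxa(rows, "k", "Species", "Escherichia coli,Candida albicans")
```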

If you only have one sample, you can also use CCMetagen_merge to get one line per taxon.

To see all options, type:
```
CCMetagen_merge.py -h
```
The merged output file should look like [this](https://github.com/vrmarcelino/CCMetagen/blob/master/tutorial/figs_tutorial/Bird_family_table_filtered.csv).


* **To extract sequences of a given taxon, use CCMetagen_extract_seqs**:

This script will produce a fasta file containing all reads assigned to a taxon of interest. 
Ex: Generate a fasta file containing all sequences that mapped to the genus _Escherichia_:
```
CCMetagen_extract_seqs.py -iccm $CCMetagen_out -ifrag $sample_out_kma.frag -l Genus -t Escherichia
```

Where `$CCMetagen_out` is the .csv file generated with CCMetagen and `$sample_out_kma.frag` is the .frag file generated with KMA. The frag file needs to be decompressed: `gunzip *.frag.gz`

For species-level filtering (where there is a space in taxon names), use quotation marks.
Ex: Generate a fasta file containing all sequences that mapped to _E. coli_:
```
CCMetagen_extract_seqs.py -iccm $CCMetagen_out -ifrag $sample_out_kma.frag -l Species -t "Escherichia coli"
```


**Check out our [tutorial](https://github.com/vrmarcelino/CCMetagen/tree/master/docs/tutorial) for an applied example of the CCMetagen pipeline.**



## FAQs

* Error taxid not found.
  You probably need to update your local ETE3 database, which contains the taxids and lineage information:
```
python
from ete3 import NCBITaxa
ncbi = NCBITaxa()
ncbi.update_taxonomy_database()
quit()
```

* TypeError: concat() got an unexpected keyword argument 'sort'.
  If you get this error, please update the python module pandas:
```
pip install pandas --upgrade --user
```

* WARNING: no NCBI's taxid found for accession [something], this match will not get taxonomic ranks

  This is not an error. It is a warning indicating that one of your query sequences matches a GenBank record for which the NCBI taxonomic identifier (taxid) is not known. CCMetagen therefore will not be able to assign taxonomic ranks to this match, but you will still be able to see it in the output file.

* KeyError: "['Superkingdom' 'Kingdom' 'Phylum' 'Class' 'Order' 'Family' .... ] not in index"
  Make sure that the output of CCMetagen ends in '.csv'.

* The results of the CCMetagen_merge.py at different taxonomic levels do not sum up.
  As explained above, this script merges all unclassified taxa at a given taxonomic level. For example, if you have 20 matches to the genus _Candida_, but only 2 matches were classified at the species level, the output of CCMetagen_merge.py -t Species (default) will only have the abundances of two classified _Candida_ species, while the others will be merged with the "Unclassified" taxa. The output of CCMetagen_merge.py -t Genus however will contain all 20 matches. 
  If this behaviour is undesirable, one option is to disable the similarity thresholds (use flag -off) - so that all taxonomic levels are reported regardless of their similarity to the reference sequence. Alternatively, you can cluster species at the 'Closest_match' (using the flag --tax_level Closest_match).
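
The behaviour described in this FAQ entry can be reproduced with a small sketch. It is not the code of CCMetagen_merge: the function name and the row layout are hypothetical, and the label "Unclassified" follows the wording above.

```python
from collections import defaultdict


def merge_by_rank(matches, rank):
    """Sum depths per taxon at `rank`; matches without a name at that
    rank ('none') are pooled under 'Unclassified'."""
    totals = defaultdict(float)
    for match in matches:
        name = match.get(rank, "none")
        key = "Unclassified" if name in ("none", "") else name
        totals[key] += match["depth"]
    return dict(totals)


# Three hypothetical Candida matches, only one classified to species.
matches = [
    {"Genus": "Candida", "Species": "Candida albicans", "depth": 2.0},
    {"Genus": "Candida", "Species": "none", "depth": 1.5},
    {"Genus": "Candida", "Species": "none", "depth": 0.5},
]
by_species = merge_by_rank(matches, "Species")
by_genus = merge_by_rank(matches, "Genus")
```

Here the species-level table attributes 2.0 to _Candida albicans_ and 2.0 to "Unclassified", while the genus-level table attributes the full 4.0 to _Candida_, which is why the named taxa in the two tables do not add up to the same total.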


## Complete option list

CCMetagen:
```
usage: CCMetagen.py [-h] [-m MODE] -i RES_FP [-o OUTPUT_FP]
                    [-r REFERENCE_DATABASE] [-ef EXTENDED_OUTPUT_FILE]
                    [-du DEPTH_UNIT] [-map MAPSTAT] [-d DEPTH] [-c COVERAGE]
                    [-q QUERY_IDENTITY] [-p PVALUE] [-st SPECIES_THRESHOLD]
                    [-gt GENUS_THRESHOLD] [-ft FAMILY_THRESHOLD]
                    [-ot ORDER_THRESHOLD] [-ct CLASS_THRESHOLD]
                    [-pt PHYLUM_THRESHOLD] [-off TURN_OFF_SIM_THRESHOLDS]
                    [--version]

optional arguments:
  -h, --help            show this help message and exit
  -m MODE, --mode MODE  what do you want CCMetagen to do? Valid options are
                        'visual', 'text' or 'both': text: parses kma, filters
                        based on quality and output a text file with taxonomic
                        information and detailed mapping information visual:
                        parses kma, filters based on quality and output a
                        simplified text file and a krona html file for
                        visualization both: outputs both text and visual file
                        formats. Default = both
  -i RES_FP, --res_fp RES_FP
                        Path to the KMA result (.res file)
  -o OUTPUT_FP, --output_fp OUTPUT_FP
                        Path to the output file. Default = CCMetagen_out
  -r REFERENCE_DATABASE, --reference_database REFERENCE_DATABASE
                        Which reference database was used. Options: UNITE,
                        RefSeq or nt. Default = nt
  -ef EXTENDED_OUTPUT_FILE, --extended_output_file EXTENDED_OUTPUT_FILE
                        Produce an extended output file that includes the
                        percentage of classified reads. Options: y or n. To
                        use this featire, you need to generate the mapstat
                        file when required unning KMA (use flag -ef), and use
                        it as input in CCMetagen (flag --mapstat). Default = n
  -du DEPTH_UNIT, --depth_unit DEPTH_UNIT
                        Desired unit for Depth(abundance) measurements.
                        Default = kma (KMA default depth, which is the number
                        of nucleotides overlapping each template, divided by
                        the lengh of the template). Alternatively, you can
                        have abundance calculated in Reads Per Million (RPM,
                        option 'rpm'), in number of nucleotides overlaping the
                        template (option 'nc') or in number of fragments (i.e.
                        PE reads, option 'fr'). If you use the 'nc', 'rpm' or
                        'fr' options, remember to change the default --depth
                        parameter accordingly. Valid options are nc, rpm, fr
                        and kma
  -map MAPSTAT, --mapstat MAPSTAT
                        Path to the mapstat file produced with KMA when using
                        the -ef flag (.mapstat). Required when calculating
                        abundances in RPM or in number of fragments, or when
                        producing the extended_output_file
  -d DEPTH, --depth DEPTH
                        minimum sequencing depth. Default = 0.2. The unit
                        corresponds to the one used with --depth_unit If you
                        use --depth_unit different from the default, change
                        this accordingly.
  -c COVERAGE, --coverage COVERAGE
                        Minimum coverage. Default = 20 (i.e. 20% of the
                        reference sequence)
  -q QUERY_IDENTITY, --query_identity QUERY_IDENTITY
                        Minimum query identity (Phylum level). Default = 50
  -p PVALUE, --pvalue PVALUE
                        Minimum p-value. Default = 0.05.
  -st SPECIES_THRESHOLD, --species_threshold SPECIES_THRESHOLD
                        Species-level similarity threshold. Default = 98.41
  -gt GENUS_THRESHOLD, --genus_threshold GENUS_THRESHOLD
                        Genus-level similarity threshold. Default = 96.31
  -ft FAMILY_THRESHOLD, --family_threshold FAMILY_THRESHOLD
                        Family-level similarity threshold. Default = 88.51
  -ot ORDER_THRESHOLD, --order_threshold ORDER_THRESHOLD
                        Order-level similarity threshold. Default = 81.21
  -ct CLASS_THRESHOLD, --class_threshold CLASS_THRESHOLD
                        Class-level similarity threshold. Default = 80.91
  -pt PHYLUM_THRESHOLD, --phylum_threshold PHYLUM_THRESHOLD
                        Phylum-level similarity threshold. Default = 0 - not
                        applied
  -off TURN_OFF_SIM_THRESHOLDS, --turn_off_sim_thresholds TURN_OFF_SIM_THRESHOLDS
                        Turns simularity-based filtering off. Options = y or
                        n. Default = n
  --version             show program's version number and exit
 ```


            

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/vrmarcelino/CCMetagen.git",
    "name": "CCMetagen",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.9",
    "maintainer_email": null,
    "keywords": "bioinformatics taxonomy metagenomic classifier KMA",
    "author": "Vanessa R. Marcelino Jan P. Buchmann Andreas Sj\u00f6din Philip T.L.C. Clausen",
    "author_email": "vrmarcelino@gmail.com",
    "download_url": "https://files.pythonhosted.org/packages/1d/cf/0eca947dc02c3646bbefe77097226cb034f3c255ff056a6361bbb7dcd96c/CCMetagen-1.5.0.tar.gz",
    "platform": null,
    "description": "# CCMetagen\n\nCCMetagen processes sequence alignments produced with [KMA](https://bitbucket.org/genomicepidemiology/kma), which implements the ConClave sorting scheme to achieve highly accurate read mappings. The pipeline is fast enough to use the whole NCBI nt collection as reference, facilitating the inclusion of understudied organisms, such as microbial eukaryotes, in metagenome surveys. CCMetagen produces ranked taxonomic results in user-friendly formats that are ready for publication or downstream statistical analyses.\n\nIf you this tool, please cite CCMetagen and KMA:\n\n  * [Marcelino VR, Clausen PT, Buchman J, Wille M, Iredell JR, Meyer W, Lund O, Sorrell T, Holmes EC. 2019. CCMetagen: comprehensive and accurate identification of eukaryotes and prokaryotes in metagenomic data. Genome Biology. 2020 Dec;21(1):1-5.](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-020-02014-2)\n\n  * [Clausen PT, Aarestrup FM, Lund O. 2018. Rapid and precise alignment of raw reads against redundant databases with KMA. BMC bioinformatics. 2018 Dec;19(1):307.](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-018-2336-6)\n\nBesides the guidelines below, we also provide a tutorial to reproduce our metagenome clasisfication analyses of the microbiome of wild birds [here](https://github.com/vrmarcelino/CCMetagen/tree/master/docs/tutorial).\n\nThe guidelines below will guide you in using the command-line version of the CCMetagen pipeline.\n\nCCMetagen is also available as a web service at https://cge.food.dtu.dk/services/CCMetagen/.\nNote that we recommend using this command-line version to analyze data exceeding 1.5Gb.\n\n\n## Installation\n\nWe recommend installing CCMetagen through [conda](https://docs.conda.io/en/latest/). 
This will install CCMetagen along with all of its required dependencies.\n\nAfter installing conda, you can create a new environment with CCMetagen through the following command:\n\n```\nconda create -n ccmetagen ccmetagen -c bioconda -c conda-forge\n```\n\nYou can then activate your environment with:\n\n```\nconda activate ccmetagen\n```\n\nThe `-n ccmetagen` flag will name the environment as `ccmetagen`, but you can choose any different name that you'd like. The `ccmetagen` after that specifies that you'd like the CCMetagen package installed into that environment.\nFinally, the `-c bioconda -c conda-forge` specifies that you'd like to use the Bioconda and Conda-Forge channels, which host CCMetagen and its dependencies.\n\nYou can also install CCMetagen from source, or using pip, the Python package manager. For that, follow the installation instructions in the deprecated README file in the [docs](https://github.com/vrmarcelino/CCMetagen/tree/master/docs/old_repository_README.md).\n\nCheck your CCMetagen installation by running `CCMetagen.py --version` on your command-line.\n\n\n## Databases\n\nAfter installing CCMetagen, you will need a reference database to perform taxonomic classification. There are two ways to obtain this:\n\n**Option 1** Download the indexed (ready-to-go) nt from [here](https://melbourne.figshare.com/projects/CCMetagen_databases/195755).\nDownload the ncbi_nt_kma file (103GB zipped file) or the RefSeq_bf.zip (90GB zipped file)\nUnzip the database, e.g.: `unzip ncbi_nt_kma`.\nThe nt database contains the whole in NCBI nucleotide collection (from 2019, updated database to be released soon!), and therefore is suitable to identify a range of microorganisms, including prokaryotes and eukaryotes.\n\nThere are two versions of the nt database, the one previously mentioned, and another one that does not contain environemntal or artificial sequences. 
The file ncbi_nt_no_env_11jun2019.zip contains all ncbi nt entries excluding the descendants of environmental eukaryotes (taxid 61964), environmental prokaryotes (48479), unclassified sequences (12908) and artificial sequences (28384).\n\n**Option 2:** Build your own reference database (recommended!)\nFollow the instructions in the [KMA website](https://bitbucket.org/genomicepidemiology/kma) to index the database.\nIt is important that taxids are incorporated in sequence headers for processing with CCMetagen. Sequence headers should look like \n`>1234|sequence_description`, where 1234 is the taxid. \nWe provide scripts to rename sequences in the nt database [here](https://github.com/vrmarcelino/CCMetagen/tree/master/docs/benchmarking/rename_nt).\n\nIf you want to use the RefSeq database, the format is similar to the one required for Kraken. The [Opiniomics blog](http://www.opiniomics.org/building-a-kraken-database-with-new-ftp-structure-and-no-gi-numbers/) describes how to download sequences in an adequate format. 
Note that you still need to build the index with KMA: `kma_index -i refseq.fna -o refseq_indexed -NI -Sparse -` or `kma_index -i refseq.fna -o refseq_indexed -NI -Sparse TG` for faster analysis.\n\n\n## Quick Start\n\n  * First map sequence reads (or contigs) to the database with **KMA**.\n\nFor paired-end files:\n\n```\nkma -ipe $SAMPLE_R1 $SAMPLE_R2 -o sample_out_kma -t_db $db -t $th -1t1 -mem_mode -and -apm f\n```\n\nFor single-end files:\n```\nkma -i $SAMPLE -o sample_out_kma -t_db $db -t $th -1t1 -mem_mode -and\n```\n\nIf you want to calculate abundance in reads per million (RPM) or in number of reads (fragments), or if you want to calculate the proportion of mapped reads, add the flag -ef (extended features):\n```\nkma -ipe $SAMPLE_R1 $SAMPLE_R2 -o sample_out_kma -t_db $db -t $th -1t1 -mem_mode -and -apm f -ef\n```\n\nWhere:\n\n- `$db` is the path to the reference database\n- `$th` is the number of threads\n- `$SAMPLE_R1` is the path to the mate1 of a paired-end metagenome/metatranscriptome sample (fastq or fasta)\n- `$SAMPLE_R2` is the path to the mate2 of a paired-end metagenome/metatranscriptome sample (fastq or fasta)\n- `$SAMPLE` is the path to a single-end metagenome/metatranscriptome file (reads or contigs)\n\nThen run `CCMetagen.py`:\n\n```\nCCMetagen.py -i $sample_out_kma.res -o results\n```\n\nWhere $sample_out_kma.res is alignment results produced by KMA.\n\nNote that if you are running CCMetagen from the local folder (instead of adding it to your path), you may need to add 'python' before CCMetagen: `python CCMetagen.py -i $sample_out_kma.res -o results`\n\nDone! 
This will make an additional quality filter and output a text file with ranked taxonomic classifications and a krona graph file for interactive visualization.\n\nAn example of the CCMetagen output can be found [here (.csv file)](https://github.com/vrmarcelino/CCMetagen/blob/master/docs/tutorial/figs_tutorial/Turnstone_Temperate_Flu_Ng.res.csv) and [here (.html file)](https://htmlpreview.github.io/?https://github.com/vrmarcelino/CCMetagen/blob/master/docs/tutorial/figs_tutorial/Turnstone_Temperate_Flu_Ng.res.html).\n\n<img src=docs/tutorial/figs_tutorial/krona_photo.png width=\"500\" height=\"419.64\">\n\nIn the .csv file, you will find the depth (abundance) of each match.\n\n## Abundance units\n\n**Depth can be estimated in four ways:** \n\n1. By applying an additional correction for template length (default in KMA and CCMetagen);\n2. By counting the number of nucleotides matching the reference sequence (use flag --depth_unit nc);\n3. By calculating depth in Reads Per Million (RPM, use flag --depth_unit rpm); or\n4. By counting the number of fragments (i.e. number of PE reads matching to teh reference sequence, use flag --depth_unit fr). If you want RPM or fragment units, you will need to suply the .mapstats file generated with KMA (which you get when running kma with the flag '-ef').\n\n\n## Balancing sensitivity and specificity\n\nYou can **adjust the stringency of the taxonomic assignments** by adjusting the minimum coverage (--coverage), the minimum abundance (--depth), and the minimum level of sequence similarity (--query_identity). Coverage is the percentage of bases in the reference sequence that is covered by the consensus sequence (your query), it can be over 100% when the consensus sequence is larger than the reference (due to insertions for example). 
You can also adjust the KMA settings to facilitate the identification of more distant-related taxa (see below)\n\nIf you change the default depth unit, we recommend adjusting the minimum abundance (--depth) to remove taxa found in low abundance accordingly. For example, you can use -d 200 (200 nucleotides) when using --depth_unit nc, which is similar to -d 0.2 when using the default '--depth_unit kma' option. If you choose to calculate abundances in RPM, you may want to adjust the minimum abundance according to your sequence depth.\nFor example, to calculate abundances in RPM, and filter out all matches with less than one read per million:\n\n```\nCCMetagen.py -i $sample_out_kma.res -o results -map $sample_out_kma.mapstat --depth_unit rpm --depth 1\n```\n\nIf you would like to know the **proportion of reads mapped** to each template, run kma with the '-ef' flag. This will generate a file with the '.mapstat' extension. Then provide this file to CCMetagen (-map $sample_out_kma.mapstat) and add the flag '-ef y':\n\n```\nCCMetagen.py -i $sample_out_kma.res -o results -map $sample_out_kma.mapstat -ef y\n```\nThis will filter the .mapstat file, removing the templates that did not pass CCMetagen's quality control, will add the percentage of mapped reads for each template and will output a file with extension 'stats_csv'. It will also output the overall proportion of reads mapped to these templates in the terminal. For more details about the additional columns of this file, please check [KMA's manual](https://bitbucket.org/genomicepidemiology/kma/src/master/KMAspecification.pdf).\n\nWhen working with highly complex environemnts for which reference databases are scarce (e.g. many soil and marine metagenomes), it is common to obtain a low proportion of classified reads, especially if the sequencing depth is low. 
For a more sensitive analysis, besides relaxing the CCMetatgen settings, you can adjust the KMA aligner settings, by for example: removing the `-and` and the `-apm f` flags, so that you can get a match even when the reference sequences are not significantly overrepresented or when only one of the PE reads maps to the template. Check the [KMA manual](https://bitbucket.org/genomicepidemiology/kma/src/master/KMAspecification.pdf) for more details. It can also be useful to build a customized reference database with additional genomes of organisms that are closely related to what you expect to find in your samples.\n\n## Understanding the ranked taxonomic output of CCMetagen:\nThe taxonomic classifications reflect the sequence similarity between query and reference sequences, according to default or user-defined similarity thresholds. For example, if a match is 97% similar to the reference sequence, the match will not get a species-level classification. If the match is 85% similar to the reference sequence, then the species, genus and family-level classifications will be 'none'.\nNote that this is different from identifications tagged as unk_x (unknown taxa). These unknowns indicate taxa where higher-rank classifications have not been defined (according to the NCBI taxonomy database), and it is unrelated to sequence similarity.\n\n\nFor a list of options to customize your analyze, type:\n```\nCCMetagen.py -h\n```\n\n  * **To get the abundance of each taxon, and/or summarize results for multiple samples, use CCMetagen_merge**:\n```\nCCMetagen_merge.py -i $CCMetagen_out\n```\n\nWhere `$CCMetagen_out` is the folder containing the CCMetagen taxonomic classifications.\nThe results must be in .csv format (default or '--mode text' output of CCMetagen), and these files **must end in \".ccm.csv\"**.\n\nThe flag '-t' define the taxonomic level to merge the results. 
The default is species-level.\n\nYou can also filter out specific taxa, at any taxonomic level:\n\nUse flag -kr to keep (k) or remove (r) taxa.\nUse flag -l to set the taxonomic level for the filtering.\nUse flag -tlist to list the taxa to keep or remove (separated by commas, without spaces).\n\nEX1: Filter out bacteria: `CCMetagen_merge.py -i $CCMetagen_out -kr r -l Kingdom -tlist Bacteria`\n\nEX2: Filter out bacteria and Metazoa: `CCMetagen_merge.py -i $CCMetagen_out -kr r -l Kingdom -tlist Bacteria,Metazoa`\n\nEX3: Merge results at family-level, and remove Metazoa and Viridiplantae taxa at Kingdom level:\n```\nCCMetagen_merge.py -i $CCMetagen_out -t Family -kr r -l Kingdom -tlist Metazoa,Viridiplantae -o family_table\n```\n\nFor species-level filtering (where there is a space in taxon names), use quotation marks.\nEX4: Keep only _Escherichia coli_ and _Candida albicans_:\n```\nCCMetagen_merge.py -i 05_KMetagen/ -kr k -l Species -tlist \"Escherichia coli,Candida albicans\"\n```\n\nIf you only have one sample, you can also use CCMetagen_merge to get one line per taxon.\n\nTo see all options, type:\n```\nCCMetagen_merge.py -h\n```\nThe output file of CCMetagen_merge should look like [this](https://github.com/vrmarcelino/CCMetagen/blob/master/tutorial/figs_tutorial/Bird_family_table_filtered.csv).\n\n\n* **To extract sequences of a given taxon, use CCMetagen_extract_seqs**:\n\nThis script will produce a fasta file containing all reads assigned to a taxon of interest. \nEx: Generate a fasta file containing all sequences that mapped to the genus _Escherichia_:\n```\nCCMetagen_extract_seqs.py -iccm $CCMetagen_out -ifrag $sample_out_kma.frag -l Genus -t Escherichia\n```\n\nWhere `$CCMetagen_out` is the .csv file generated with CCMetagen and `$sample_out_kma.frag` is the .frag file generated with KMA. 
The frag file must be decompressed first: `gunzip *.frag.gz`\n\nFor species-level filtering (where there is a space in taxon names), use quotation marks.\nEx: Generate a fasta file containing all sequences that mapped to _E. coli_:\n```\nCCMetagen_extract_seqs.py -iccm $CCMetagen_out -ifrag $sample_out_kma.frag -l Species -t \"Escherichia coli\"\n```\n\n\n**Check out our [tutorial](https://github.com/vrmarcelino/CCMetagen/tree/master/docs/tutorial) for an applied example of the CCMetagen pipeline.**\n\n\n\n## FAQs\n\n* Error taxid not found.\n  You probably need to update your local ETE3 database, which contains the taxids and lineage information:\n```\npython\nfrom ete3 import NCBITaxa\nncbi = NCBITaxa()\nncbi.update_taxonomy_database()\nquit()\n```\n\n* TypeError: concat() got an unexpected keyword argument 'sort'.\n  If you get this error, please update the Python module pandas:\n```\npip install pandas --upgrade --user\n```\n\n* WARNING: no NCBI's taxid found for accession [something], this match will not get taxonomic ranks\n\n  This is not an error; it is a warning indicating that one of your query sequences matches a GenBank record for which the NCBI taxonomic identifier (taxid) is not known. CCMetagen therefore will not be able to assign taxonomic ranks to this match, but you will still be able to see it in the output file.\n\n* KeyError: \"['Superkingdom' 'Kingdom' 'Phylum' 'Class' 'Order' 'Family' .... ] not in index\"\n  Make sure that the output of CCMetagen ends in '.csv'.\n\n* The results of CCMetagen_merge.py at different taxonomic levels do not add up.\n  As explained above, this script merges all unclassified taxa at a given taxonomic level. For example, if you have 20 matches to the genus _Candida_, but only 2 matches were classified at the species level, the output of CCMetagen_merge.py -t Species (default) will only have the abundances of two classified _Candida_ species, while the others will be merged with the \"Unclassified\" taxa. 
The output of CCMetagen_merge.py -t Genus however will contain all 20 matches. \n  If this behaviour is undesirable, one option is to disable the similarity thresholds (use flag -off) - so that all taxonomic levels are reported regardless of their similarity to the reference sequence. Alternatively, you can cluster species at the 'Closest_match' (using the flag --tax_level Closest_match).\n\n\n## Complete option list\n\nCCMetagen:\n```\nusage: CCMetagen.py [-h] [-m MODE] -i RES_FP [-o OUTPUT_FP]\n                    [-r REFERENCE_DATABASE] [-ef EXTENDED_OUTPUT_FILE]\n                    [-du DEPTH_UNIT] [-map MAPSTAT] [-d DEPTH] [-c COVERAGE]\n                    [-q QUERY_IDENTITY] [-p PVALUE] [-st SPECIES_THRESHOLD]\n                    [-gt GENUS_THRESHOLD] [-ft FAMILY_THRESHOLD]\n                    [-ot ORDER_THRESHOLD] [-ct CLASS_THRESHOLD]\n                    [-pt PHYLUM_THRESHOLD] [-off TURN_OFF_SIM_THRESHOLDS]\n                    [--version]\n\noptional arguments:\n  -h, --help            show this help message and exit\n  -m MODE, --mode MODE  what do you want CCMetagen to do? Valid options are\n                        'visual', 'text' or 'both': text: parses kma, filters\n                        based on quality and output a text file with taxonomic\n                        information and detailed mapping information visual:\n                        parses kma, filters based on quality and output a\n                        simplified text file and a krona html file for\n                        visualization both: outputs both text and visual file\n                        formats. Default = both\n  -i RES_FP, --res_fp RES_FP\n                        Path to the KMA result (.res file)\n  -o OUTPUT_FP, --output_fp OUTPUT_FP\n                        Path to the output file. Default = CCMetagen_out\n  -r REFERENCE_DATABASE, --reference_database REFERENCE_DATABASE\n                        Which reference database was used. 
Options: UNITE,\n                        RefSeq or nt. Default = nt\n  -ef EXTENDED_OUTPUT_FILE, --extended_output_file EXTENDED_OUTPUT_FILE\n                        Produce an extended output file that includes the\n                        percentage of classified reads. Options: y or n. To\n                        use this featire, you need to generate the mapstat\n                        file when required unning KMA (use flag -ef), and use\n                        it as input in CCMetagen (flag --mapstat). Default = n\n  -du DEPTH_UNIT, --depth_unit DEPTH_UNIT\n                        Desired unit for Depth(abundance) measurements.\n                        Default = kma (KMA default depth, which is the number\n                        of nucleotides overlapping each template, divided by\n                        the lengh of the template). Alternatively, you can\n                        have abundance calculated in Reads Per Million (RPM,\n                        option 'rpm'), in number of nucleotides overlaping the\n                        template (option 'nc') or in number of fragments (i.e.\n                        PE reads, option 'fr'). If you use the 'nc', 'rpm' or\n                        'fr' options, remember to change the default --depth\n                        parameter accordingly. Valid options are nc, rpm, fr\n                        and kma\n  -map MAPSTAT, --mapstat MAPSTAT\n                        Path to the mapstat file produced with KMA when using\n                        the -ef flag (.mapstat). Required when calculating\n                        abundances in RPM or in number of fragments, or when\n                        producing the extended_output_file\n  -d DEPTH, --depth DEPTH\n                        minimum sequencing depth. Default = 0.2. 
The unit\n                        corresponds to the one used with --depth_unit If you\n                        use --depth_unit different from the default, change\n                        this accordingly.\n  -c COVERAGE, --coverage COVERAGE\n                        Minimum coverage. Default = 20 (i.e. 20% of the\n                        reference sequence)\n  -q QUERY_IDENTITY, --query_identity QUERY_IDENTITY\n                        Minimum query identity (Phylum level). Default = 50\n  -p PVALUE, --pvalue PVALUE\n                        Minimum p-value. Default = 0.05.\n  -st SPECIES_THRESHOLD, --species_threshold SPECIES_THRESHOLD\n                        Species-level similarity threshold. Default = 98.41\n  -gt GENUS_THRESHOLD, --genus_threshold GENUS_THRESHOLD\n                        Genus-level similarity threshold. Default = 96.31\n  -ft FAMILY_THRESHOLD, --family_threshold FAMILY_THRESHOLD\n                        Family-level similarity threshold. Default = 88.51\n  -ot ORDER_THRESHOLD, --order_threshold ORDER_THRESHOLD\n                        Order-level similarity threshold. Default = 81.21\n  -ct CLASS_THRESHOLD, --class_threshold CLASS_THRESHOLD\n                        Class-level similarity threshold. Default = 80.91\n  -pt PHYLUM_THRESHOLD, --phylum_threshold PHYLUM_THRESHOLD\n                        Phylum-level similarity threshold. Default = 0 - not\n                        applied\n  -off TURN_OFF_SIM_THRESHOLDS, --turn_off_sim_thresholds TURN_OFF_SIM_THRESHOLDS\n                        Turns simularity-based filtering off. Options = y or\n                        n. Default = n\n  --version             show program's version number and exit\n ```\n\n",
    "bugtrack_url": null,
    "license": "GPL-3.0",
    "summary": "Microbiome classification pipeline",
    "version": "1.5.0",
    "project_urls": {
        "Homepage": "https://github.com/vrmarcelino/CCMetagen.git",
        "Preprint": "https://www.biorxiv.org/content/10.1101/641332v2",
        "Source": "https://github.com/vrmarcelino/CCMetagen.git"
    },
    "split_keywords": [
        "bioinformatics",
        "taxonomy",
        "metagenomic",
        "classifier",
        "kma"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "a73d079d2e17ea13f71162dd58bbcf8cea9bd125acd5f872a2eb889dd8a7ca6b",
                "md5": "12a8753c427716f5f455e94d0cbe7f9d",
                "sha256": "af464e2aab5c67e2fc35b59faff3c17edca29f756576af2932292c57d30b10fc"
            },
            "downloads": -1,
            "filename": "CCMetagen-1.5.0-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "12a8753c427716f5f455e94d0cbe7f9d",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.9",
            "size": 33231,
            "upload_time": "2024-04-08T05:32:09",
            "upload_time_iso_8601": "2024-04-08T05:32:09.531222Z",
            "url": "https://files.pythonhosted.org/packages/a7/3d/079d2e17ea13f71162dd58bbcf8cea9bd125acd5f872a2eb889dd8a7ca6b/CCMetagen-1.5.0-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "1dcf0eca947dc02c3646bbefe77097226cb034f3c255ff056a6361bbb7dcd96c",
                "md5": "c4526a232cfa56c4b0dcd7bd5de7987b",
                "sha256": "1e0a65a6349a09413c322aa3be0fd57f729efb3eb28e271771c587dda2738184"
            },
            "downloads": -1,
            "filename": "CCMetagen-1.5.0.tar.gz",
            "has_sig": false,
            "md5_digest": "c4526a232cfa56c4b0dcd7bd5de7987b",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.9",
            "size": 37896,
            "upload_time": "2024-04-08T05:32:11",
            "upload_time_iso_8601": "2024-04-08T05:32:11.996532Z",
            "url": "https://files.pythonhosted.org/packages/1d/cf/0eca947dc02c3646bbefe77097226cb034f3c255ff056a6361bbb7dcd96c/CCMetagen-1.5.0.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-04-08 05:32:11",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "vrmarcelino",
    "github_project": "CCMetagen",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "lcname": "ccmetagen"
}
        