waafle

Name	waafle
Version	1.1.0
home_page	https://huttenhower.sph.harvard.edu/waafle
Summary	WAAFLE: a Workflow to Annotate Assemblies and Find LGT Events
upload_time	2024-10-10 20:40:14
maintainer	None
docs_url	None
author	Eric Franzosa
requires_python	None
license	MIT
keywords	microbial microbiome bioinformatics microbiology metagenomic metatranscriptomic waafle

# WAAFLE

**WAAFLE** (a **W**orkflow to **A**nnotate **A**ssemblies and **F**ind **L**GT Events) is a method for finding novel LGT (Lateral Gene Transfer) events in assembled metagenomes, including those from human microbiomes. Please direct questions to the [WAAFLE "topic" in the bioBakery Support Forum](https://forum.biobakery.org/c/Microbial-community-profiling/WAAFLE).

## Citation
The WAAFLE manuscript has been submitted!

> Tiffany Y. Hsu*, Etienne Nzabarushimana*, Dennis Wong, Chengwei Luo, Robert G. Beiko, Morgan Langille, Curtis Huttenhower, Long H. Nguyen**, Eric A. Franzosa**. _Profiling novel lateral gene transfer events in the human microbiome_. (Submitted.)
>
> \* = co-lead; \*\* = co-supervised

In the meantime, if you use WAAFLE in your work, please cite the WAAFLE repository on GitHub: https://github.com/biobakery/waafle.

## Installation

Install the WAAFLE software and Python dependencies with `pip`:

```
$ pip install waafle
```

You can also clone the WAAFLE package from GitHub:

```
$ git clone https://github.com/biobakery/waafle.git
```

Or download and unpack the WAAFLE package directly:

```
$ wget https://github.com/biobakery/waafle/archive/master.zip
$ unzip master.zip
```

## Software requirements

* Python 3+ or 2.7+
* Python `numpy` (tested with v1.13.3)
* [NCBI BLAST+](https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=Download) (tested with v2.6.0)
* [bowtie2](http://bowtie-bio.sourceforge.net/bowtie2/index.shtml) (for performing contig-level QC; tested with v2.2.3)

## Database requirements

WAAFLE requires two input databases: 1) a BLAST-formatted **nucleotide sequence database** and 2) a corresponding **taxonomy file**. The versions used in the WAAFLE publication are available for download here:

* [waafledb.tar.gz](http://huttenhower.sph.harvard.edu/waafle_data/waafledb.tar.gz) (4.3 GB)
* [waafledb_taxonomy.tsv](http://huttenhower.sph.harvard.edu/waafle_data/waafledb_taxonomy.tsv) (<1 MB)

## Input data

An individual WAAFLE run requires one or two non-fixed inputs: 1) a file containing **metagenomic contigs** and (optionally) 2) a GFF file describing **gene/ORF coordinates** along those contigs.

### Input contigs (required)

Contigs should be provided as nucleotide sequences in FASTA format. Contigs are expected to have unique, BLAST-compatible headers. WAAFLE is optimized for working with fragmentary contigs from partially assembled metagenomes (spanning 2-20 genes, or roughly 1-20 kb). WAAFLE is not optimized to work with extremely long contigs (100s of kbs), scaffolds, or closed genomes. The WAAFLE developers recommend [MEGAHIT](https://github.com/voutcn/megahit) as a general-purpose metagenomic assembler.

* [A sample contigs input file](https://raw.githubusercontent.com/biobakery/waafle/refs/heads/main/demo/input/demo_contigs.fna)
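
Before starting a run, it can help to confirm that the contig headers are unique and to see how contig lengths compare with the range WAAFLE is optimized for. The following is a minimal sketch, not part of WAAFLE: the function name `summarize_contigs` is hypothetical, and treating the first whitespace-delimited token of each header as the contig ID is an assumption.

```python
from collections import Counter

def summarize_contigs(fasta_lines):
    """Return ({contig_id: length}, [duplicated ids]) from FASTA text lines."""
    lengths = {}
    seen = Counter()
    current = None
    for line in fasta_lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Assumption: the contig ID is the first token after ">"
            tokens = line[1:].split()
            current = tokens[0] if tokens else ""
            seen[current] += 1
            lengths.setdefault(current, 0)
        elif current is not None:
            # Sequence line: add its length to the current contig
            lengths[current] += len(line)
    duplicates = sorted(name for name, count in seen.items() if count > 1)
    return lengths, duplicates

# Toy input with one repeated header (contig_1)
demo = [">contig_1 len=8", "ACGTACGT", ">contig_2", "ACGT", "AC", ">contig_1", "GG"]
lengths, duplicates = summarize_contigs(demo)
print(lengths)
print(duplicates)
```

For a real file, pass an open file handle (for example `open("contigs.fna")`) in place of the `demo` list. Any ID reported in `duplicates` should be renamed before searching.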

### Input ORF calls (optional)

The optional GFF file, if provided, should conform to the [GFF format](https://useast.ensembl.org/info/website/upload/gff.html). The WAAFLE developers recommend [Prodigal](https://github.com/hyattpd/Prodigal) as a general-purpose ORF caller with GFF output. WAAFLE can alternatively generate a GFF file as part of its normal operation ([as described below](#step-2-gene-calling-with-waafle_genecaller)).

* [A sample GFF file produced by WAAFLE](https://github.com/biobakery/waafle/blob/master/demo/output/demo_contigs.gff)
* [A sample GFF file produced by Prodigal](https://github.com/biobakery/waafle/blob/master/demo/output_prodigal/demo_contigs.prodigal.gff)

## Performing a WAAFLE analysis

Analyzing a set of contigs with WAAFLE requires performing four steps: 1) subjecting the contigs to homology-based search, 2) identifying genes / open reading frames (ORFs) along the contigs, 3) combining the results of Steps 1 and 2 to identify candidate LGT events, and 4) performing contig-level quality control (QC) to weed out misassembled contigs. All of these steps can be performed using WAAFLE utilities. Steps 2 and 4 can optionally be performed outside of WAAFLE using other methods.

### Step 1: Homology-based search with `waafle_search`

`waafle_search` is a light wrapper around `blastn` to help guide the nucleotide-level search of your metagenomic contigs against a WAAFLE-formatted database (for example, it ensures that all of the non-default BLAST output fields required for downstream processing are generated).

A sample call to `waafle_search` with input contigs `contigs.fna` and a BLAST database located in `waafledb` would be:

```
$ waafle_search contigs.fna waafledb/waafledb
```

By default, this produces an output file `contigs.blastout` in the same location as the input contigs. See the `--help` menu for additional configuration options.

### Step 2: Gene calling with `waafle_genecaller`

WAAFLE can identify gene coordinates of interest directly from the BLAST output produced in the previous step:

`$ waafle_genecaller contigs.blastout`

This produces a file in GFF format called `contigs.gff` for use in the next WAAFLE step. See the `--help` menu for additional configuration options.

[As described above](#input-orf-calls-optional), a user can alternatively provide a GFF file for their contigs that was generated outside of WAAFLE using another method, e.g. [Prodigal](https://github.com/hyattpd/Prodigal).

### Step 3: Identify candidate LGT events with `waafle_orgscorer`

The next and most critical step of a WAAFLE analysis is combining the BLAST output generated in Step 1 with a GFF file (generated in Step 2 or with an external method) to 1) taxonomically score genes along the length of the input contigs and then 2) identify those contigs as derived from a single clade or a pair of clades (i.e. putative LGT). Assuming you have run Steps 1 and 2 as described above, a sample call to `waafle_orgscorer` would be:

```
$ waafle_orgscorer \
contigs.fna \
contigs.blastout \
contigs.gff \
taxonomy.tsv
```

This will produce three output files which divide and describe your contigs as putative LGT contigs, single-clade (no-LGT) contigs, and unclassified contigs (e.g. those containing no genes):

* `contigs.lgt.tsv`
* `contigs.no_lgt.tsv`
* `contigs.unclassified.tsv`

These files and their formats are described in more detail below (see "WAAFLE outputs").

`waafle_orgscorer` offers many options for fine-tuning your analysis. The various analysis parameters have been pre-optimized for maximum specificity on both short contigs (containing as few as two partial genes) and longer contigs (10s of genes). These options are detailed in the `--help` menu.
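
Steps 1-3 chain together through their default output names, which can be sketched as a small helper that assembles the three command lines. This is an illustration only: the helper `waafle_commands` and the database path `path/to/blastdb` are hypothetical, and the sketch uses only the positional arguments shown above and assumes the default output names (`contigs.blastout`, `contigs.gff`).

```python
import os
import shlex

def waafle_commands(contigs, blast_db, taxonomy):
    """Build the argument lists for Steps 1-3 using the default output names."""
    stem = os.path.splitext(contigs)[0]
    blastout = stem + ".blastout"  # default output of waafle_search
    gff = stem + ".gff"            # default output of waafle_genecaller
    return [
        ["waafle_search", contigs, blast_db],
        ["waafle_genecaller", blastout],
        ["waafle_orgscorer", contigs, blastout, gff, taxonomy],
    ]

commands = waafle_commands("contigs.fna", "path/to/blastdb", "taxonomy.tsv")
for command in commands:
    # Print a shell-safe rendering of each command; nothing is executed here
    print(" ".join(shlex.quote(arg) for arg in command))
```

Each list could be passed to `subprocess.run(command, check=True)` to execute the steps in order, provided the WAAFLE scripts are on your `PATH`.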

### Step 4: Filter out misassembled contigs

While WAAFLE's method of LGT detection has been optimized to distinguish LGT from other biological events (e.g. gene deletion), it does so assuming that the input contigs are biologically valid. This is notable as misassembled contigs, in particular chimeric assembly of genetic material from 2+ organisms, can spuriously manifest as LGT. It is therefore critical to filter out low-quality contigs before performing further downstream analyses on WAAFLE outputs.

WAAFLE provides a set of utilities that can aid in the process of performing contig-level QC / filtering using the outputs of Steps 1-3. This method is [described below](#contig-level-quality-control). Alternatively, a user can perform contig-level QC (and remove/correct erroneous contigs) before starting their WAAFLE analysis using an external method, e.g. [metaMIC](https://github.com/ZhaoXM-Lab/metaMIC).

Notably, all contigs included in the demo files linked from this manual passed the QC checks imposed by the WAAFLE QC utilities referenced above.

## WAAFLE outputs

The `contigs.lgt.tsv` output file lists the details of putative LGT contigs. Its fields are a superset of the types of fields included in the other output files. The following represents the first two lines/rows of a `contigs.lgt.tsv` file *transposed* such that the first line (column headers) is shown as the first column and the details of the first LGT contig (second row) are shown as the second column:

```
CONTIG_NAME 12571
CALL lgt
CONTIG_LENGTH 9250
MIN_MAX_SCORE 0.856
AVG_MAX_SCORE 0.965
SYNTENY AABAAAA
DIRECTION B>A
CLADE_A s__Ruminococcus_bromii
CLADE_B s__Faecalibacterium_prausnitzii
LCA f__Ruminococcaceae
MELDED_A --
MELDED_B --
TAXONOMY_A r__Root|k__Bacteria|p__Firmicutes|...|g__Ruminococcus|s__Ruminococcus_bromii
TAXONOMY_B r__Root|k__Bacteria|p__Firmicutes|...|g__Faecalibacterium|s__Faecalibacterium_prausnitzii
LOCI 252:668:-|792:1367:-|1557:2360:-|2724:3596:-|4540:5592:+|5608:7977:+|8180:8425:+
ANNOTATIONS:UNIPROT R5E4K6|D4L7I2|D4JXM0|D4L7I1|D4L7I0|None|D4L7H8
```

The fields in detail:

* **`CONTIG_NAME`**: the name of the contig from the input FASTA file.
* **`CALL`**: indicates that this was an LGT contig.
* **`CONTIG_LENGTH`**: the length of the contig in nucleotides.
* **`MIN_MAX_SCORE`**: the minimum score for the pair of clades explaining the contig along the length of the contig. (i.e. the score for identifying this as a putative LGT contig, with a default threshold of 0.8.)
* **`AVG_MAX_SCORE`**: the average score for the pair of clades explaining the contig (used for ranking multiple potential explanations of the contig).
* **`SYNTENY`**: the pattern of genes assigned to the A or B clades along the contig. `*` indicates that the gene could be contributed by either clade; `~` indicates an ignored gene; `!` indicates a problem (should not occur).
* **`DIRECTION`**: indicates this as a putative B-to-A transfer event, as determined from synteny (A genes flank the inserted B gene). `A?B` indicates that the direction was uncertain.
* **`CLADE_A`**: the name of clade A.
* **`CLADE_B`**: the name of clade B.
* **`LCA`**: the lowest common ancestor of clades A and B. A higher-level LCA indicates a more remote LGT event.
* **`MELDED_A`**: if using a meld reporting option, the individual melded clades will be listed here. For example, if a contig could be explained by a transfer from *Genus_1 species_1.1* to either *Genus_2 species_2.1* or *Genus_2 species_2.2*, this field would list `species_2.1; species_2.2` and *Genus_2* would be given as `CLADE_A`.
* **`MELDED_B`**: *see above*.
* **`TAXONOMY_A`**: the full taxonomy of `CLADE_A`.
* **`TAXONOMY_B`**: the full taxonomy of `CLADE_B`.
* **`LOCI`**: coordinates of the loci (genes) that were considered for this contig in format `START:STOP:STRAND`.
* **`ANNOTATIONS:UNIPROT`**: indicates that UniProt annotations were provided for the genes in the input sequence database (in format `UNIPROT=IDENTIFIER`). The best-scoring UniProt annotation for each gene is given here. (Additional annotations would appear as additional, similarly-formatted columns in the output.)
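
The per-gene fields (`SYNTENY`, `LOCI`, `ANNOTATIONS:UNIPROT`) can be lined up to find which genes were assigned to clade B. The sketch below is a hedged illustration rather than a WAAFLE utility: `parse_lgt_record` is a hypothetical name, only a subset of columns is used, and it assumes one `SYNTENY` character per locus, as in the example above.

```python
def parse_lgt_record(header_line, data_line):
    """Pair column names with values and split the per-gene fields."""
    names = header_line.rstrip("\n").split("\t")
    values = data_line.rstrip("\n").split("\t")
    record = dict(zip(names, values))
    loci = record["LOCI"].split("|")
    annotations = record["ANNOTATIONS:UNIPROT"].split("|")
    genes = []
    # Assumption: SYNTENY has exactly one character per locus
    for clade, locus, annotation in zip(record["SYNTENY"], loci, annotations):
        start, stop, strand = locus.split(":")
        genes.append({"clade": clade, "start": int(start), "stop": int(stop),
                      "strand": strand, "uniprot": annotation})
    record["GENES"] = genes
    return record

# A subset of the columns from the example record above
columns = ["CONTIG_NAME", "SYNTENY", "DIRECTION", "LOCI", "ANNOTATIONS:UNIPROT"]
values = ["12571", "AABAAAA", "B>A",
          "252:668:-|792:1367:-|1557:2360:-|2724:3596:-|4540:5592:+|5608:7977:+|8180:8425:+",
          "R5E4K6|D4L7I2|D4JXM0|D4L7I1|D4L7I0|None|D4L7H8"]
record = parse_lgt_record("\t".join(columns), "\t".join(values))
transferred = [gene for gene in record["GENES"] if gene["clade"] == "B"]
print(transferred)
```

For the example contig, this picks out the single clade-B gene at positions 1557 to 2360 on the minus strand.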

## Contig-level quality control

WAAFLE bundles two utilities, `waafle_junctions` and `waafle_qc`, that can be used to aid in the identification and removal of low-quality contigs.

### Quantifying junction support with `waafle_junctions`

`waafle_junctions` maps reads to contigs to quantify support for individual gene-gene junctions. Specifically, two forms of support are considered/computed:

1. Sequencing fragments (paired reads) that span the gene-gene junction. These are less common for junctions that are larger than typical sequencing insert sizes (~300 nts).

2. Relative junction coverage compared to the mean coverage of the two flanking genes. If the junction is not well covered relative to its flanking genes, it may represent a non-biological overlap.

A sample call to `waafle_junctions` looks like:

```
$ waafle_junctions \
contigs.fna \
contigs.gff \
--reads1 contigs_reads.1.fq \
--reads2 contigs_reads.2.fq
```

With this call, `waafle_junctions` will use [bowtie2](http://bowtie-bio.sourceforge.net/bowtie2/index.shtml) to index the contigs and then align the input reads (pairwise) against the index to produce a SAM file. (`waafle_junctions` can also interpret a mapping from an existing SAM file.) The alignment results are then interpreted to score individual junctions, producing a junction report file:

* `contigs.junctions.tsv`

A sample report for an individual junction looks like:

```
CONTIG SRS011086_k119_10006
GENE1 1:1363:-
GENE2 1451:2382:-
LEN_GENE1 1363
LEN_GENE2 932
GAP 87
JUNCTION_HITS 5
COVERAGE_GENE1 8.1079
COVERAGE_GENE2 11.3573
COVERAGE_JUNCTION 10.9101
RATIO 1.1210
```

This report indicates that the junction between genes 1 and 2 (which may or may not be an LGT junction) was well supported: it was spanned by 5 mate-pairs (`JUNCTION_HITS=5`) and its coverage was comparable to the mean coverage of its flanking genes (`RATIO=1.12`; gene coverages of 8.11 and 11.4).
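
The derived fields in this report can be reproduced from the gene coordinates and coverage values. The sketch below is a worked check of the arithmetic, not WAAFLE's own code: the function names are hypothetical, and the formulas (1-based inclusive coordinates, ratio of junction coverage to the mean of the two flanking genes) are inferred from the sample report.

```python
def gene_length(start, stop):
    """Length of a gene given 1-based, inclusive coordinates."""
    return stop - start + 1

def junction_gap(stop_gene1, start_gene2):
    """Number of nucleotides strictly between two adjacent genes."""
    return start_gene2 - stop_gene1 - 1

def junction_ratio(cov_gene1, cov_gene2, cov_junction):
    """Junction coverage relative to the mean coverage of the flanking genes."""
    flank_mean = (cov_gene1 + cov_gene2) / 2.0
    return cov_junction / flank_mean

# Values taken from the sample junction report above
len_gene1 = gene_length(1, 1363)
len_gene2 = gene_length(1451, 2382)
gap = junction_gap(1363, 1451)
ratio = junction_ratio(8.1079, 11.3573, 10.9101)
print(len_gene1, len_gene2, gap, round(ratio, 4))
```

The computed values agree with `LEN_GENE1`, `LEN_GENE2`, `GAP`, and `RATIO` in the report. A production version would also need to handle flanking genes with zero coverage.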

`waafle_junctions` can be tuned to produce additional gene- and nucleotide-level quality reports. Consult the `--help` menu for a full list of options.

### Using junction data for contig QC with `waafle_qc`

The `waafle_qc` script interprets the output of `waafle_junctions` to remove contigs with weak read support at one or more junctions. Currently, the script focuses on the junctions flanking LGT'ed genes among putative LGT-containing contigs.

A sample call to `waafle_qc` looks like:

```
$ waafle_qc contigs.lgt.tsv contigs.junctions.tsv
```

Where `contigs.junctions.tsv` is the output of `waafle_junctions` on this set of contigs and its underlying reads. This produces a file `contigs.lgt.tsv.qc_pass`: a subset of the original LGT calls that were supported by read-level evidence.

By default, a junction is considered supported if it was spanned by 2+ mate-pairs *or* had >0.5x the average coverage of its two flanking genes. These thresholds are tunable with the `--min-junction-hits` and `--min-junction-ratio` parameters of `waafle_qc`, respectively. Consult the `--help` menu for a full list of options.
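
The default decision rule can be written out as a small predicate. This is a sketch of the rule as described here, not the `waafle_qc` source: the function names are hypothetical, and whether the real thresholds are inclusive or exclusive at the boundary is an assumption.

```python
def junction_supported(hits, ratio, min_hits=2, min_ratio=0.5):
    """True if a junction has enough spanning mate-pairs OR enough relative coverage."""
    # Assumption: hits threshold is inclusive ("2+"), ratio threshold is exclusive (">0.5x")
    return hits >= min_hits or ratio > min_ratio

def contig_passes(junctions, min_hits=2, min_ratio=0.5):
    """A contig passes only if every junction of interest is supported."""
    return all(junction_supported(hits, ratio, min_hits, min_ratio)
               for hits, ratio in junctions)

# (JUNCTION_HITS, RATIO) pairs for the junctions flanking a transferred gene
well_supported = [(5, 1.1210), (0, 0.9)]
weakly_supported = [(5, 1.1210), (1, 0.2)]
print(contig_passes(well_supported))
print(contig_passes(weakly_supported))
```

Because the two criteria are joined by *or*, a long junction with no spanning mate-pairs can still pass on coverage alone, which matters for gaps larger than the sequencing insert size.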

## Advanced topics

### Formatting a sequence database for WAAFLE

WAAFLE performs a nucleotide-level search of metagenomic contigs against a collection of taxonomically annotated protein-coding genes (*not* complete genomes). A common way to build such a database is to combine a collection of microbial pangenomes of interest. The protein-coding genes should be organized in FASTA format and then indexed for use with `blastn`. For example, the FASTA file `waafledb.fnt` would be indexed as:

```
$ makeblastdb -in waafledb.fnt -dbtype nucl
```

WAAFLE expects the input FASTA sequence headers to follow a specific format. At a minimum, the headers must contain a unique sequence ID (`GENE_123` below) followed by `|` (pipe) followed by a taxon name or taxonomic ID (`SPECIES_456` below):

```
>GENE_123|SPECIES_456
```

Headers can contain additional `|`-delimited fields corresponding to functional annotations of the gene. These fields have the format `SYSTEM=IDENTIFIER` and multiple such fields can be included per header, as in:

```
>GENE_123|SPECIES_456|PFAM=P00789|EC=1.2.3.4
```
Headers are allowed to contain different sets of functional annotation fields. WAAFLE currently expects at most one annotation per annotation system per gene; this will be improved in future versions. (We currently recommend annotating genes against the [UniRef90 and UniRef50](http://www.uniprot.org/help/uniref) databases to enable link-outs to more detailed functional annotations in downstream analyses.)

WAAFLE assumes that the taxa listed in the sequence database file are all at the same taxonomic level (for example, all genera or all species or all strains).
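
When building a custom database, the header rules above can be checked with a short parser before running `makeblastdb`. The following is a minimal sketch under the format described in this section; `parse_header` is a hypothetical helper, not a WAAFLE function.

```python
def parse_header(header):
    """Split a WAAFLE-style FASTA header into (gene_id, taxon, annotations)."""
    fields = header.strip().lstrip(">").split("|")
    if len(fields) < 2 or not fields[0] or not fields[1]:
        raise ValueError("header needs at least GENE_ID|TAXON: %r" % header)
    gene_id, taxon = fields[0], fields[1]
    annotations = {}
    for field in fields[2:]:
        # Each extra field is expected to look like SYSTEM=IDENTIFIER
        system, separator, identifier = field.partition("=")
        if not separator or not system or not identifier:
            raise ValueError("malformed annotation field: %r" % field)
        if system in annotations:
            # At most one annotation per system per gene
            raise ValueError("repeated annotation system: %r" % system)
        annotations[system] = identifier
    return gene_id, taxon, annotations

parsed = parse_header(">GENE_123|SPECIES_456|PFAM=P00789|EC=1.2.3.4")
print(parsed)
```

Collecting the set of `taxon` values across all headers also gives the list of names that must appear in the taxonomy file.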

### Formatting a WAAFLE taxonomy file

WAAFLE requires a taxonomy file to understand the taxonomic relationships among the taxa whose genes are included in the sequence database. The taxonomy file is a tab-delimited list of child-parent relationships, for example:

```
k__Animalia r__Root
p__Chordata k__Animalia
c__Mammalia p__Chordata
o__Primates c__Mammalia
f__Hominidae o__Primates
g__Homo f__Hominidae
s__Homo sapiens g__Homo
```

While the format of this file is relatively simple, it has a number of critical structural constraints that must be obeyed:

* All taxon names/IDs used in the sequence database must appear in the taxonomy file.

* The file must contain a root taxon from which all other taxa descend (the root taxon should be named `r__Root`, as above).

* All taxon names/IDs used in the sequence database must be the same distance from the root.

The following properties of the taxonomy file are optional:

* Additional taxa *below* the level of the taxa in the sequence file can be included in the taxonomy file. For example, a species-level sequence database could contain isolate genomes as children of the species-level clades in the taxonomy file. (WAAFLE can use such entries as "leaf support" for LGT events.)

* We recommend prefixing taxonomic clades to indicate their level. For example, `g__Homo` identifies *Homo* as a genus above.
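
The structural constraints can be checked before a run by walking each database taxon up to the root. The sketch below is an illustration and not part of WAAFLE; the function names are hypothetical, and it assumes exactly two tab-separated columns per line (child, then parent).

```python
def load_taxonomy(lines):
    """Read tab-delimited child/parent pairs into a {child: parent} dict."""
    parent = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        child, par = line.split("\t")
        parent[child] = par
    return parent

def depth(taxon, parent, root="r__Root"):
    """Number of steps from taxon up to the root; raises if the root is unreachable."""
    steps, node, seen = 0, taxon, set()
    while node != root:
        if node not in parent or node in seen:
            raise ValueError("%s does not descend from %s" % (taxon, root))
        seen.add(node)
        node = parent[node]
        steps += 1
    return steps

def check_database_taxa(db_taxa, parent, root="r__Root"):
    """Confirm all database taxa are present and equally distant from the root."""
    depths = {taxon: depth(taxon, parent, root) for taxon in db_taxa}
    if len(set(depths.values())) > 1:
        raise ValueError("database taxa are at different depths: %r" % depths)
    return depths

rows = [("k__Animalia", "r__Root"), ("p__Chordata", "k__Animalia"),
        ("c__Mammalia", "p__Chordata"), ("o__Primates", "c__Mammalia"),
        ("f__Hominidae", "o__Primates"), ("g__Homo", "f__Hominidae"),
        ("s__Homo sapiens", "g__Homo")]
taxonomy = load_taxonomy("\t".join(row) for row in rows)
depths = check_database_taxa(["s__Homo sapiens"], taxonomy)
print(depths)
```

Names are compared exactly, so a parent spelled with different capitalization than its own child entry breaks the chain to the root and is reported as an error.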

## Contributions
This work was supported by the National Institutes of Health grants U54DE023798 (CH), R24DK110499 (CH), and K23DK125838 (LHN), the American Gastroenterological Association Research Foundation’s Research Scholars Award (LHN), and the Crohn’s and Colitis Foundation Career Development Award (LHN). The content is solely the responsibility of the authors and does not necessarily represent the official views of the NIH. We thank April Pawluk, Kelsey N. Thompson, Kristen Curry, and Todd Treangen for comments on the manuscript and helpful discussions. We also acknowledge Monia Michaud, Casey Dulong, and Yan Yan for their help with validation experiments. Computational work was conducted on the FASRC Cannon cluster supported by the FAS Division of Science Research Computing Group at Harvard University.

Raw data

            {
    "_id": null,
    "home_page": "https://huttenhower.sph.harvard.edu/waafle",
    "name": "waafle",
    "maintainer": null,
    "docs_url": null,
    "requires_python": null,
    "maintainer_email": null,
    "keywords": "microbial, microbiome, bioinformatics, microbiology, metagenomic, metatranscriptomic, waafle",
    "author": "Eric Franzosa",
    "author_email": "franzosa@hsph.harvard.edu",
    "download_url": "https://files.pythonhosted.org/packages/47/53/ea5b79773cdb155b462c84f5ae4c6aef5a779837879284eaa9f1e7e37e21/waafle-1.1.0.tar.gz",
    "platform": "Linux",
    "description": "# WAAFLE\n\n**WAAFLE** (a **W**orkflow to **A**nnotate **A**ssemblies and **F**ind **L**GT Events) is a method for finding novel LGT (Lateral Gene Transfer) events in assembled metagenomes, including those from human microbiomes. Please direct questions to the [WAAFLE \"topic\" in the bioBakery Support Forum](https://forum.biobakery.org/c/Microbial-community-profiling/WAAFLE).\n\n\n## Citation\nThe WAAFLE manuscript has been submitted!\n\n> Tiffany Y. Hsu*, Etienne Nzabarushimana*, Dennis Wong, Chengwei Luo, Robert G. Beiko, Morgan Langille, Curtis Huttenhower, Long H. Nguyen**, Eric A. Franzosa**. _Profiling novel lateral gene transfer events in the human microbiome_. (Submitted.)\n> \n> \\* = co-lead; \\*\\* = co-supervised\n\n\nIn the meantime, if you use WAAFLE in your work, please cite the WAAFLE repository on GitHub: https://github.com/biobakery/waafle.\n\n## Installation\n\nInstall the WAAFLE software and Python dependencies with `pip`:\n\n```\n$ pip install waafle\n```\n\nYou can also clone the WAAFLE package from Github:\n\n```\n$ git clone https://github.com/biobakery/waafle.git\n```\n\nOr download and inflate WAAFLE package directly:\n\n```\n$ wget https://github.com/biobakery/waafle/archive/master.zip\n$ unzip master.zip\n```\n\n## Software requirements\n\n* Python 3+ or 2.7+\n* Python `numpy` (tested with v1.13.3)\n* [NCBI BLAST+](https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=Download) (tested with v2.6.0)\n* [bowtie2](http://bowtie-bio.sourceforge.net/bowtie2/index.shtml) (for performing contig-level QC; tested with v2.2.3)\n\n## Database requirements\n\nWAAFLE requires two input databases: 1) a BLAST-formatted **nucleotide sequence database** and 2) a corresponding **taxonomy file**. 
The versions used in the WAAFLE publication are available for download here:\n\n* [waafledb.tar.gz](http://huttenhower.sph.harvard.edu/waafle_data/waafledb.tar.gz) (4.3 GB)\n* [waafledb_taxonomy.tsv](http://huttenhower.sph.harvard.edu/waafle_data/waafledb_taxonomy.tsv) (<1 MB)\n\n## Input data\n\nAn individual WAAFLE run requires one or two non-fixed inputs: 1) a file containing **metagenomic contigs** and (optionally) 2) a GFF file describing **gene/ORF coordinates** along those contigs.\n\n### Input contigs (required)\n\nContigs should be provided as nucleotide sequences in FASTA format. Contigs are expected to have unique, BLAST-compatible headers. WAAFLE is optimized for working with fragmentary contigs from partially assembled metagenomes (spanning 2-20 genes, or roughly 1-20 kb). WAAFLE is not optimized to work with extremely long contigs (100s of kbs), scaffolds, or closed genomes. The WAAFLE developers recommend [MEGAHIT](https://github.com/voutcn/megahit) as a general-purpose metagenomic assembler.\n\n* [A sample contigs input file](https://raw.githubusercontent.com/biobakery/waafle/refs/heads/main/demo/input/demo_contigs.fna)\n\n### Input ORF calls (optional)\n\nThe optional GFF file, if provided, should conform to the [GFF format]([https://useast.ensembl.org/info/website/upload/gff.html). The WAAFLE developers recommend [Prodigal](https://github.com/hyattpd/Prodigal) as a general-purpose ORF caller with GFF output. 
WAAFLE can alternatively generate a GFF file as part of its normal operation ([as described below](#step-2-gene-calling-with-waafle_genecaller)).\n\n* [A sample GFF file produced by WAAFLE](https://github.com/biobakery/waafle/blob/master/demo/output/demo_contigs.gff)\n* [A sample GFF file produced by Prodigal](https://github.com/biobakery/waafle/blob/master/demo/output_prodigal/demo_contigs.prodigal.gff)\n\n## Performing a WAAFLE analysis\n\nAnalyzing a set of contigs with WAAFLE requires performing four steps: 1) subjecting the contigs to homology-based search, 2) identifying genes / open reading frames (ORFs) along the contigs, 3) combining the results of Steps 1 and 2 to identify candidate LGT events, and 4) performing contig-level quality control (QC) to weed out misassembled contigs. All of these steps can be performed using WAAFLE utilities. Steps 2 and 4 can optionally be performed outside of WAAFLE using other methods.\n\n### Step 1: Homology-based search with  `waafle_search`\n\n`waafle_search` is a light wrapper around `blastn` to help guide the nucleotide-level search of your metagenomic contigs against a WAAFLE-formatted database (for example, it ensures that all of the non-default BLAST output fields required for downstream processing are generated).\n\nA sample call to `waafle_search` with input contigs `contigs.fna` and a blast database located in `waafledb` would be:\n\n```\n$ waafle_search contigs.fna waafledb/waafled\n```\n\nBy default, this produces an output file `contigs.blastout` in the same location as the input contigs. See the `--help` menu for additional configuration options.\n\n### Step 2: Gene calling with `waafle_genecaller`\n\nWAAFLE can identify gene coordinates of interest directly from the BLAST output produced in the previous step:\n\n`$ waafle_genecaller contigs.blastout`\n\nThis produces a file in GFF format called  `contigs.gff` for use in the next and last WAAFLE step. 
See the `--help` menu for additional configuration options.\n\n[As described above](#input-orf-calls-optional), a user can alternatively provide a GFF file for their contigs that was generated outside of WAAFLE using another method, e.g. [Prodigal](https://github.com/hyattpd/Prodigal).\n\n### Step 3: Identify candidate LGT events with `waafle_orgscorer`\n\nThe next and most critical step of a WAAFLE analysis is combining the BLAST output generated in Step 1 with a GFF file (generated in Step 2 or with an external method) to 1) taxonomically score genes along the length of the input contigs and then 2) identify those contigs as derived from a single clade or a pair of clades (i.e. putative LGT). Assuming you have run Steps 1 and 2 as described above, a sample call to `waafle_orgscorer` would be:\n\n```\n$ waafle_orgscorer \\\n  contigs.fna \\\n  contigs.blastout \\\n  contigs.gff \\\n  taxonomy.tsv\n```\n\nThis will produce three output files which divide and describe your contigs as putative LGT contigs, single-clade (no-LGT) contigs, and unclassified contigs (e.g. those containing no genes):\n\n* `contigs.lgt.tsv`\n* `contigs.no_lgt.tsv`\n* `contigs.unclassified.tsv`\n\nThese files and their formats are described in more detailed below (see \"WAAFLE outputs\").\n\n`waafle_orgscorer` offers many options for fine-tuning your analysis. The various analysis parameters have been pre-optimized for maximum specificity on both short contigs (containing as little as two partial genes) and longer contigs (10s of genes). These options are detailed in the `--help` menu:\n\n### Step 4: Filter out misassembled contigs\n\nWhile WAAFLE's method of LGT detection has been optimized to distinguish LGT from other biological events (e.g. gene deletion), it does so assuming that the input contigs are biologically valid. This is notable as misassembled contigs, in particular chimeric assembly of genetic material from 2+ organisms, can spuriously manifest as LGT. 
It is therefore critical to filter out low-quality contigs before performing further downstream analyses on WAAFLE outputs.\n\nWAAFLE provides a set of utilities that can aid in the process of performing contig-level QC / filtering using the outputs of Steps 1-3. This method is [described below](#contig-level-quality-control). Alternatively, a user can perform contig-level QC (and remove/correct erroneous contigs) before starting their WAAFLE analysis using an external method, e.g. [metaMIC](https://github.com/ZhaoXM-Lab/metaMIC).\n\nNotably, all contigs included in the demo files linked from this manual passed the QC checks imposed by the WAAFLE QC utilities referenced above.\n\n## WAAFLE outputs\n\nThe `contigs.lgt.tsv` output file lists the details of putative LGT contigs. Its fields are a superset of the types of fields included in the other output files. The following represents the first two lines/rows of a `contigs.lgt.tsv` file *transposed* such that first line (column headers) is shown as the first column and the details of the first LGT contig (second row) are shown as the second column:\n\n```\nCONTIG_NAME          12571\nCALL                 lgt\nCONTIG_LENGTH        9250\nMIN_MAX_SCORE        0.856\nAVG_MAX_SCORE        0.965\nSYNTENY              AABAAAA\nDIRECTION            B>A\nCLADE_A              s__Ruminococcus_bromii\nCLADE_B              s__Faecalibacterium_prausnitzii\nLCA                  f__Ruminococcaceae\nMELDED_A             --\nMELDED_B             --\nTAXONOMY_A           r__Root|k__Bacteria|p__Firmicutes|...|g__Ruminococcus|s__Ruminococcus_bromii\nTAXONOMY_B           r__Root|k__Bacteria|p__Firmicutes|...|g__Faecalibacterium|s__Faecalibacterium_prausnitzii\nLOCI                 252:668:-|792:1367:-|1557:2360:-|2724:3596:-|4540:5592:+|5608:7977:+|8180:8425:+\nANNOTATIONS:UNIPROT  R5E4K6|D4L7I2|D4JXM0|D4L7I1|D4L7I0|None|D4L7H8\n```\n\nThe fields in detail:\n\n* **`CONTIG_NAME`**: the name of the contig from the input FASTA file.\n* 
**`CALL`**: indicates that this was an LGT contig.\n* **`CONTIG_LENGTH`**: the length of the contig in nucleotides.\n* **`MIN_MAX_SCORE`**: the minimum score for the pair of clades explaining the contig along the length of the contig. (i.e. the score for identifying this as a putative LGT contig, with a default threshold of 0.8.)\n* **`AVG_MAX_SCORE`**: the average score for the pair of clades explaining the contig (used for ranking multiple potential explanations of the contig).\n* **`SYNTENY`**: the pattern of genes assigned to the A or B clades along the contig. `*` indicates that the gene could be contributed by either clade; `~` indicates an ignored gene; `!` indicates a problem (should not occur).\n* **`DIRECTION`**: indicates this as a putative B-to-A transfer event, as determined from synteny (A genes flank the inserted B gene). `A?B` indicates that the direction was uncertain.\n* **`CLADE_A`**: the name of clade A.\n* **`CLADE_B`**: the name of clade B.\n* **`LCA`**: the lowest common ancestor of clades A and B. A higher-level LCA indicates a more remote LGT event.\n* **`MELDED_A`**: if using a meld reporting option, the individual melded clades will be listed here. For example, if a contig could be explained by a transfer from *Genus_1 species_1.1* to either *Genus_2 species_2.1* or *Genus_2 species_2.2*, this field would list `species_2.1; species 2.2` and *Genus 2* would be given as `CLADE_A`.\n* **`MELDED_B`**: *see above*.\n* **`TAXONOMY_A`**: the full taxonomy of `CLADE_A`.\n* **`TAXONOMY_B`**: the full taxonomy of `CLADE_B`.\n* **`LOCI`**: Ccordinates of the loci (genes) that were considered for this contig in format `START:STOP:STRAND`.\n* **`ANNOTATIONS:UNIPROT`**: indicates that UniProt annotations were provided for the genes in the input sequence database (in format `UNIPROT=IDENTIFIER`). The best-scoring UniProt annotation for each gene is given here. 
(Additional annotations would appear as additional, similarly-formatted columns in the output.)\n\n## Contig-level quality control\n\nWAAFLE bundles two utilities, `waafle_junctions` and `waafle_qc`, that can be used to aid in the identification and removal of low-quality contigs.\n\n### Quantifying junction support with `waafle_junctions`\n\n`waafle_junctions` maps reads to contigs to quantify support for individiual gene-gene junctions. Specifically, two forms of support are considered/computed:\n\n1. Sequencing fragments (paired reads) that span the gene-gene junction. These are less common for junctions that are larger than typical sequencing insert sizes (~300 nts).\n\n2. Relative junction coverage compared to the mean coverage of the two flanking genes. If the junction is not well covered relative to its flanking genes, it may represent a non-biological overlap.\n\nA sample call to `waafle_junctions` looks like:\n\n```\n$ waafle_junctions \\\n  contigs.fna \\\n  contigs.gff \\\n  --reads1 contigs_reads.1.fq \\\n  --reads2 contigs_reads.2.fq \\\n```\n\nWith this call, `waafle_junctions` will use [bowtie2](http://bowtie-bio.sourceforge.net/bowtie2/index.shtml) to index the contigs and then align the input reads (pairwise) against the index to produce a SAM file. (`waafle_junctions` can also interpret a mapping from an existing SAM file.) 
The alignment results are then interpreted to score individual junctions, producing an output file for each.\n\n* `contigs.junctions.tsv`\n\nA sample report for an individual junction looks like:\n\n```\nCONTIG             SRS011086_k119_10006\nGENE1              1:1363:-\nGENE2              1451:2382:-\nLEN_GENE1          1363\nLEN_GENE2          932\nGAP                87\nJUNCTION_HITS      5\nCOVERAGE_GENE1     8.1079\nCOVERAGE_GENE2     11.3573\nCOVERAGE_JUNCTION  10.9101\nRATIO              1.1210\n```\n\nThis report indicates that the junction between genes 1 and 2 (which may or may not be an LGT junction) was well supported: it was spanned by 5 mate-pairs (`JUNCTION_HITS=5`) and had coverable coverage (`RATIO=1.12`) to the mean of its flanking genes (8.11 and 11.4).\n\n`waafle_junctions` can be tuned to produce additional gene- and nucleotide-level quality reports. Consult the `--help` menu for a full list of options.\n\n### Using junction data for contig QC with `waafle_qc`\n\nThe `waafle_qc` script interprets the output of `waafle_junctions` to remove contigs with weak read support at one or more junctions. Currently, the script focuses on the junctions flanking LGT'ed genes among putative LGT-containing contigs.\n\nA sample call to `waafle_qc` looks like:\n\n```\n$ waafle_qc contigs.lgt.tsv contigs.junctions.tsv\n```\n\nWhere `contigs.junctions.tsv` is the output of `waafle_junctions` on this set of contigs and its underlying reads. This produces a file `contigs.lgt.tsv.qc_pass`: a subset of the original LGT calls that were supported by read-level evidence.\n\nBy default, a junction is supported if it was contained in 2+ mate-pairs *or* had >0.5x the average coverage of its two flanking genes. These thresholds are tunable with the `--min-junction-hits` and `--min-junction-ratio` parameters of `waafle_qc`, respectively. 
Consult the `--help` menu for a full list of options.\n\n## Advanced topics\n\n### Formatting a sequence database for WAAFLE\n\nWAAFLE performs a nucleotide-level search of metagenomic contigs against a collection of taxonomically annotated protein-coding genes (*not* complete genomes). A common way to build such a database is to combine a collection of microbial pangenomes of interest. The protein-coding genes should be organized in FASTA format and then indexed for use with `blastn`. For example, the FASTA file `waafledb.fnt` would be indexed as:\n\n```\n$ makeblastdb -in waafledb.fnt -dbtype nucl\n```\n\nWAAFLE expects the input FASTA sequence headers to follow a specific format. At a minimum, the headers must contain a unique sequence ID (`GENE_123` below) followed by `|` (pipe) followed by a taxon name or taxonomic ID (`SPECIES_456` below):\n\n```\n>GENE_123|SPECIES_456\n```\n\nHeaders can contain additional `|`-delimited fields corresponding to functional annotations of the gene. These fields have the format `SYSTEM=IDENTIFIER` and multiple such fields can be included per header, as in:\n\n```\n>GENE_123|SPECIES_456|PFAM=P00789|EC=1.2.3.4\n```\nHeaders are allowed to contain different sets of functional annotation fields. WAAFLE currently expects at most one annotation per annotation system per gene; this will be improved in future versions. (We currently recommend annotating genes against the [UniRef90 and UniRef50](http://www.uniprot.org/help/uniref) databases to enable link-outs to more detailed functional annotations in downstream analyses.)\n\nWAAFLE assumes that the taxa listed in the sequence database file are all at the same taxonomic level (for example, all genera or all species or all strains).\n\n### Formatting a WAAFLE taxonomy file\n\nWAAFLE requires a taxonomy file to understand the taxonomic relationships among the taxa whose genes are included in the sequence database. 
The taxonomy file is a tab-delimited list of child-parent relationships, for example:\n\n```\nk__Animalia      r__Root\np__Chordata      k__Animalia\nc__Mammalia      p__Chordata\no__Primates      c__Mammalia\nf__Hominidae     o__Primates\ng__Homo          f__Hominidae\ns__Homo sapiens  g__Homo\n```\n\nWhile the format of this file is relatively simple, it has a number of critical structural constraints that must be obeyed:\n\n* All taxon names/IDs used in the sequence database must appear in the taxonomy file.\n\n* The file must contain a root taxon from which all other taxa descend (the root taxon should be named `r__Root`, as above).\n\n* All taxon names/IDs used in the sequence database must be the same distance from the root.\n\nThe following properties of the taxonomy file are optional:\n\n* Additional taxa *below* the level of the taxa in the sequence file can be included in the taxonomy file. For example, a species-level sequence database could contain isolate genomes as children of the species-level clades in the taxonomy file. (WAAFLE can use such entries as \"leaf support\" for LGT events.)\n\n* We recommend prefixing taxonomic clades to indicate their level. For example, `g__Homo` identifies *Homo* as a genus above.\n\n## Contributions\nThis work was supported by the National Institutes of Health grants U54DE023798 (CH), R24DK110499 (CH), and K23DK125838 (LHN), the American Gastroenterological Association Research Foundation\u2019s Research Scholars Award (LHN), and the Crohn\u2019s and Colitis Foundation Career Development Award (LHN). The content is solely the responsibility of the authors and does not necessarily represent the official views of the NIH. We thank April Pawluk, Kelsey N. Thompson, Kristen Curry, and Todd Treangen for comments on the manuscript and helpful discussions. We also acknowledge Monia Michaud, Casey Dulong, and Yan Yan for their help with validation experiments. 
Computational work was conducted on the FASRC Cannon cluster supported by the FAS Division of Science Research Computing Group at Harvard University.",
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "WAAFLE: a Workflow to Annotate Assemblies and Find LGT Events",
    "version": "1.1.0",
    "project_urls": {
        "Homepage": "https://huttenhower.sph.harvard.edu/waafle"
    },
    "split_keywords": [
        "microbial",
        " microbiome",
        " bioinformatics",
        " microbiology",
        " metagenomic",
        " metatranscriptomic",
        " waafle"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "4753ea5b79773cdb155b462c84f5ae4c6aef5a779837879284eaa9f1e7e37e21",
                "md5": "caec551cc96a52c2f03fd48e93acb48a",
                "sha256": "831ad15b6aacfc5de95f0e90c82da13b7e6177a7de23d7d524f0000a569b4a72"
            },
            "downloads": -1,
            "filename": "waafle-1.1.0.tar.gz",
            "has_sig": false,
            "md5_digest": "caec551cc96a52c2f03fd48e93acb48a",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": null,
            "size": 37717,
            "upload_time": "2024-10-10T20:40:14",
            "upload_time_iso_8601": "2024-10-10T20:40:14.630709Z",
            "url": "https://files.pythonhosted.org/packages/47/53/ea5b79773cdb155b462c84f5ae4c6aef5a779837879284eaa9f1e7e37e21/waafle-1.1.0.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-10-10 20:40:14",
    "github": false,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "lcname": "waafle"
}

Eric Franzosa