LAFITE
======
Low-abundance Aware Full-length Isoform clusTEr
Overview
--------
LAFITE is designated to identify high-consensus full-length isoforms from Nanopore Direct RNA-seq data. LAFITE combines multiple features from reference annotation and DRS reads (TSS, TES, splicing junction, and read polyadenylation event) and is more sensitive to Low-abundance transcripts.
Prerequisites
-------------
* [bedtools](https://github.com/arq5x/bedtools2)
* [Minimap2](https://github.com/lh3/minimap2)
* [nanopolish](https://github.com/jts/nanopolish)
* [samtools](http://www.htslib.org)
* Python 3.11
Installation
------------
To avoid potential conflicts, we recommend running LAFITE in a conda environment.
```
conda create -n LAFITE_env -c conda-forge python=3.11
conda activate LAFITE_env
# stable release
pip install LAFITE
# or the latest development version
pip install git+https://github.com/TF-Chan-Lab/LAFITE
```
Usage
-----
1. **Run minimap2 and samtools to generate alignment file in bam format**
```
minimap2 -ax splice -u f -k 14 -G 500000 --secondary=no REFERENCE_FA FASTQ > ALIGNMENT_SAM
samtools view -bS ALIGNMENT_SAM|samtools sort - > ALIGNMENT_BAM
```
LAFITE also supports other splicing-aware long read alignment tools.
2. **Run Nanopolish polya to generate read polyadenylation result (optional but recommend)**
Current long-read sequencing technologies (Nanopore cDNA/DRS or PacBio Iso-Seq) are all designed to capture RNA molecules with poly(A) tail. However, RNA fragmentation and pore blocking may bring a considerable part of truncated reads which will interfere downstream analysis. Therefore, LAFITE utilizes the read polyadenylation status reported by Nanopolish to filter reads that have completed the sequencing process.
```
nanopolish index -d PATH_TO_FAST5 -s GUPPY_SEQUENCING_SUMMARY FASTQ
nanopolish polya -t NUM_OF_THREADS -r FASTQ -b ALIGNMENT_BAM -g REFERENCE_FA > Nanopolish_PolyA_RES
```
LAFITE also provides an alternative approach to estimate read polyadenylation status by scanning any poly(A) motifs that existed at the read 3'-end.
1. **Run LAFITE**
```
usage: lafite [-h] -b BAM [-B BEDTOOLS] -g GTF -f GENOME -o OUTPUT
[-n MIN_COUNT_TSS_TES] [-i MIS_INTRON_LENGTH]
[-c MIN_NOVEL_TRANS_COUNT] [-s MIN_SINGLE_EXON_COVERAGE]
[-l MIN_SINGLE_EXON_LEN] [-L LABEL] [-p POLYA]
[-m POLYA_MOTIF_FILE] [-r RELATIVE_ABUNDANCE_THRESHOLD]
[-j SHORT_SJ_TAB] [-w SJ_CORRECTION_WINDOW] [--no_full_cleanup]
[-t THREAD] [-T TSS_PEAK] [-d TSS_CUTOFF]
Low-abundance Aware Full-length Isoform clusTEr
optional arguments:
-h, --help show this help message and exit
-b BAM path to the alignment file in bam format
-B BEDTOOLS path to the executable bedtools
-g GTF path to the reference gene annotation in GTF format
-f GENOME path to the reference genome fasta
-o OUTPUT path to the output file
-n MIN_COUNT_TSS_TES minimum number of reads supporting a alternative TSS or TES, default: 3
-i MIS_INTRON_LENGTH length cutoff for correcting unexpected small intron, default: 150
-c MIN_NOVEL_TRANS_COUNT
minimum occurrences required for a isoform from novel loci, default: 3
-s MIN_SINGLE_EXON_COVERAGE
minimum read coverage required for a novel single-exon transcript, default: 4
-l MIN_SINGLE_EXON_LEN
minimum length for single-exon transcript, default: 100
-L LABEL name prefix for output transcripts, default: LAFT
-p POLYA path to the file contains read Polyadenylation event
-m POLYA_MOTIF_FILE path to the polya motif file
-r RELATIVE_ABUNDANCE_THRESHOLD
minimum abundance of the predicted multi-exon transcripts as a fraction of the
total transcript assembled at a given locus, default: 0.01
-j SHORT_SJ_TAB path to the short read splice junction file
-w SJ_CORRECTION_WINDOW
edit distance to reference splicing site for splicing correction, default: 40
--no_full_cleanup keep all intermediate files
-t THREAD number of the threads, default: 4
-T TSS_PEAK path to the TSS peak file
-d TSS_CUTOFF minimum TSS distance for a transcript to be considered as a novel transcript
```
- LAFITE can run with the following arguments:
```
lafite -b ALIGNMENT_BAM -g REFERENCE_GTF -f REFERENCE_FA -o OUTPUT_GTF -t NUM_OF_THREADS -p Nanopolish_PolyA_RES
```
- LAFITE can also run without the result from *nanoplish polya*. Then, a Poly(A) motif list must be provided for the corresponding species.
We have provided the Poly(A) motif list for human and mouse retrieved from [*Tian* et al.](https://academic.oup.com/nar/article/33/1/201/2401035) .
```
lafite -b ALIGNMENT_BAM -g REFERENCE_GTF -f REFERENCE_FA -o OUTPUT_GTF -t NUM_OF_THREADS -m POLYA_MOTIFS_OF_SPECIES
```
- LAFITE accepts the TSS peaks from 5'-end CAGE data for identifying high-confidence TSSs. Users can prepare the TSS data in the following format where:
- The first column is the chromosome name
- The second column is the 0-based start position of the TSS peak
- The third column is the 1-based end position of the TSS peak
- The fourth column is the strand information
- LAFITE also accepts the splicing junctions from Illumina short read RNA-seq data to proof the long reads. LAFITE supports the SJ.out.tab from STAR aligner. Users can also prepare the splicing junctions in the following format where:
- The first column is the chromosome name
- The second column is the 0-based start position of the splicing junction
- The third column is the 1-based end position of the splicing junction
- The fourth column is the strand information
Development
-----------
LAFITE was developed following the [fastai/nbdev](https://github.com/fastai/nbdev) framework.
Raw data
{
"_id": null,
"home_page": "https://github.com/TF-Chan-Lab/LAFITE/tree/main/",
"name": "LAFITE",
"maintainer": "",
"docs_url": null,
"requires_python": ">=3.11",
"maintainer_email": "",
"keywords": "Nanopore DRS transcriptome",
"author": "Jizhou ZHANG",
"author_email": "zjzace@ourlook.com",
"download_url": "https://files.pythonhosted.org/packages/d4/12/e3dc27fdf3da05f13b5a5d7996fa4999cb36b3e641a67dfa3e613d468ac1/LAFITE-1.0.2.tar.gz",
"platform": null,
"description": "LAFITE\n======\n\nLow-abundance Aware Full-length Isoform clusTEr\n\nOverview\n--------\nLAFITE is designated to identify high-consensus full-length isoforms from Nanopore Direct RNA-seq data. LAFITE combines multiple features from reference annotation and DRS reads (TSS, TES, splicing junction, and read polyadenylation event) and is more sensitive to Low-abundance transcripts.\n\n\n\nPrerequisites\n-------------\n* [bedtools](https://github.com/arq5x/bedtools2)\n* [Minimap2](https://github.com/lh3/minimap2)\n* [nanopolish](https://github.com/jts/nanopolish)\n* [samtools](http://www.htslib.org)\n* Python 3.11\n\nInstallation\n------------\nTo avoid potential conflicts, we recommend running LAFITE in a conda environment.\n```\nconda create -n LAFITE_env -c conda-forge python=3.11\nconda activate LAFITE_env\n\n# stable release\npip install LAFITE\n\n# or the latest development version \npip install git+https://github.com/TF-Chan-Lab/LAFITE\n```\n\nUsage\n-----\n1. **Run minimap2 and samtools to generate alignment file in bam format**\n\t```\n\tminimap2 -ax splice -u f -k 14 -G 500000 --secondary=no REFERENCE_FA FASTQ > ALIGNMENT_SAM\n\tsamtools view -bS ALIGNMENT_SAM|samtools sort - > ALIGNMENT_BAM\n\t```\n\tLAFITE also supports other splicing-aware long read alignment tools.\n2. **Run Nanopolish polya to generate read polyadenylation result (optional but recommend)** \nCurrent long-read sequencing technologies (Nanopore cDNA/DRS or PacBio Iso-Seq) are all designed to capture RNA molecules with poly(A) tail. However, RNA fragmentation and pore blocking may bring a considerable part of truncated reads which will interfere downstream analysis. Therefore, LAFITE utilizes the read polyadenylation status reported by Nanopolish to filter reads that have completed the sequencing process. \n\n ```\n nanopolish index -d PATH_TO_FAST5 -s GUPPY_SEQUENCING_SUMMARY FASTQ\n nanopolish polya -t NUM_OF_THREADS -r FASTQ -b ALIGNMENT_BAM -g REFERENCE_FA > Nanopolish_PolyA_RES\n ```\n\tLAFITE also provides an alternative approach to estimate read polyadenylation status by scanning any poly(A) motifs that existed at the read 3'-end. \n\n1. **Run LAFITE** \n\t```\n\tusage: lafite [-h] -b BAM [-B BEDTOOLS] -g GTF -f GENOME -o OUTPUT\n [-n MIN_COUNT_TSS_TES] [-i MIS_INTRON_LENGTH]\n [-c MIN_NOVEL_TRANS_COUNT] [-s MIN_SINGLE_EXON_COVERAGE]\n [-l MIN_SINGLE_EXON_LEN] [-L LABEL] [-p POLYA]\n [-m POLYA_MOTIF_FILE] [-r RELATIVE_ABUNDANCE_THRESHOLD]\n [-j SHORT_SJ_TAB] [-w SJ_CORRECTION_WINDOW] [--no_full_cleanup]\n [-t THREAD] [-T TSS_PEAK] [-d TSS_CUTOFF]\n\n\tLow-abundance Aware Full-length Isoform clusTEr\n\n\toptional arguments:\n\t -h, --help show this help message and exit\n\t -b BAM path to the alignment file in bam format\n\t -B BEDTOOLS path to the executable bedtools\n\t -g GTF path to the reference gene annotation in GTF format\n\t -f GENOME path to the reference genome fasta\n\t -o OUTPUT path to the output file\n\t -n MIN_COUNT_TSS_TES minimum number of reads supporting a alternative TSS or TES, default: 3\n\t -i MIS_INTRON_LENGTH length cutoff for correcting unexpected small intron, default: 150\n\t -c MIN_NOVEL_TRANS_COUNT\n\t minimum occurrences required for a isoform from novel loci, default: 3\n\t -s MIN_SINGLE_EXON_COVERAGE\n\t minimum read coverage required for a novel single-exon transcript, default: 4\n\t -l MIN_SINGLE_EXON_LEN\n\t minimum length for single-exon transcript, default: 100\n\t -L LABEL name prefix for output transcripts, default: LAFT\n\t -p POLYA path to the file contains read Polyadenylation event\n\t -m POLYA_MOTIF_FILE path to the polya motif file\n\t -r RELATIVE_ABUNDANCE_THRESHOLD\n\t minimum abundance of the predicted multi-exon transcripts as a fraction of the\n\t\t\t\t\t\t\ttotal transcript assembled at a given locus, default: 0.01\n\t -j SHORT_SJ_TAB path to the short read splice junction file\n\t -w SJ_CORRECTION_WINDOW\n\t edit distance to reference splicing site for splicing correction, default: 40\n\t --no_full_cleanup keep all intermediate files\n\t -t THREAD number of the threads, default: 4\n\t -T TSS_PEAK path to the TSS peak file\n\t -d TSS_CUTOFF minimum TSS distance for a transcript to be considered as a novel transcript\n\t```\n- LAFITE can run with the following arguments:\n ```\n lafite -b ALIGNMENT_BAM -g REFERENCE_GTF -f REFERENCE_FA -o OUTPUT_GTF -t NUM_OF_THREADS -p Nanopolish_PolyA_RES\n ```\n- LAFITE can also run without the result from *nanoplish polya*. Then, a Poly(A) motif list must be provided for the corresponding species. \n We have provided the Poly(A) motif list for human and mouse retrieved from [*Tian* et al.](https://academic.oup.com/nar/article/33/1/201/2401035) .\n \n ```\n lafite -b ALIGNMENT_BAM -g REFERENCE_GTF -f REFERENCE_FA -o OUTPUT_GTF -t NUM_OF_THREADS -m POLYA_MOTIFS_OF_SPECIES\n ```\n- LAFITE accepts the TSS peaks from 5'-end CAGE data for identifying high-confidence TSSs. Users can prepare the TSS data in the following format where:\n - The first column is the chromosome name\n - The second column is the 0-based start position of the TSS peak\n - The third column is the 1-based end position of the TSS peak\n - The fourth column is the strand information \n- LAFITE also accepts the splicing junctions from Illumina short read RNA-seq data to proof the long reads. LAFITE supports the SJ.out.tab from STAR aligner. Users can also prepare the splicing junctions in the following format where:\n - The first column is the chromosome name\n - The second column is the 0-based start position of the splicing junction\n - The third column is the 1-based end position of the splicing junction\n - The fourth column is the strand information \n\nDevelopment\n-----------\n\nLAFITE was developed following the [fastai/nbdev](https://github.com/fastai/nbdev) framework.\n",
"bugtrack_url": null,
"license": "Apache Software License 2.0",
"summary": "Nanopore Direct RNA-seq Transcriptome Assembly",
"version": "1.0.2",
"split_keywords": [
"nanopore",
"drs",
"transcriptome"
],
"urls": [
{
"comment_text": "",
"digests": {
"md5": "c56ba26ee6c6018359ebe6670081f988",
"sha256": "6fe4e6b86b6bfa6d29b9ea29139effa04e92d8966c79767fdb7a1913710109b2"
},
"downloads": -1,
"filename": "LAFITE-1.0.2-py3-none-any.whl",
"has_sig": false,
"md5_digest": "c56ba26ee6c6018359ebe6670081f988",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.11",
"size": 28738,
"upload_time": "2022-12-14T10:42:01",
"upload_time_iso_8601": "2022-12-14T10:42:01.805606Z",
"url": "https://files.pythonhosted.org/packages/78/94/ed8a228ca412cf5ce83c8014d5f86cf16ca973b60c10912cdcc4a36d0621/LAFITE-1.0.2-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"md5": "d3e74c207e2a4c08968ba2e74977b041",
"sha256": "14bf302f05eae4eed457f506072e36651dad0c281e38cacf9ddb1c003e79f0fd"
},
"downloads": -1,
"filename": "LAFITE-1.0.2.tar.gz",
"has_sig": false,
"md5_digest": "d3e74c207e2a4c08968ba2e74977b041",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.11",
"size": 29163,
"upload_time": "2022-12-14T10:42:03",
"upload_time_iso_8601": "2022-12-14T10:42:03.636866Z",
"url": "https://files.pythonhosted.org/packages/d4/12/e3dc27fdf3da05f13b5a5d7996fa4999cb36b3e641a67dfa3e613d468ac1/LAFITE-1.0.2.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2022-12-14 10:42:03",
"github": false,
"gitlab": false,
"bitbucket": false,
"lcname": "lafite"
}