* Name: LAFITE
* Version: 1.0.2
* Home page: https://github.com/TF-Chan-Lab/LAFITE/tree/main/
* Summary: Nanopore Direct RNA-seq Transcriptome Assembly
* Upload time: 2022-12-14 10:42:03
* Author: Jizhou ZHANG
* Requires Python: >=3.11
* License: Apache Software License 2.0
* Keywords: nanopore, drs, transcriptome

LAFITE
======

Low-abundance Aware Full-length Isoform clusTEr

Overview
--------
LAFITE is designed to identify high-consensus full-length isoforms from Nanopore Direct RNA-seq (DRS) data. It combines multiple features from the reference annotation and the DRS reads (transcription start sites (TSS), transcription end sites (TES), splice junctions, and read polyadenylation events), which makes it more sensitive to low-abundance transcripts.



Prerequisites
-------------
* [bedtools](https://github.com/arq5x/bedtools2)
* [Minimap2](https://github.com/lh3/minimap2)
* [nanopolish](https://github.com/jts/nanopolish)
* [samtools](http://www.htslib.org)
* Python 3.11
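
Before installing, it can help to confirm that the command-line tools above are reachable. The following is a minimal sketch, not part of LAFITE: it assumes the executables are installed under their usual names (`bedtools`, `minimap2`, `nanopolish`, `samtools`), which may differ on some systems, and `find_missing_tools` is a hypothetical helper name.

```python
import shutil

# Executable names are assumptions: these are the usual command names
# for the tools listed above, but local installs may differ.
REQUIRED_TOOLS = ("bedtools", "minimap2", "nanopolish", "samtools")


def find_missing_tools(tools=REQUIRED_TOOLS):
    """Return the tools that cannot be found on the current PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print("Missing from PATH: " + ", ".join(missing))
    else:
        print("All prerequisite tools were found on PATH.")
```

If bedtools lives outside the PATH, the `-B BEDTOOLS` option described under Usage can point LAFITE at the executable instead.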

Installation
------------
To avoid potential conflicts, we recommend running LAFITE in a conda environment.
```
conda create -n LAFITE_env -c conda-forge python=3.11
conda activate LAFITE_env

# stable release
pip install LAFITE

# or the latest development version 
pip install git+https://github.com/TF-Chan-Lab/LAFITE
```

Usage
-----
1. **Run minimap2 and samtools to generate an alignment file in BAM format**
	```
	minimap2 -ax splice -u f -k 14 -G 500000 --secondary=no REFERENCE_FA FASTQ > ALIGNMENT_SAM
	samtools view -bS ALIGNMENT_SAM|samtools sort - > ALIGNMENT_BAM
	```
	LAFITE also supports alignments from other splice-aware long-read alignment tools.
2. **Run Nanopolish polya to generate the read polyadenylation result (optional but recommended)**  
Current long-read sequencing technologies (Nanopore cDNA/DRS or PacBio Iso-Seq) are all designed to capture RNA molecules with a poly(A) tail. However, RNA fragmentation and pore blocking can produce a considerable fraction of truncated reads, which interfere with downstream analysis. LAFITE therefore uses the read polyadenylation status reported by Nanopolish to filter for reads that completed the sequencing process.  

   ```
   nanopolish index -d PATH_TO_FAST5 -s GUPPY_SEQUENCING_SUMMARY FASTQ
   nanopolish polya -t NUM_OF_THREADS -r FASTQ -b ALIGNMENT_BAM -g REFERENCE_FA > Nanopolish_PolyA_RES
   ```
	LAFITE also provides an alternative approach that estimates read polyadenylation status by scanning for poly(A) motifs at the read 3'-end.  

3. **Run LAFITE**  
	```
	usage: lafite [-h] -b BAM [-B BEDTOOLS] -g GTF -f GENOME -o OUTPUT
              [-n MIN_COUNT_TSS_TES] [-i MIS_INTRON_LENGTH]
              [-c MIN_NOVEL_TRANS_COUNT] [-s MIN_SINGLE_EXON_COVERAGE]
              [-l MIN_SINGLE_EXON_LEN] [-L LABEL] [-p POLYA]
              [-m POLYA_MOTIF_FILE] [-r RELATIVE_ABUNDANCE_THRESHOLD]
              [-j SHORT_SJ_TAB] [-w SJ_CORRECTION_WINDOW] [--no_full_cleanup]
              [-t THREAD] [-T TSS_PEAK] [-d TSS_CUTOFF]

	Low-abundance Aware Full-length Isoform clusTEr

	optional arguments:
	  -h, --help            show this help message and exit
	  -b BAM                path to the alignment file in bam format
	  -B BEDTOOLS           path to the executable bedtools
	  -g GTF                path to the reference gene annotation in GTF format
	  -f GENOME             path to the reference genome fasta
	  -o OUTPUT             path to the output file
	  -n MIN_COUNT_TSS_TES  minimum number of reads supporting a alternative TSS or TES, default: 3
	  -i MIS_INTRON_LENGTH  length cutoff for correcting unexpected small intron, default: 150
	  -c MIN_NOVEL_TRANS_COUNT
	                        minimum occurrences required for a isoform from novel loci, default: 3
	  -s MIN_SINGLE_EXON_COVERAGE
	                        minimum read coverage required for a novel single-exon transcript, default: 4
	  -l MIN_SINGLE_EXON_LEN
	                        minimum length for single-exon transcript, default: 100
	  -L LABEL              name prefix for output transcripts, default: LAFT
	  -p POLYA              path to the file contains read Polyadenylation event
	  -m POLYA_MOTIF_FILE   path to the polya motif file
	  -r RELATIVE_ABUNDANCE_THRESHOLD
	                        minimum abundance of the predicted multi-exon transcripts as a fraction of the
							total transcript assembled at a given locus, default: 0.01
	  -j SHORT_SJ_TAB       path to the short read splice junction file
	  -w SJ_CORRECTION_WINDOW
	                        edit distance to reference splicing site for splicing correction, default: 40
	  --no_full_cleanup     keep all intermediate files
	  -t THREAD             number of the threads, default: 4
	  -T TSS_PEAK           path to the TSS peak file
	  -d TSS_CUTOFF         minimum TSS distance for a transcript to be considered as a novel transcript
	```
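
The `-p POLYA` option above takes the Nanopolish output produced in step 2. As a rough sanity check before passing that file to LAFITE, the sketch below tallies reads by the `qc_tag` column that `nanopolish polya` writes to its tab-separated output. It is only an illustration: `summarize_polya_qc` is a hypothetical helper, the example rows are made up, and nothing here describes how LAFITE itself interprets the tags.

```python
import csv
import io
from collections import Counter


def summarize_polya_qc(handle):
    """Count reads per qc_tag in a nanopolish polya TSV opened as text."""
    reader = csv.DictReader(handle, delimiter="\t")
    return Counter(row["qc_tag"] for row in reader)


# Tiny made-up example with the nanopolish polya column layout.
example = (
    "readname\tcontig\tposition\tleader_start\tadapter_start\t"
    "polya_start\ttranscript_start\tread_rate\tpolya_length\tqc_tag\n"
    "read_1\tchr1\t1000\t3.0\t2500.0\t6000.0\t9000.0\t120.5\t85.2\tPASS\n"
    "read_2\tchr1\t2000\t4.0\t2600.0\t6100.0\t9100.0\t118.0\t12.4\tSUFFCLIP\n"
    "read_3\tchr2\t500\t-1.0\t-1.0\t-1.0\t-1.0\t-1.0\t-1.0\tNOREGION\n"
)
counts = summarize_polya_qc(io.StringIO(example))
for tag, n in sorted(counts.items()):
    print(f"{tag}\t{n}")
```

On a real run, the same function can be called on the file from step 2, for example `with open("Nanopolish_PolyA_RES") as fh: summarize_polya_qc(fh)`. A very small share of `PASS` reads may be worth investigating before assembly.
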
- LAFITE can run with the following arguments:
   ```
   lafite -b ALIGNMENT_BAM -g REFERENCE_GTF -f REFERENCE_FA -o OUTPUT_GTF -t NUM_OF_THREADS -p Nanopolish_PolyA_RES
   ```
- LAFITE can also run without the result from *nanopolish polya*. In that case, a poly(A) motif list must be provided for the corresponding species.  
   We provide poly(A) motif lists for human and mouse, retrieved from [*Tian* et al.](https://academic.oup.com/nar/article/33/1/201/2401035).
   
   ```
   lafite -b ALIGNMENT_BAM -g REFERENCE_GTF -f REFERENCE_FA -o OUTPUT_GTF -t NUM_OF_THREADS -m POLYA_MOTIFS_OF_SPECIES
   ```
- LAFITE accepts TSS peaks from 5'-end CAGE data for identifying high-confidence TSSs. Users can prepare the TSS data in the following four-column format:
  - The first column is the chromosome name
  - The second column is the 0-based start position of the TSS peak
  - The third column is the 1-based end position of the TSS peak
  - The fourth column is the strand information  
- LAFITE also accepts splice junctions from Illumina short-read RNA-seq data to proofread the long reads. LAFITE supports the SJ.out.tab file from the STAR aligner. Users can also prepare the splice junctions in the following four-column format:
  - The first column is the chromosome name
  - The second column is the 0-based start position of the splicing junction
  - The third column is the 1-based end position of the splicing junction
  - The fourth column is the strand information  
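
Both file layouts above pair a 0-based start with a 1-based end, so an interval given as 1-based inclusive coordinates only needs its start shifted by one. The sketch below shows that conversion under two assumptions the text does not state: that the columns are tab-separated and that the strand is written as `+` or `-`. `format_interval` is a hypothetical helper, and the coordinates are made up.

```python
def format_interval(chrom, first_base, last_base, strand):
    """Format a 1-based inclusive interval as chrom, 0-based start, 1-based end, strand."""
    if strand not in ("+", "-"):
        raise ValueError(f"unexpected strand: {strand!r}")
    if not 1 <= first_base <= last_base:
        raise ValueError("expected 1 <= first_base <= last_base")
    # Only the start shifts by one; the 1-based inclusive end stays as is.
    return f"{chrom}\t{first_base - 1}\t{last_base}\t{strand}"


# Made-up coordinates, for illustration only.
peaks = [("chr1", 11869, 11900, "+"), ("chr2", 500, 520, "-")]
lines = [format_interval(*peak) for peak in peaks]
print("\n".join(lines))
```

The resulting lines can be written to a file and passed with `-T TSS_PEAK` (for TSS peaks) or `-j SHORT_SJ_TAB` (for splice junctions).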

Development
-----------

LAFITE was developed following the [fastai/nbdev](https://github.com/fastai/nbdev) framework.

            
