<!--
* @Author: 'rensc' 'rensc0718@163.com'
* @Date: 2024-10-15 11:44:58
* @LastEditors: 'rensc' 'rensc0718@163.com'
* @LastEditTime: 2025-02-18 03:50
* @FilePath: \RiboParser\README.md
*
-->
# RiboParser
To streamline understanding and application, we will analyze publicly accessible project data, breaking down each analytical step to illustrate the complete workflow.
This process encompasses both general analysis steps and specialized analysis and visualization techniques, facilitated by `RiboParser` and `RiboShiny`.
The specific steps involved are:
1. Software installation
2. Reference file creation
3. Raw data download
4. Raw data cleaning
5. Data alignment
6. Sequencing quality analysis
7. Gene-level analysis
8. Codon-level analysis
The results of this data analysis can be further analyzed and visualized in `RiboShiny`.
## 1. Software configuration
### 1.1 create the environment with `conda`
```bash
conda create -n ribo
conda activate ribo
```
### 1.2 Install software dependencies using `conda`
```bash
conda install cutadapt -c bioconda
conda install bowtie -c bioconda
conda install samtools -c bioconda
conda install star -c bioconda
conda install bedtools -c bioconda
conda install subread -c bioconda
conda install rsem -c bioconda
conda install gffread -c bioconda
conda install sra-tools -c bioconda
conda install ucsc-genepredtogtf -c bioconda
conda install ucsc-gtftogenepred -c bioconda
conda install ucsc-gff3togenepred -c bioconda
conda install ucsc-bedgraphtobigwig -c bioconda
conda install ucsc-bedsort -c bioconda
conda install pigz -c conda-forge
```
or install the packages in a single command.
```bash
conda install cutadapt bowtie samtools star bedtools subread rsem gffread sra-tools \
ucsc-genepredtogtf ucsc-gtftogenepred ucsc-gff3togenepred ucsc-bedgraphtobigwig ucsc-bedsort \
-c bioconda
conda install pigz -c conda-forge
```
### 1.3 `pip` install `RiboParser`
When the server is connected to the network, we can use `pip` to install software directly.
```bash
pip install riboparser
```
Alternatively, we can download the version from GitHub, re-setup and then install it.
```bash
cd RiboParser
python3 setup.py sdist bdist_wheel
pip install dist/RiboParser-0.1.6.1-py3-none-any.whl
```
### 1.4 run the test
Test software for dependency, installation, and operation issues.
```bash
rpf_Check -h
rpf_CST -h
```
## 2. Prepare reference files
### 2.1 An example of the complete project directory is shown below
The complete data analysis includes reference preparation, rawdata, RNA-seq data analysis, and Ribo-seq data analysis.
```bash
$ cd && cd ./sce/
$ tree -d
.
├── 1.reference
│ ├── genome
│ ├── mrna
│ ├── ncrna
│ ├── norm
│ ├── rrna
│ ├── rsem-index
│ ├── star-index
│ └── trna
├── 2.rawdata
│ ├── ribo-seq
│ └── rna-seq
├── 3.rna-seq
│ ├── 1.cleandata
│ ├── 2.bowtie
│ ├── 3.star
│ ├── 4.quantification
│ └── 5.riboparser
│ ├── 01.qc
│ ├── 02.digestion
│ ├── 03.offset
│ ├── 04.density
│ ├── 05.merge
│ ├── 06.periodicity
│ ├── 07.metaplot
│ ├── 08.coverage
│ ├── 09.correlation
│ ├── 10.shuffle
│ └── 11.retrieve
└── 4.ribo-seq
├── 1.cleandata
├── 2.bowtie
├── 3.star
├── 4.quantification
└── 5.riboparser
├── 01.qc
├── 02.digestion
├── 03.offset
├── 04.density
├── 05.merge
├── 06.periodicity
├── 07.metaplot
├── 08.coverage
├── 09.correlation
├── 10.quantification
├── 11.pausing_score
├── 12.codon_occupancy
├── 13.codon_decoding_time
├── 14.codon_selection_time
├── 15.coefficient_of_variation
├── 16.meta_codon
├── 17.shuffle
├── 18.retrieve
└── 19.frame_shift
```
### 2.2 Prepare the reference genome index
#### 2.2.1 Create the directory
Create folders to hold different types of reference sequence files.
```bash
$ mkdir -p ./sce/1.reference/
$ cd ./sce/1.reference/
$ mkdir cdna genome gtf mrna ncrna rrna trna norm rsem-index
```
#### 2.2.2 Download reference files from NCBI
Use the most common data analysis file format, the genome sequence in fasta format, and the reference file in GTF or GFF3 format.
```bash
# genome sequence
$ wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/146/045/GCF_000146045.2_R64/GCF_000146045.2_R64_genomic.fna.gz
# GTF or GFF3
$ wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/146/045/GCF_000146045.2_R64/GCF_000146045.2_R64_genomic.gtf.gz
$ wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/146/045/GCF_000146045.2_R64/GCF_000146045.2_R64_genomic.gff.gz
# cDNA sequence
$ wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/146/045/GCF_000146045.2_R64/GCF_000146045.2_R64_rna.fna.gz
# feature table
$ wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/146/045/GCF_000146045.2_R64/GCF_000146045.2_R64_feature_table.txt.gz
# decompression
$ gunzip *.gz
$ gffread -g GCF_000146045.2_R64_genomic.fna GCF_000146045.2_R64_genomic.gff -F -w cdna.fa
```
#### 2.2.3 Create the `genome` index using `bowtie`
```bash
$ bowtie-build ../GCF_000146045.2_R64_genomic.fna ./genome/genome --threads 12 &>> ./genome/genome_build.log
```
#### 2.2.4 Create an `mRNA` index using `bowtie`
A custom script is used here to extract the corresponding sequence information
from the `fasta` file based on the sequence name.
```bash
$ retrieve_seq -h
usage: retrieve_seq [-h] [-v] -i INPUT -n NAME [-u UNMAPPED] -o OUTPUT
This script is used to retrieve the fasta sequence by name.
options:
-h, --help show this help message and exit
-v, --version show programs version number and exit
-i INPUT input the fasta file
-n NAME gene ids in txt format
-u UNMAPPED output the unmapped gene ids
-o OUTPUT prefix of output file name (default results_peaks.txt)
```
```bash
# filter the mrna sequence
$ grep -i 'gbkey=mRNA' ./cdna.fa | cut -d ' ' -f 1 | cut -c 2- > ./mrna/mrna.ids
$ retrieve_seq -i ./cdna.fa -n ./mrna/mrna.ids -o ./mrna/mrna.fa &>> ./mrna/mrna_build.log
# build the mrna index
$ bowtie-build ./mrna/mrna.fa ./mrna/mrna --threads 12 &>> ./mrna/mrna_build.log
```
#### 2.2.5 Create an `rRNA` index using `bowtie`
```bash
# filter the rrna sequence
$ grep -i 'gbkey=rRNA' ./cdna.fa | cut -d ' ' -f 1 | cut -c 2- > ./rrna/rrna.ids
$ retrieve_seq -i ./cdna.fa -n ./rrna/rrna.ids -o ./rrna/rrna.fa &>> ./rrna/rrna_build.log
# build the rrna index
$ bowtie-build ./rrna/rrna.fa ./rrna/rrna --threads 12 &>> ./rrna/rrna_build.log
```
#### 2.2.6 Create an `tRNA` index using `bowtie`
```bash
# filter the trna sequence
$ grep -i 'gbkey=tRNA' ./cdna.fa | cut -d ' ' -f 1 | cut -c 2- > ./trna/trna.ids
$ retrieve_seq -i ./cdna.fa -n ./trna/trna.ids -o ./trna/trna.fa &>> ./trna/trna_build.log
# build the trna index
$ bowtie-build ./trna/trna.fa ./trna/trna --threads 12 &>> ./trna/trna_build.log
```
#### 2.2.7 Create an `ncRNA` index using `bowtie`
```bash
# filter the ncrna sequence
$ grep -iE 'gbkey=ncRNA|gbkey=lnc_RNA|gbkey=miRNA|gbkey=snoRNA|gbkey=snRNA|gbkey=misc_RNA' ./cdna.fa | cut -d ' ' -f 1 | cut -c 2- > ./ncrna/ncrna.ids
$ retrieve_seq -i ./cdna.fa -n ./ncrna/ncrna.ids -o ./ncrna/ncrna.fa &>> ./ncrna/ncrna_build.log
# build the ncrna index
$ bowtie-build ./ncrna/ncrna.fa ./ncrna/ncrna --threads 12 &>> ./ncrna/ncrna_build.log
```
#### 2.2.8 Standardized `gtf` or `gff3` files
- Explanation of `rpf_Reference`
```bash
$ rpf_Reference -h
usage: rpf_Reference [-h] -g GENOME -t GTF -o OUTPUT [-u UTR] [-c] [-l] [-w]
This script is used to build the references for the RiboParser.
options:
-h, --help show this help message and exit
-u UTR add the pseudo UTR to the leaderless transcripts (default: 0 nt).
-c only retain the protein coding transcripts (default: False).
-l only retain the longest protein coding transcripts, it is recommended to select
the longest transcript for subsequent analysis. (default: False).
-w output whole message (default: False).
Required arguments:
-g GENOME the input file name of genome sequence
-t GTF the input file name of gtf file
-o OUTPUT the prefix of output file. (prefix + _norm.gtf)
```
- create the references from GTF and Genome fasta
```bash
$ rpf_Reference \
-g ../GCF_000146045.2_R64_genomic.fna \
-t ../GCF_000146045.2_R64_genomic.gff \
-u 30 -o ./norm/gene &>> ./norm/norm_build.log
```
#### 2.2.9 Create a `genome` index using `STAR`
```bash
$ STAR \
--genomeSAindexNbases 11 \
--runThreadN 12 \
--runMode genomeGenerate \
--genomeDir ./star-index \
--genomeFastaFiles GCF_000146045.2_R64_genomic.fna \
--sjdbGTFfile ./norm/gene.norm.gtf
```
#### 2.2.10 Create a `transcriptome` index using `rsem`
```bash
$ rsem-prepare-reference \
-p 12 \
--gtf ../norm/gene.norm.gtf ../GCF_000146045.2_R64_genomic.fna ./rsem-index/rsem
```
## 3. Data preprocessing and alignment
In order to introduce the analysis process and usage of `RiboParser`, RNA-seq and Ribo-seq data from dataset `GSE67387` are used as examples here.
```shell
# dataset
https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE67387
# reference
Nedialkova DD, Leidel SA. Optimization of Codon Translation Rates via tRNA Modifications Maintains Proteome Integrity. Cell 2015 Jun 18;161(7):1606-18.
PMID: 26052047
```
### 3.1 Fundamental analysis of RNA-seq data in `GSE67387` dataset
#### 3.1.1 Download RNA-seq raw data
Use `prefetch` in `sra-tools` to download the raw SRA-format data and extract it into `fastq` format files.
```bash
$ mkdir -p ./sce/2.rawdata/rna-seq/
$ cd ./sce/2.rawdata/rna-seq/
#################################################
# download rna-seq
$ prefetch -o SRR1944925.sra SRR1944925
$ prefetch -o SRR1944926.sra SRR1944926
$ prefetch -o SRR1944927.sra SRR1944927
$ prefetch -o SRR1944928.sra SRR1944928
$ prefetch -o SRR1944929.sra SRR1944929
$ prefetch -o SRR1944930.sra SRR1944930
$ prefetch -o SRR1944931.sra SRR1944931
$ prefetch -o SRR1944932.sra SRR1944932
$ prefetch -o SRR1944933.sra SRR1944933
$ prefetch -o SRR1944934.sra SRR1944934
$ prefetch -o SRR1944935.sra SRR1944935
# decompression
for sra in *.sra
do
fastq-dump $sra
pigz *fastq
done
```
#### 3.1.2 RNA-seq data cleaning
Because the data from the gse project is cleaned, it does not include adapter and index sequences. So the following is just to show the general steps, do not need to run.
1. RNA-seq data cleaning
```bash
$ cd
$ mkdir -p ./sce/3.rna-seq/1.cleandata/
$ cd ./sce/3.rna-seq/1.cleandata/
#################################################
# run the cutadapt
for fq in ../../2.rawdata/rna-seq/*fastq.gz
do
cutadapt --match-read-wildcards \
-a AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGC \
-m 25 -O 6 -j 12 \
-o `\basename $fq fastq.gz`clean.fastq.gz $fq &> $fq".log"
done
```
#### 3.1.3 Align clean data to different types of reference files
To assess library quality and eliminate the impact of reads originating from different non-coding RNAs (ncRNAs) on subsequent analysis, we employed `bowtie` to classify the reads form sequencing data.
Under normal circumstances, especially for RNA-seq libraries constructed using the oligo(dT) method, most reads originate from mRNA. Therefore, for RNA-seq analysis, this step is generally not necessary. It is suitable for use in libraries constructed by the rRNA-depletion method.
1. Align the RNA-seq data to references
```bash
$ cd
$ mkdir -p ./sce/3.rna-seq/2.bowtie/
$ cd ./sce/3.rna-seq/2.bowtie/
#################################################
# set database
rrna='../../sce/1.reference/rrna/rrna'
trna='../../sce/1.reference/trna/trna'
ncrna='../../sce/1.reference/ncrna/ncrna'
mrna='../../sce/1.reference/mrna/mrna'
chrom='../../sce/1.reference/genome/genome'
threads=12
mismatch=1
# alignment reads to reference
for fq in ../1.cleandata/*fastq.gz
do
fqname=`\basename $fq .fastq.gz`
## rrna
bowtie -p $threads -v $mismatch --un="$fqname".norrna.fq --al="$fqname".rrna.fq \
-x $rrna $fq -S "$fqname".rrna.sam 2>> "$fqname".log
## trna
bowtie -p $threads -v $mismatch --un="$fqname".notrna.fq --al="$fqname".trna.fq \
-x $trna "$fqname".norrna.fq -S "$fqname".trna.sam 2>> "$fqname".log
## ncrna
bowtie -p $threads -v $mismatch --un="$fqname".noncrna.fq --al="$fqname".ncrna.fq \
-x $ncrna "$fqname".notrna.fq -S "$fqname".ncrna.sam 2>> "$fqname".log
## mrna
bowtie -p $threads -v $mismatch --un="$fqname".nomrna.fq --al="$fqname".mrna.fq \
-x $mrna "$fqname".noncrna.fq -S "$fqname".mrna.sam 2>> "$fqname".log
## genome
bowtie -p $threads -v $mismatch --un="$fqname".nogenome.fq --al="$fqname".genome.fq \
-x $chrom "$fqname".nomrna.fq -S "$fqname".genome.sam 2>> "$fqname".log
## compress fastq
pigz *fq
## compress sam
for sam in *.sam
do
samtools view -h -F 4 $sam | samtools sort -@ $threads -o `\basename $sam sam`bam
rm $sam
done
done
```
2. Statistical alignment results for all references.
- Explanation of `merge_bwt_log`
```bash
$ merge_bwt_log -h
Step1: Checking the input Arguments.
usage: merge_bwt_log [-h] -l LIST [LIST ...] -o OUTPUT [-n NAME]
This script is used to statstic the mapped reads from log files
options:
-h, --help show this help message and exit
-n NAME, --name NAME set the name of each database (default: rRNA,tRNA,ncRNA,mRNA,Genome).
Required arguments:
-l LIST [LIST ...], --list LIST [LIST ...]
List for bowtie mapping log files (e.g., '*log').
-o OUTPUT prefix of output file name.
```
- Statistical alignment results
```bash
#################################################
# merge all log files
merge_bwt_log -n rRNA,tRNA,ncRNA,mRNA,Genome -l *log -o RNA_seq &>> merge_bowtie.log
```
#### 3.1.4 Aligning mRNA reads using `STAR`
Following the removal of ncRNA reads, the remaining clean reads were realigned to the yeast genome using `STAR`.
1. Aligning mRNA reads (RNA-seq) using `STAR`
```bash
$ cd
$ mkdir -p ./sce/3.rna-seq/3.star/
$ cd ./sce/3.rna-seq/3.star/
#################################################
# set the option and database
genome='../../1.reference/star-index/'
threads=12
#################################################
# map the all rna-seq reads to genome and transcriptome region
for fastq in ../2.bowtie/*.noncrna.fq.gz
do
## get file name
output=$(basename $fastq .noncrna.fq.gz)
#################################################
## run the alignment
STAR --runThreadN $threads \
--readFilesCommand zcat \
--genomeDir $genome \
--readFilesIn $fastq \
--outFileNamePrefix $output \
--outSAMtype BAM Unsorted \
--outFilterType BySJout \
--quantMode TranscriptomeSAM GeneCounts \
--outReadsUnmapped Fastx \
--outSAMattributes All \
--alignEndsType Local \
--outFilterMultimapNmax 3 \
--outFilterMismatchNmax 1 \
--alignIntronMax 10000 \
--outFilterMatchNmin 20
# --outWigType wiggle --outWigNorm RPM
pigz *mate1
#################################################
## sort the bam file
samtools sort -@ $threads $output"Aligned.out.bam" -o $output"Aligned.sortedByCoord.out.bam"
samtools index -@ $threads $output"Aligned.sortedByCoord.out.bam"
rm $output"Aligned.out.bam"
done
```
#### 3.1.5 Estimating gene expression levels with either `RSEM` or `featureCounts`
Both `RSEM` and `featureCounts` can be employed to quantify gene expression levels. For the purpose of this analysis, we will utilize `RSEM` as a representative tool.
1. Estimating transcript abundance using RNA-seq data
```bash
$ cd
$ mkdir -p ./sce/3.rna-seq/4.quantification/
$ cd ./sce/3.rna-seq/4.quantification/
#################################################
# quantify the gene expression
for bam in ../3.star/*Aligned.toTranscriptome.out.bam
do
rsem-calculate-expression -p 10 \
--no-bam-output --alignments \
-q $bam ../../1.reference/rsem-index/rsem `\basename $bam Aligned.toTranscriptome.out.bam`
# rsem-calculate-expression -p 10 \
# --paired-end --no-bam-output --alignments \
# -q $bam ../../1.reference/rsem-index/rsem `\basename $bam Aligned.toTranscriptome.out.bam`
done
```
2. Integrating RNA-seq quantification values for all samples
- Explanation of `merge_rsem`
```bash
$ merge_rsem -h
usage: merge_rsem [-h] -l LIST [LIST ...] -o OUTPUT [-c {expected_count,TPM,FPKM}]
This script is used to merge specified columns from result files
options:
-h, --help show this help message and exit
-c {expected_count,TPM,FPKM}, --column {expected_count,TPM,FPKM}
Column name to merge (e.g., 'expected_count').
Required arguments:
-l LIST [LIST ...], --list LIST [LIST ...]
List for result files (e.g., '*results').
-o OUTPUT output file name.
```
- Integrating RNA-seq quantification values
```bash
#################################################
# merge the gene expression
merge_rsem -c expected_count -l *.genes.results -o gene.expected_count.txt &>> merge_rsem.log
merge_rsem -c TPM -l *.genes.results -o gene.TPM.txt &>> merge_rsem.log
merge_rsem -c FPKM -l *.genes.results -o gene.FPKM.txt &>> merge_rsem.log
#################################################
# merge the isoforms expression
merge_rsem -c expected_count -l *.isoforms.results -o isoforms.expected_count.txt &>> merge_rsem.log
merge_rsem -c TPM -l *.isoforms.results -o isoforms.TPM.txt &>> merge_rsem.log
merge_rsem -c FPKM -l *.isoforms.results -o isoforms.FPKM.txt &>> merge_rsem.log
```
### 3.2 Fundamental analysis of Ribo-seq data in `GSE67387` dataset
#### 3.2.1 Download Ribo-seq raw data
Use `prefetch` in `sra-tools` to download the raw SRA-format data and extract it into `fastq` format files.
```bash
$ cd
$ mkdir -p ./sce/2.rawdata/ribo-seq/
$ cd ./sce/2.rawdata/ribo-seq/
#################################################
# download ribo-seq
prefetch -o SRR1944912.sra SRR1944912
prefetch -o SRR1944913.sra SRR1944913
prefetch -o SRR1944914.sra SRR1944914
prefetch -o SRR1944915.sra SRR1944915
prefetch -o SRR1944916.sra SRR1944916
prefetch -o SRR1944917.sra SRR1944917
prefetch -o SRR1944918.sra SRR1944918
prefetch -o SRR1944919.sra SRR1944919
prefetch -o SRR1944920.sra SRR1944920
prefetch -o SRR1944921.sra SRR1944921
prefetch -o SRR1944922.sra SRR1944922
prefetch -o SRR1944923.sra SRR1944923
# decompression
for sra in *.sra
do
fastq-dump $sra
pigz *fastq
done
```
#### 3.2.2 Ribo-seq data cleaning
```bash
$ cd
$ mkdir -p ./sce/4.ribo-seq/1.cleandata/
$ cd ./sce/4.ribo-seq/1.cleandata/
#################################################
# run the cutadapt
for fq in ../../2.rawdata/ribo-seq/*fastq.gz
do
cutadapt --match-read-wildcards \
-a AAAAAAAA \
-m 25 -O 6 -j 10 \
-o `\basename $fq fastq.gz`clean.fastq.gz $fq &> $fq".log"
done
```
#### 3.2.3 Align clean data to different types of reference files
To assess library quality and eliminate the impact of reads originating from different non-coding RNAs (ncRNAs) on subsequent analysis, we employed `bowtie` to classify the reads form sequencing data.
1. Aligning Ribo-seq data
```bash
$ cd
$ mkdir -p ./sce/4.ribo-seq/2.bowtie/
$ cd ./sce/4.ribo-seq/2.bowtie/
#################################################
# set database
rrna='../../sce/1.reference/rrna/rrna'
trna='../../sce/1.reference/trna/trna'
ncrna='../../sce/1.reference/ncrna/ncrna'
mrna='../../sce/1.reference/mrna/mrna'
chrom='../../sce/1.reference/genome/genome'
threads=12
mismatch=1
# alignment reads to reference
for fq in ../1.cleandata/*fastq.gz
do
fqname=`\basename $fq .fastq.gz`
## rrna
bowtie -p $threads -v $mismatch --un="$fqname".norrna.fq --al="$fqname".rrna.fq \
-x $rrna $fq -S "$fqname".rrna.sam 2>> "$fqname".log
## trna
bowtie -p $threads -v $mismatch --un="$fqname".notrna.fq --al="$fqname".trna.fq \
-x $trna "$fqname".norrna.fq -S "$fqname".trna.sam 2>> "$fqname".log
## ncrna
bowtie -p $threads -v $mismatch --un="$fqname".noncrna.fq --al="$fqname".ncrna.fq \
-x $ncrna "$fqname".notrna.fq -S "$fqname".ncrna.sam 2>> "$fqname".log
## mrna
bowtie -p $threads -v $mismatch --un="$fqname".nomrna.fq --al="$fqname".mrna.fq \
-x $mrna "$fqname".noncrna.fq -S "$fqname".mrna.sam 2>> "$fqname".log
## genome
bowtie -p $threads -v $mismatch --un="$fqname".nogenome.fq --al="$fqname".genome.fq \
-x $chrom "$fqname".nomrna.fq -S "$fqname".genome.sam 2>> "$fqname".log
## compress fastq
pigz *fq
## compress sam
for sam in *.sam
do
samtools view -h -F 4 $sam | samtools sort -@ $threads -o `\basename $sam sam`bam
rm $sam
done
done
```
2. Statistical alignment results for all databases.
```bash
#################################################
# merge all log files
merge_bwt_log -n rRNA,tRNA,ncRNA,mRNA,Genome -l *log -o sce &>> merge_bowtie.log
```
#### 3.2.4 Aligning mRNA reads (Ribo-seq) using `STAR`
Following the removal of ncRNA reads, the remaining clean reads were realigned to the yeast genome using `STAR`.
```bash
$ cd
$ mkdir -p ./sce/4.ribo-seq/3.star/
$ cd ./sce/4.ribo-seq/3.star/
#################################################
# set the option and database
genome='../../1.reference/star-index/'
threads=12
#################################################
# map the all rna-seq reads to genome and transcriptome region
for fastq in ../2.bowtie/*.noncrna.fq.gz
do
## get file name
output=$(basename $fastq .noncrna.fq.gz)
#################################################
## run the alignment
STAR --runThreadN $threads \
--readFilesCommand zcat \
--genomeDir $genome \
--readFilesIn $fastq \
--outFileNamePrefix $output \
--outSAMtype BAM Unsorted \
--outFilterType BySJout \
--quantMode TranscriptomeSAM GeneCounts \
--outReadsUnmapped Fastx \
--outSAMattributes All \
--alignEndsType Local \
--outFilterMultimapNmax 3 \
--outFilterMismatchNmax 1 \
--alignIntronMax 10000 \
--outFilterMatchNmin 20
# --outWigType wiggle --outWigNorm RPM
pigz *mate1
#################################################
## sort the bam file
samtools sort -@ $threads $output"Aligned.out.bam" -o $output"Aligned.sortedByCoord.out.bam"
samtools index -@ $threads $output"Aligned.sortedByCoord.out.bam"
rm $output"Aligned.out.bam"
done
```
#### 3.2.5 Estimating gene expression levels with either `RSEM` or `featureCounts`
Both `RSEM` and `featureCounts` can be employed to quantify gene expression levels. For the purpose of this analysis, we will utilize `RSEM` as a representative tool.
1. Estimating transcript abundance using Ribo-seq data
```bash
$ cd
$ mkdir -p ./sce/4.ribo-seq/4.quantification/
$ cd ./sce/4.ribo-seq/4.quantification/
#################################################
# quantify the isoforms expression
for bam in ../3.star/*Aligned.toTranscriptome.out.bam
do
rsem-calculate-expression -p 12 \
--no-bam-output --alignments \
-q $bam ../../1.reference/rsem-index/rsem `\basename $bam Aligned.toTranscriptome.out.bam`
done
```
2. Integrating Ribo-seq quantification values for all samples
```bash
#################################################
# merge the gene expression
merge_rsem -c expected_count -l *.genes.results -o gene.expected_count.txt &>> merge_rsem.log
merge_rsem -c TPM -l *.genes.results -o gene.TPM.txt &>> merge_rsem.log
merge_rsem -c FPKM -l *.genes.results -o gene.FPKM.txt &>> merge_rsem.log
#################################################
# merge the isoforms expression
merge_rsem -c expected_count -l *.isoforms.results -o isoforms.expected_count.txt &>> merge_rsem.log
merge_rsem -c TPM -l *.isoforms.results -o isoforms.TPM.txt &>> merge_rsem.log
merge_rsem -c FPKM -l *.isoforms.results -o isoforms.FPKM.txt &>> merge_rsem.log
```
## 4. Perform RNA-seq data analysis of `GSE67387` with `RiboParser`
### 4.0 Prepare the directory to store the results
```bash
$ cd
$ mkdir -p ./3.rna-seq/5.riboparser/
$ cd ./3.rna-seq/5.riboparser/
$ mkdir 01.qc 02.digestion 03.offset 04.density 05.merge \
06.periodicity 07.metaplot 08.coverage 09.correlation 10.shuffle
```
### 4.1 Quality check of sequencing data
1. Quality check of RNA-seq data
```bash
$ cd ./01.qc
#################################################
# check the ribo-seq quality
for bam in ../../3.star/*Aligned.toTranscriptome.out.bam
do
prefix_name=$(basename $bam Aligned.toTranscriptome.out.bam)
rpf_Check -b $bam -s --thread 10 \
-t ../../../1.reference/norm/gene.norm.txt \
-o $prefix_name &> $prefix_name".log"
done
```
2. Integrating RNA-seq quality check results for all samples
```bash
#################################################
# merge the rna-seq quality results
merge_length -l *length_distribution.txt -o sce
merge_saturation -l *gene_saturation.txt -o sce
cd ..
```
### 4.2 Enzymatic bias in NGS library preparation
1. Bias in restriction enzyme digestion and ligation in sequencing data
```bash
$ cd ./02.digestion/
#################################################
# check the reads digestion
for bam in ../01.qc/*.bam
do
prefix_name=$(basename $bam .bam)
rpf_Digest -b $bam -m 25 -M 50 --scale \
-s ../../../1.reference/norm/gene.norm.rna.fa \
-t ../../../1.reference/norm/gene.norm.txt \
-o $prefix_name &> $prefix_name".log"
done
```
2. Integrating reads digestion results for all samples
```bash
#################################################
# merge the rpf digestion
merge_digestion -l ./02.digestion/*pwm.txt -o sce
cd ..
```
### 4.3 Use `RiboParser` to create the `offset` table
1. Create the `offset` table for RNA-seq
`Offset` prediction is unnecessary for RNA-seq analysis. A constant `offset` of 12 can be assigned to all entries in the table.
```bash
$ cd ./03.offset/
#################################################
# set the offset table
for bam in ../01.qc/*.bam
do
prefix_name=$(basename $bam .bam)
rna_Offset -m 27 -M 50 -e 12 -o $prefix_name &> $prefix_name".log"
done
```
### 4.4 Convert the `BAM` file to reads density
Transform the read counts in a `BAM` file into density values and save them in a `TXT` format file.
1. Transform RNA-seq data output
```bash
$ cd ./04.density/
#################################################
# convert the reads to density
for bam in ../01.qc/*.bam
do
prefix_name=$(basename $bam .bam)
rna_Density -b $bam -m 27 -M 33 -l --thread 10 \
-p ../03.offset/$prefix_name"_offset.txt" \
-s ../../../1.reference/norm/gene.norm.rna.fa \
-t ../../../1.reference/norm/gene.norm.txt \
-o $prefix_name &> $prefix_name".log"
done
cd ..
```
### 4.5 Integrating density file for all samples
Data from various batches and samples can be integrated for unified analysis. If not integrated, individual analysis for each sample is required, increasing the number of operational steps.
1. Integrating RNA-seq density results for all samples
```bash
$ cd ./05.merge/
#################################################
# create the samples file: RNA.file.list
merge_dst_list -l ../04.density/*_rna.txt -o RNA.file.list
cat RNA.file.list
Name File Type
wt_rna_YPD1 /home/sce/3.rna-seq/5.riboparser/04.density/SRR1944924_rna.txt RNA
wt_rna_YPD2 /home/sce/3.rna-seq/5.riboparser/04.density/SRR1944925_rna.txt RNA
wt_rna_YPD3 /home/sce/3.rna-seq/5.riboparser/04.density/SRR1944926_rna.txt RNA
ncs2d_rna_YPD1 /home/sce/3.rna-seq/5.riboparser/04.density/SRR1944927rna.txt RNA
ncs2d_rna_YPD2 /home/sce/3.rna-seq/5.riboparser/04.density/SRR1944928_rna.txt RNA
ncs2d_rna_YPD3 /home/sce/3.rna-seq/5.riboparser/04.density/SRR1944929_rna.txt RNA
elp6d_rna_YPD1 /home/sce/3.rna-seq/5.riboparser/04.density/SRR1944930_rna.txt RNA
elp6d_rna_YPD2 /home/sce/3.rna-seq/5.riboparser/04.density/SRR1944931_rna.txt RNA
elp6d_rna_YPD3 /home/sce/3.rna-seq/5.riboparser/04.density/SRR1944932_rna.txt RNA
ncs2d_elp6d_rna_YPD1 /home/sce/3.rna-seq/5.riboparser/04.density/SRR1944933_rna.txt RNA
ncs2d_elp6d_rna_YPD2 /home/sce/3.rna-seq/5.riboparser/04.density/SRR1944934_rna.txt RNA
ncs2d_elp6d_rna_YPD3 /home/sce/3.rna-seq/5.riboparser/04.density/SRR1944935_rna.txt RNA
#################################################
# merge all the RNA-seq files
rpf_Merge -l RNA.file.list -o RNA &> RNA.log
cd ..
```
### 4.6 Calculate tri-nucleotide periodicity
1. Check the tri-nucleotide periodicity of RNA-seq
```bash
$ cd ./06.periodicity/
#################################################
# check the periodicity
rpf_Periodicity \
-r ../05.merge/RNA_merged.txt \
-m 30 --tis 0 --tts 0 -o RNA &> RNA.log
```
### 4.7 `Meta-gene` analysis
Investigation of reads density in the vicinity of start and stop codons using `meta-gene` analysis.
1. `Meta-gene` analysis of RNA-seq
```bash
$ cd ./07.metaplot/
#################################################
# metagene analysis
rpf_Metaplot \
-t ../../../1.reference/norm/gene.norm.txt \
-r ../05.merge/RNA_merged.txt \
-m 50 --mode bar -o RNA &> RNA.log
```
### 4.8 Gene coverage
Examine the distribution of read density along the gene body.
1. Check gene density of RNA-seq
```bash
$ cd ./08.coverage/
#################################################
# check the reads density along with the gene body
rpf_Coverage \
-t ../../../1.reference/norm/gene.norm.txt \
-r ../05.merge/RNA_merged.txt \
-m 50 --outlier \
-b 10,100,10 \
-n --heat \
-o RNA &> RNA.log
```
### 4.9 Check the repeatability of samples
1. Check the repeatability of RNA-seq
```bash
$ cd ./09.correlation/
#################################################
# calculate the samples replication of RNA-seq
rpf_Corr \
-r ../05.merge/RNA_merged.txt \
-o RNA &> RNA.log
```
## 5. Perform RNA-seq data analysis of `GSE67387` with `RiboParser`
### 5.0 Prepare the directory to store the results
```bash
$ cd
$ mkdir -p ./4.ribo-seq/5.riboparser/
$ cd ./4.ribo-seq/5.riboparser/
$ mkdir 01.qc 02.digestion 03.offset 04.density 05.merge \
06.periodicity 07.metaplot 08.coverage 09.correlation 10.quantification \
11.pausing_score 12.codon_occupancy 13.codon_decoding_time 14.codon_selection_time \
15.coefficient_of_variation 16.meta_codon 17.shuffle 18.retrieve
```
### 5.1 Quality check of sequencing data
To ensure the reliability of downstream analyses, rigorous quality control (QC) of ribosome profiling data must include systematic evaluation of peak detection and gene detection rate.
`Peak Detection`:\
Calculate the dominant RPF length (expected range: 26–34 nt) and its proportion within the total reads (QC pass threshold: ≥70% in expected range).
`Gene Detection Rate`:\
Sampling from 5% to 95% of the data, the number of covered genes and their expression levels are calculated. When the slope of the gene count curve approaches 0, it typically indicates that the sequencing data is near saturation.
1. Explanation of `rpf_Check`
```bash
$ rpf_Check -h
Check the RPFs mapping condition.
Step1: Checking the input Arguments.
usage: rpf_Check [-h] -t TRANSCRIPT -b BAM -o OUTPUT [--thread THREAD] [-g {0,1}] [-a {star,hisat2,bowtie2}] [-r] [-l] [-s]
This script is used to summary the BAM condition.
options:
-h, --help show this help message and exit.
--thread THREAD the number of threads (default: 1). Suitable for large bam files > 1G.
It will take a lot of memory.
-g {0,1} filter the number of reads mapped loci (default: 0). [0]: all reads will be
used; [1]: reads with unique mapped loci will be used
-a {star,hisat2,bowtie2}
screen the reads mapped loci from BAM file generate with different reads alignment methods (default: star).
-r reads aligned to negative strand will also be counted. (default: False).
-l only keep the longest transcripts (default: False).
-s whether to calculate RPF saturation. (default: False). This step will take
a lot of time and memory.
Required arguments:
-t TRANSCRIPT the input file name of gene annotation.
-b BAM the input file name of bam.
-o OUTPUT the prefix of output file.
```
2. Quality check of ribo-seq data
```bash
$ cd ./01.qc/
#################################################
# check the ribo-seq quality
for bam in ../../3.star/*Aligned.toTranscriptome.out.bam
do
prefix_name=$(basename $bam Aligned.toTranscriptome.out.bam)
rpf_Check -b $bam -s --thread 10 \
-t ../../../1.reference/norm/gene.norm.txt \
-o $prefix_name &> $prefix_name".log"
done
```
3. Results of `rpf_Check`
```bash
# results of sample SRR1944912
SRR1944912.bam # Filtered and sorted BAM file
SRR1944912.bam.bai # Index file for the BAM file
SRR1944912_gene_saturation.pdf # Barplot for gene saturation analysis
SRR1944912_gene_saturation.png # Barplot for gene saturation analysis
SRR1944912_gene_saturation.txt # Statistical results of gene saturation analysis
SRR1944912_length_distribution.pdf # Line graph of reads length distribution
SRR1944912_length_distribution.png # Line graph of reads length distribution
SRR1944912_length_distribution.txt # Statistical results of reads length distribution
SRR1944912.log # Log file of program execution
SRR1944912_reads_saturation.pdf # Boxplot for reads saturation analysis
SRR1944912_reads_saturation.png # Boxplot for reads saturation analysis
SRR1944912_reads_saturation.txt # Statistical results of reads saturation analysis
```
4. Explanation of `merge_length` and `merge_saturation`
```bash
Step1: Checking the input Arguments.
usage: merge_length [-h] -l LIST [LIST ...] -o OUTPUT
This script is used to merge the length distribution files
options:
-h, --help show this help message and exit
Required arguments:
-l LIST [LIST ...], --list LIST [LIST ...]
List for length distribution files (e.g., '*length_distribution.txt').
-o OUTPUT prefix of output file name.
```
5. Integrating Ribo-seq quality check results for all samples
```bash
#################################################
# merge the ribo-seq quality results
merge_length -l *length_distribution.txt -o RIBO
merge_saturation -l *gene_saturation.txt -o RIBO
cd ..
```
### 5.2 Enzymatic bias in NGS library preparation
Ribonuclease digestion and ligation steps in Ribo-seq can introduce biases that affect the representation of RNA fragments.
`Ribonuclease Digestion Bias`:\
Ribonucleases may preferentially cleave certain RNA sequences or structures, leading to uneven fragmentation and skewed representation of RNA species in the data.
`Ligation Bias`:\
Ligation efficiency can vary by sequence context, RNA fragment length, or secondary structure. This may result in overrepresentation of certain fragments due to more favorable ligation sites.
To reduce these biases, it's important to analyze digestion and ligation efficiency, optimize protocols, and include controls to ensure data accuracy and reliability.
1. Explanation of `rpf_Digest`
```bash
$ rpf_Digest -h
Step1: Checking the input Arguments.
usage: rpf_Digest [-h] -t TRANSCRIPT -s SEQUENCE -b BAM -o OUTPUT [-l] [--scale] [-m MIN] [-M MAX]
This script is used to Detect the digestion sites.
options:
-h, --help show this help message and exit
-l only retain the transcript with longest CDS of each gene (default: False).Recommended : True
--scale scale the motif matrix (default: False).
-m MIN the minimum reads length to keep (default: 20 nt).
-M MAX the maximum reads length to keep (default: 100 nt).
Required arguments:
-t TRANSCRIPT the name of input transcript file in TXT format.
-s SEQUENCE the name of input transcript sequence file in FA format.
-b BAM the name of mapping file in BAM format.
-o OUTPUT the name of output file. (prefix + _digestion_sites.txt)
```
2. Bias in restriction enzyme digestion and ligation in sequencing data
```bash
$ cd ./02.digestion/
#################################################
# check the reads digestion
for bam in ../01.qc/*.bam
do
prefix_name=$(basename $bam .bam)
rpf_Digest -b $bam -m 27 -M 33 --scale \
-s ../../../1.reference/norm/gene.norm.rna.fa \
-t ../../../1.reference/norm/gene.norm.txt \
-o $prefix_name &> $prefix_name".log"
done
```
3. Results of `rpf_Digest`
```bash
# results of SRR1944912
SRR1944912_3end_counts.txt # Statistical values of different bases at the 3'-end of reads
SRR1944912_3end_pwm.txt # PWM matrix of bases at the 3'-end of reads
SRR1944912_3end_seqlogo2.pdf # Seqlogo of bases at the 3'-end of reads
SRR1944912_5end_counts.txt # Statistical values of different bases at the 5'-end of reads
SRR1944912_5end_pwm.txt # PWM matrix of bases at the 5'-end of reads
SRR1944912_5end_seqlogo2.pdf # Seqlogo of bases at the 5'-end of reads
SRR1944912_digestion_sites.txt # Open reading frame aligned at the ends of reads
SRR1944912.log # Log file of program execution
SRR1944912_scaled_digestion_sites_plot.pdf # Heatmap of open reading frame aligned at the ends of reads
```
4. Explanation of `merge_digestion`
```bash
$ merge_digestion -h
Step1: Checking the input Arguments.
usage: merge_digestion [-h] -l LIST [LIST ...] -o OUTPUT
This script is used to merge the reads digestion files
options:
-h, --help show this help message and exit
Required arguments:
-l LIST [LIST ...], --list LIST [LIST ...]
List for digestion files (e.g., '*_5end_pwm.txt').
-o OUTPUT prefix of output file name.
```
5. Integrating reads digestion results for all samples
```bash
#################################################
# merge the rpf digestion
merge_digestion -l *pwm.txt -o RIBO
cd ..
```
### 5.3 Use `RiboParser` to predict the optimal offset
The ability of Ribo-seq to analyze translation at codon resolution relies on accurately identifying the codons located at ribosomal `A-`, `P-`, and `E-`sites for each RPF.
The offset represents the distance of the 5’ end of the RPF to the first nucleotide of the `P-site` codon.
Two commonly used methods are the ribosome structure-based model (`RSBM`) and the start/stop-based model (`SSCBM`).
1. Explanation of `rpf_Offset`
```bash
$ rpf_Offset -h
Step1: Checking the input Arguments.
usage: rpf_Offset [-h] -t TRANSCRIPT -b BAM -o OUTPUT [--mode {SSCBM,RSBM}] [-a {both,tis,tts}] [-l] [-m MIN] [-M MAX] [-p EXP_PEAK] [-s SHIFT] [--silence] [-d]
This script is used to detect the P-site offset.
options:
-h, --help show this help message and exit
--mode {SSCBM,RSBM} specify the mode of offset detect [SSCBM, RSBM]. (default: SSCBM).
-a {both,tis,tts} specify the alignment of reads for offset detect [both, tis, tts]. (default: both).
-l only retain the transcript with longest CDS of each gene (default: False).
-m MIN the minimum reads length to keep (default: 27 nt).
-M MAX the maximum reads length to keep (default: 33 nt).
-p EXP_PEAK expected RPFs length fitted to ribosome structure [~30 nt] (default: 30 nt).
-s SHIFT psite shift for different RPFs length. (default: 2 nt).
--silence discard the warning information. (default: True).
-d output the details of offset (default: False).
Required arguments:
-t TRANSCRIPT the name of input transcript filein TXT format.
-b BAM the name of mapping file in BAM format.
-o OUTPUT the prefix of output file. (prefix + _offset.txt)
```
2. Predict the optimal offset of Ribo-seq
```bash
$ cd ../03.offset/
#################################################
# predict the offset table
for bam in ../01.qc/*.bam
do
prefix_name=$(basename $bam .bam)
rpf_Offset -b $bam -m 27 -M 33 -p 30 -d \
--mode SSCBM \
-t ../../../1.reference/norm/gene.norm.txt \
-o $prefix_name &> $prefix_name".log"
done
```
3. Results of `rpf_Offset`
```bash
# results of SRR1944912
SRR1944912.log # Log file of program execution
SRR1944912_RSBM_offset.pdf # Heatmap of RSBM model calculation results
SRR1944912_RSBM_offset.png # Heatmap of RSBM model calculation results
SRR1944912_RSBM_offset.txt # Calculation results of the RSBM model
SRR1944912_SSCBM_offset.pdf # Heatmap of SSCBM model calculation results
SRR1944912_SSCBM_offset.png # Heatmap of SSCBM model calculation results
SRR1944912_SSCBM_offset_scale.pdf # Row-normalized heatmap of SSCBM model calculation results
SRR1944912_SSCBM_offset_scale.png # Row-normalized heatmap of SSCBM model calculation results
SRR1944912_SSCBM_offset.txt # Calculation results of the SSCBM model
SRR1944912_tis_3end.txt # Distribution statistics of reads 3'-end at the start codon
SRR1944912_tis_5end.txt # Distribution statistics of reads 5'-end at the start codon
SRR1944912_tts_3end.txt # Distribution statistics of reads 3'-end at the stop codon
SRR1944912_tts_5end.txt # Distribution statistics of reads 5'-end at the stop codon
```
4. Explanation of `merge_offset`
```bash
$ merge_offset -h
Step1: Checking the input Arguments.
usage: merge_offset [-h] -l LIST [LIST ...] -o OUTPUT
This script is used to merge the reads offset files
options:
-h, --help show this help message and exit
Required arguments:
-l LIST [LIST ...], --list LIST [LIST ...]
List for RSBM/SSCBM offset files (e.g., '*RSBM_offset.txt').
-o OUTPUT prefix of output file name (default: prefix + _offset.txt).
```
5. Integrating offset table results for all samples
```bash
#################################################
# merge the ribo-seq offset results
merge_offset_detail -l *end.txt -o RIBO
merge_offset -l *SSCBM_offset.txt -o RIBO_SSCBM
merge_offset -l *RSBM_offset.txt -o RIBO_RSBM
cd ..
```
### 5.4 Convert the `BAM` file to reads density
Transform the read counts in a `BAM` file into density values and save them in a `TXT` format file.
1. Explanation of `rpf_Density`
```bash
$ rpf_Density -h
Convert reads to RPFs density.
Step1: Checking the input Arguments.
usage: rpf_Density [-h] -t TRANSCRIPT -s SEQUENCE -b BAM -p PSITE -o OUTPUT [-l] [-m MIN] [-M MAX] [--period PERIODICITY] [--silence] [--thread THREAD]
This script is used to convert Ribo-seq bam to p-site density.
options:
-h, --help show this help message and exit
-l only retain the transcript with longest CDS of each gene (default: False).Recommended : True
-m MIN the minimum reads length to keep (default: 27 nt).
-M MAX the maximum reads length to keep (default: 33 nt).
--silence discard the warning information. (default: True).
--thread THREAD the number of threads (default: 1). It will take a lot of memory.
Required arguments:
-t TRANSCRIPT the name of input transcript file in TXT format.
-s SEQUENCE the name of input transcript sequence file in FA format.
-b BAM the name of mapping file in BAM format.
-p PSITE the name of p-site offset file in TXT format.
-o OUTPUT the prefix of output file. (output = prefix + _rpf.txt)
--period PERIODICITY the minimum 3nt periodicity to keep. (default: 40).
```
2. Transform Ribo-seq data output
```bash
#################################################
# convert the rpf to density
for bam in ../01.qc/*.bam
do
prefix_name=$(basename $bam .bam)
rpf_Density -b $bam -m 27 -M 33 --period 40 -l --thread 12 \
-p ../03.offset/$prefix_name"_SSCBM_offset.txt" \
-s ../../../1.reference/norm/gene.norm.rna.fa \
-t ../../../1.reference/norm/gene.norm.txt \
-o $prefix_name &> $prefix_name".log"
done
cd ..
```
3. Results of `rpf_Density`
```bash
# results of SRR1944912
SRR1944912.log # Log file of program execution
SRR1944912_rpf.txt # File contains RPFs density on each gene
```
### 5.5 Integrating density file for all samples
Data from various batches and samples can be integrated for unified analysis. If not integrated, individual analysis for each sample is required, increasing the number of operational steps.
1. Explanation of `merge_dst_list`
```bash
$ merge_dst_list -h
Step1: Checking the input Arguments.
usage: merge_dst_list [-h] -l LIST [LIST ...] [-o OUTPUT]
This script is used to create the list of density files
options:
-h, --help show this help message and exit
Required arguments:
-l LIST [LIST ...], --list LIST [LIST ...]
List for density files (e.g., '*_rpf.txt').
-o OUTPUT output file name (default: RIBO.file.list).
```
2. create the samples list
```bash
$ cd ./05.merge/
#################################################
# create the samples file: Ribo.file.list
merge_dst_list -l ../04.density/*_rpf.txt -o RIBO.file.list
cat RIBO.file.list
Name File Type
wt_ribo_YPD1 /home/sce/4.ribo-seq/5.riboparser/04.density/SRR1944912_rpf.txt Ribo
wt_ribo_YPD2 /home/sce/4.ribo-seq/5.riboparser/04.density/SRR1944913_rpf.txt Ribo
wt_ribo_YPD3 /home/sce/4.ribo-seq/5.riboparser/04.density/SRR1944914_rpf.txt Ribo
ncs2d_ribo_YPD1 /home/sce/4.ribo-seq/5.riboparser/04.density/SRR1944915_rpf.txt Ribo
ncs2d_ribo_YPD2 /home/sce/4.ribo-seq/5.riboparser/04.density/SRR1944916_rpf.txt Ribo
ncs2d_ribo_YPD3 /home/sce/4.ribo-seq/5.riboparser/04.density/SRR1944917_rpf.txt Ribo
elp6d_ribo_YPD1 /home/sce/4.ribo-seq/5.riboparser/04.density/SRR1944918_rpf.txt Ribo
elp6d_ribo_YPD2 /home/sce/4.ribo-seq/5.riboparser/04.density/SRR1944919_rpf.txt Ribo
elp6d_ribo_YPD3 /home/sce/4.ribo-seq/5.riboparser/04.density/SRR1944920_rpf.txt Ribo
ncs2d_elp6d_ribo_YPD1 /home/sce/4.ribo-seq/5.riboparser/04.density/SRR1944921_rpf.txt Ribo
ncs2d_elp6d_ribo_YPD2 /home/sce/4.ribo-seq/5.riboparser/04.density/SRR1944922_rpf.txt Ribo
ncs2d_elp6d_ribo_YPD3 /home/sce/4.ribo-seq/5.riboparser/04.density/SRR1944923_rpf.txt Ribo
```
3. Explanation of `rpf_Merge`
```bash
$ rpf_Merge -h
Merge RPFs files from different samples.
Step1: Checking the input Arguments.
usage: rpf_Merge [-h] -l LIST -o OUTPUT
This script is used to merge the density file.
options:
-h, --help show this help message and exit
Required arguments:
-l LIST the sample list in TXT format.
-o OUTPUT the prefix of output file. (prefix + _merged.txt)
```
4. Integrating Ribo-seq density results for all samples
```bash
#################################################
# merge all the Ribo-seq files
rpf_Merge -l RIBO.file.list -o RIBO &> RIBO.log
cd ..
```
5. Results of `rpf_Merge`
```bash
RIBO.log # Log file of program execution
RIBO.file.list # RPFs density file list of samples
RIBO_merged.txt # Merged RPFs density file
```
### 5.6 Calculate tri-nucleotide periodicity
Tri-nucleotide periodicity serves as a critical quality metric in Ribo-seq data analysis, fundamentally determining the biological interpretability of codon-resolution findings.
High-quality periodicity (typically >0.6 phase coherence score) constitutes an essential prerequisite for robust codon-level analysis, whereas insufficient periodicity (<0.45) systematically introduces frame ambiguity artifacts that compromise translational measurements.
Loss of reading frame synchronization leads to misassignment of ribosome `A/P/E` sites, generating false-positive pause sites from out-of-frame read aggregation.
1. Explanation of `rpf_Periodicity`
```bash
$ rpf_Periodicity -h
Draw the periodicity plot.
Step1: Checking the input Arguments.
usage: rpf_Periodicity [-h] -r RPF -o OUTPUT [-t TRANSCRIPT] [-m MIN] [--tis TIS] [--tts TTS]
This script is used to draw the periodicity plot.
options:
-h, --help show this help message and exit
-m MIN retain transcript with more than minimum RPFs. (default: 50).
--tis TIS The number of codons after TIS will be discarded.. (default: 0 AA).
--tts TTS The number of codons before TTS will be discarded.. (default: 0 AA).
Required arguments:
-r RPF the name of input RPFs file in TXT format.
-o OUTPUT the prefix of output file.
-t TRANSCRIPT the name of input transcript filein TXT format.
```
2. Check the tri-nucleotide periodicity of Ribo-seq
```bash
$ cd ./06.periodicity/
#################################################
# check the periodicity
rpf_Periodicity \
-r ../05.merge/RIBO_merged.txt \
-m 30 --tis 0 --tts 0 -o RIBO &> RIBO.log
cd ..
```
3. Results of `rpf_Periodicity`
```bash
RIBO_count_periodicity_plot.pdf # Barplot of read counts showing 3-nucleotide periodicity
RIBO_count_periodicity_plot.png # Barplot of read counts showing 3-nucleotide periodicity
RIBO.log # Log file of program execution
RIBO_periodicity.txt # Statistical values of 3-nucleotide periodicity of all samples
RIBO_ratio_periodicity_plot.pdf # Barplot of read ratios showing 3-nucleotide periodicity
RIBO_ratio_periodicity_plot.png # Barplot of read ratios showing 3-nucleotide periodicity
```
### 5.7 `Meta-gene` analysis
Our `Meta-gene` analysis framework enables systematic investigation of translation dynamics proximal to start and stop codons.
It entails aggregates ribosome-protected fragment (RPF) density across -15 and +60 codon windows relative to initiation/termination sites, normalized by transcript abundance.
This reveals:\
Peak/trough patterns indicative of translation initiation efficiency,\
Read accumulation gradients reflecting termination kinetics,\
Tri-Nucleotide Periodicity Quantification.
1. Explanation of `rpf_Metaplot`
```bash
$ rpf_Metaplot -h
Draw the metaplot.
Step1: Checking the input Arguments.
usage: rpf_Metaplot [-h] -t TRANSCRIPT -r RPF -o OUTPUT [-m MIN] [--utr5 UTR5] [--cds CDS] [--utr3 UTR3] [-n] [--mode {line,bar}]
This script is used to draw the meta plot.
options:
-h, --help show this help message and exit
-m MIN delete transcript with less than minimum RPFs. (default: 50).
--utr5 UTR5 the codon number in 5-utr region (default: 20 AA).
--cds CDS the codon number in cds region (default: 50 AA).
--utr3 UTR3 the codon number in 3-utr region (default: 20 AA).
-n normalize the RPFs count to RPM. (default: False).
--mode {line,bar} specify the mode of metaplot. (default: bar).
Required arguments:
-t TRANSCRIPT the name of input transcript filein TXT format.
-r RPF the name of input RPFs file in TXT format.
-o OUTPUT the prefix name of output file.
```
2. `Meta-gene` analysis of Ribo-seq
```bash
$ cd ./07.metaplot/
#################################################
# metagene analysis
rpf_Metaplot \
-t ../../../1.reference/norm/gene.norm.txt \
-r ../05.merge/RIBO_merged.txt \
-m 50 --mode bar -o RIBO &> RIBO.log
cd ..
```
3. Results of `rpf_Metaplot`
```bash
RIBO.log # Log file of program execution
RIBO_tis_tts_metaplot.txt # Metagene statistical values at start and stop codons across all samples
RIBO_SRR1944912_meta_bar_plot.pdf # Metaplot of all samples at start and stop codons
RIBO_SRR1944912_meta_bar_plot.png # Metaplot of all samples at start and stop codons
RIBO_SRR1944912_tis_tts_metaplot.txt # Metagene statistical values at start and stop codons for sample SRR1944912
```
### 5.8 Gene coverage
To systematically assess potential technical or translation biases in Ribo-seq data, we developed an analytical pipeline that quantifies genome-wide RPF coverage uniformity across gene bodies.
Genes are partitioned into fixed-length intervals (default: 10% of CDS length), with raw RPF counts per `bin` recorded.
These values undergo library-size normalization using reads per million (RPM) to account for transcript length and sequencing depth variations.
Processed data is rendered through three complementary analytical views:\
`Aggregate Density Profile`: Smoothed line plot displaying mean RPF density across all genes, highlighting global coverage trends.\
`Gene-specific Heatmap`: Matrix visualization of normalized RPF counts (log₂-transformed) ordered by gene expression levels, revealing individual gene coverage patterns.\
`Binned Coverage Distribution`: Stacked bar plot showing the percentage of genes achieving threshold coverage (≥50% of expected reads) in each interval.
1. Explanation of `rpf_Coverage`
```bash
$ rpf_Coverage -h
Draw the metagene coverage.
Step1: Checking the input Arguments.
usage: rpf_Coverage [-h] -t TRANSCRIPT -r RPF [-o OUTPUT] [-f {0,1,2,all}] [-m MIN] [-b BIN] [-n] [--thread THREAD] [--outlier] [--set {intersect,union}] [--heat] [--bar]
This script is used to draw the coverage meta plot.
options:
-h, --help show this help message and exit
-f {0,1,2,all} set the reading frame for occupancy calculation. (default: all).
-m MIN retain transcript with more than minimum RPFs. (default: 50).
-b BIN adjust the transcript to specified bins. 30 for 5'-UTRand 3'-UTR, 100 for CDS. (default: 30,100,30).
-n normalize the RPFs count to RPM. (default: False).
--thread THREAD the number of threads. (default: 1).
--outlier filter the outliers (default: False).
--set {intersect,union}
filter the gene list with 5-UTR / CDS / 3-UTR. (default: union).
--heat draw the coverage heatmap of whole gene. (default: False).
--bar draw the coverage barplot of whole gene. (default: False).
Required arguments:
-t TRANSCRIPT the name of input transcript filein TXT format.
-r RPF the name of input RPFs file in TXT format.
-o OUTPUT the prefix of output file.
```
2. Check gene density of Ribo-seq
```bash
$ cd ./08.coverage/
#################################################
# check the rpf density along with the gene body
rpf_Coverage \
-t ../../../1.reference/norm/gene.norm.txt \
-r ../05.merge/RIBO_merged.txt \
-m 50 --outlier \
-b 10,100,10 \
-n --heat \
-o RIBO &> RIBO.log
cd ..
```
3. Results of `rpf_Coverage`
```bash
RIBO.log # Log file of program execution
RIBO_SRR1944912_10_150_10_coverage.txt # Statistical results of RPF density distribution across genes
RIBO_SRR1944912_10_150_10_heat_plot.png # Heatmap of RPF density distribution across genes
RIBO_SRR1944912_coverage_bar_plot.pdf # Percentage barplot of RPF density distribution across genes
RIBO_SRR1944912_coverage_bar_plot.png # Percentage barplot of RPF density distribution across genes
RIBO_SRR1944912_coverage_line_plot.pdf # Percentage lineplot of RPF density distribution across genes
RIBO_SRR1944912_coverage_line_plot.png # Percentage lineplot of RPF density distribution across genes
```
### 5.9 Check the repeatability of samples
Our analytical pipeline provides a hierarchical framework for assessing sample reproducibility in ribosome profiling studies, implementing dual-level correlation analyses.
`Gene-Level Reproducibility`:\
Quantifies ribosome-protected fragments (RPFs) across entire gene bodies, followed by inter-sample Pearson correlation coefficient calculation.
This approach evaluates global translation consistency, particularly suitable for highly expressed genes with robust ribosome coverage.
`ORF-Level Reproducibility`:\
Performs nucleotide-resolution quantification of RPFs within individual open reading frames (ORFs), then computes Pearson correlations between replicate samples.
This finer-grained analysis detects localized translational variations while maintaining phase-awareness through in-frame read filtering.
1. Explanation of `rpf_Corr`
```bash
$ rpf_Corr -h
Draw the correlation of samples.
Step1: Checking the input Arguments.
usage: rpf_Corr [-h] -r RPF -o OUTPUT
This script is used to draw the correlation of rpf density.
options:
-h, --help show this help message and exit
Required arguments:
-r RPF the name of input RPFs file in TXT format.
-o OUTPUT the prefix of output file. (prefix + _rpf_merged.txt)
```
2. Check the repeatability of Ribo-seq
```bash
$ cd ./09.correlation/
#################################################
# calculate the samples replication of Ribo-seq
rpf_Corr \
-r ../05.merge/RIBO_merged.txt \
-o RIBO &> RIBO.log
```
3. Results of `rpf_Corr`
```bash
RIBO_gene_corr_f0.txt # Pearson correlation of total RPFs on gene frame 0
RIBO_gene_corr_f1.txt # Pearson correlation of total RPFs on gene frame 1
RIBO_gene_corr_f2.txt # Pearson correlation of total RPFs on gene frame 2
RIBO_gene_corr_frame.txt # Pearson correlation of RPFs across all gene frames
RIBO_gene_correlation_plot.pdf # Heatmap of Pearson correlation of RPFs across all gene frames
RIBO_gene_correlation_plot.png # Heatmap of Pearson correlation of RPFs across all gene frames
RIBO.log # Log file of program execution
RIBO_rpf_correlation_plot.pdf # Heatmap of Pearson correlation of RPFs on gene frame 0
RIBO_rpf_correlation_plot.png # Heatmap of Pearson correlation of RPFs on gene frame 0
RIBO_rpf_corr_f0.txt # Pearson correlation between RPFs on gene frame 0
RIBO_rpf_corr_f1.txt # Pearson correlation between RPFs on gene frame 1
RIBO_rpf_corr_f2.txt # Pearson correlation between RPFs on gene frame 2
RIBO_rpf_corr_frame.txt # Pearson correlation between RPFs across all gene frames
```
### 5.10 Quantification of gene expression and translation levels
Ribo-seq quantification fundamentally differs from RNA-seq approaches.
While RNA-seq quantifies expression via read coverage across transcripts, Ribo-seq specifically measures ribosome occupancy density within coding sequences (CDS).
To mitigate artifacts from transitional translation states, standard analytical pipelines exclude the first 15 codons downstream of start codons and the last 5 codons upstream of stop codons.
For enhanced precision, rigorous filtering protocols may be implemented to retain only in-frame ribosome-protected fragments (RPFs),
systematically removing out-of-frame reads that likely represent stochastic noise rather than true elongation events.
1. Explanation of `rpf_Quant`
```bash
$ rpf_Quant -h
Quantify the RPFs in the different region.
Step1: Checking the input Arguments.
usage: rpf_Quant [-h] -r RPF -o OUTPUT [-f {0,1,2,all}] [--tis TIS] [--tts TTS] [--utr5] [--utr3]
This script is used to quantify RPFs in the CDS region.
options:
-h, --help show this help message and exit
-f {0,1,2,all} set the reading frame for occupancy calculation. (default: all).
--tis TIS The number of codons after TIS will be discarded. (default: 0).
--tts TTS The number of codons before TES will be discarded. (default: 0).
--utr5 quantification of 5'-utr. (default: False).
--utr3 quantification of 3'-utr. (default: False).
Required arguments:
-r RPF the name of input RPFs file in TXT format.
-o OUTPUT the prefix of output file. (default: prefix + _rpf_quant.txt)
```
2. Quantification of Ribo-seq
```bash
$ cd ./10.quantification/
#################################################
# quantify the gene expression
rpf_Quant \
-r ../05.merge/RIBO_merged.txt \
--tis 15 \
--tts 5 \
-o RIBO &> RIBO.log
cd ..
```
3. Results of `rpf_Quant`
```bash
RIBO_cds_rpm_bar_plot.pdf # Barplot of RPFs distribution in the gene CDS region
RIBO_cds_rpm_cdf_plot.pdf # Cumulative distribution plot of RPFs in the gene CDS region
RIBO_cds_rpm_heatmap.pdf # Expression heatmap of RPFs in the gene CDS region
RIBO_cds_rpm_pca_plot.pdf # Principal component analysis (PCA) plot of RPFs in the gene CDS region
RIBO_cds_rpm_pca.txt # Principal component analysis (PCA) results of RPFs in the gene CDS region
RIBO_cds_rpf_quant.txt # Statistical results of RPFs in the gene CDS region
RIBO_cds_rpkm_quant.txt # Statistical results of RPKM in the gene CDS region
RIBO_cds_rpm_quant.txt # Statistical results of RPM in the gene CDS region
RIBO_cds_tpm_quant.txt # Statistical results of TPM in the gene CDS region
RIBO.log # Log file of program execution
RIBO_total.txt # Total RPFs across all samples
```
### 5.11 Calculate codon pausing score
Ribosome profiling studies are able to capture ribosome pausing because when a specific codon is translated more slowly on average, the increased ribosome occupancy indicates that there are more RPFs associated with that codon genome-wide.
Pause scores were calculated by normalizing per-nucleotide read counts against the gene's mean read density. For codon-level analysis, read counts across all three nucleotide positions of each codon could used for calculation. Final pause scores represent the averaged values derived from all occurrences of the target codon.
1. Explanation of `rpf_Pausing`
```bash
$ rpf_Pausing -h
Calculate the relative codon pausing score.
Step1: Checking the input Arguments.
usage: rpf_Pausing [-h] -r RPF [-l LIST] -o OUTPUT [-s {E,P,A}] [-f {0,1,2,all}] [-b BACKGROUND] [-m MIN] [--tis TIS] [--tts TTS] [-n] [--scale {zscore,minmax}] [--stop]
[--fig {none,png,pdf}] [--all]
This script is used to calculate the relative pausing score.
options:
-h, --help show this help message and exit
-s {E,P,A} set the E/P/A-site for pausing calculation. (default: P).
-f {0,1,2,all} set the reading frame for pausing calculation. (default: all).
-b BACKGROUND set the codon number before and after p-site as the background. (default: 2).
-m MIN retain transcript with more than minimum RPFs. (default: 50).
--tis TIS The number of codons after TIS will be discarded. (default: 10 AA).
--tts TTS The number of codons before TTS will be discarded. (default: 5 AA).
-n normalize the RPFs count to RPM. (default: False).
--scale {zscore,minmax}
normalize the pausing score. (default: minmax).
--stop rmove the stop codon. (default: False).
--fig {none,png,pdf} draw the rpf pausing score of each gene (it will takes a lot of time). (default: none).
--all output all pausing score of each gene. (default: False).
Required arguments:
-r RPF the name of input RPFs file in TXT format.
-l LIST the gene name list in TXT format. (default: whole).
-o OUTPUT the prefix of output file.
```
2. Calculate codon-level pausing scores in Ribo-seq data
```bash
$ cd ./11.pausing_score/
#################################################
# calculate the codon pausing score of E/P/A site
for sites in E P A
do
rpf_Pausing \
-l ../../../1.reference/norm/gene.norm.txt \
-r ../05.merge/RIBO_merged.txt \
-b 0 --stop \
-m 30 \
-s $sites \
-f 0 \
--scale minmax \
-o "$sites"_site &> "$sites"_site.log
done
cd ..
```
3. Results of `rpf_Pausing`
```bash
A_site_cds_codon_pausing_score.txt # Pausing score for each codon at the A-site in the gene CDS region
A_site_cds_pausing_score.txt # Sum of pausing score at the A-site in the gene CDS region
A_site.log # Log file of program execution
A_site_sum_codon_pausing_score.txt # Sum of pausing score for all codons at the A-site in the gene CDS region
A_site_total_pausing_heatplot.pdf # Heatmap of pausing score for all codons at the A-site in the gene CDS region
A_site_total_pausing_heatplot.png # Heatmap of pausing score for all codons at the A-site in the gene CDS region
A_site_valid_pausing_heatplot.pdf # Heatmap of pausing score for valid codons at the A-site in the gene CDS region
A_site_valid_pausing_heatplot.png # Heatmap of pausing score for valid codons at the A-site in the gene CDS region
```
### 5.12 Calculate codon occupancy
For global codon occupancy analysis,
`A(P/E)-site` codons were assigned using the established criteria,
and read counts at each codon were normalized against the average per-codon read density within their respective open reading frames (ORFs).
1. Explanation of `rpf_Occupancy`
```bash
$ rpf_Occupancy -h
Calculate the codon occupancy.
Step1: Checking the input Arguments.
usage: rpf_Occupancy [-h] -r RPF [-l LIST] -o OUTPUT [-s {E,P,A}] [-f {0,1,2,all}] [-m MIN] [-n] [--tis TIS] [--tts TTS] [--scale {zscore,minmax}] [--stop] [--all]
This script is used to draw the codon occupancy plot.
options:
-h, --help show this help message and exit
-s {E,P,A} set the E/P/A-site for occupancy calculation. (default: P).
-f {0,1,2,all} set the reading frame for occupancy calculation. (default: all).
-m MIN retain transcript with more than minimum RPFs. (default: 30).
-n normalize the RPFs count to RPM. (default: False).
--tis TIS The number of codons after TIS will be discarded. (default: 15 AA).
--tts TTS The number of codons before TTS will be discarded.. (default: 5 AA).
--scale {zscore,minmax}
normalize the occupancy. (default: minmax).
--stop rmove the stop codon. (default: False).
--all output all RPFs density. (default: False).
Required arguments:
-r RPF the name of input RPFs file in TXT format.
-l LIST the gene name list in TXT format. (default: whole).
-o OUTPUT the prefix of output file.
```
2. Calculate codon-level occupancy in Ribo-seq data
```bash
$ cd ./12.codon_occupancy/
#################################################
# calculate the codon occupancy of E/P/A site
for sites in E P A
do
rpf_Occupancy \
-l ../../../1.reference/norm/gene.norm.txt \
-r ../05.merge/RIBO_merged.txt \
-m 30 \
-s "$sites" \
-f 0 --stop \
--scale minmax \
-o "$sites"_site &> "$sites"_site.log
done
cd ..
```
3. Results of `rpf_Occupancy`
```bash
A_site_codon_density.txt # Codon occupancy for each codon at the A-site in the gene CDS region
A_site_codon_occupancy.txt # Codon occupancy for all codon at the A-site in the gene CDS region
A_site.log # Log file of program execution
A_site_occupancy_corrplot.pdf # Correlation heatmap of codon occupancy at the A-site
A_site_occupancy_corrplot.png # Correlation heatmap of codon occupancy at the A-site
A_site_occupancy_corr.txt # Correlation of codon occupancy at the A-site
A_site_occupancy_heatplot.pdf # Heatmap of codon occupancy at the A-site
A_site_occupancy_heatplot.png # Heatmap of codon occupancy at the A-site
A_site_occupancy_relative_heatplot.pdf # Heatmap of relative codon occupancy at the A-site
A_site_occupancy_relative_heatplot.png # Heatmap of relative codon occupancy at the A-site
A_site_occupancy_relative_lineplot.pdf # Lineplot of relative codon occupancy at the A-site
A_site_occupancy_relative_lineplot.png # Lineplot of relative codon occupancy at the A-site
```
### 5.13 Calculate decoding time
In addition to translation elongation analysis performed at the Ribo-seq level, RNA-seq can also be used for correction to account for the influence of the mRNA's inherent state on translation.
1. Explanation of `rpf_CDT`
```bash
$ rpf_CDT -h
Calculate the codon decoding time.
Step1: Checking the input Arguments.
usage: rpf_CDT [-h] --rpf RPF --rna RNA -l LIST -o OUTPUT [-s {E,P,A}] [-f {0,1,2,all}] [-m MIN] [--tis TIS] [--tts TTS] [--scale {zscore,minmax}] [--stop]
This script is used to draw the codon decoding time plot.
options:
-h, --help show this help message and exit
-s {E,P,A} set the E/P/A-site for codon decoding time calculation. (default: P).
-f {0,1,2,all} set the reading frame for codon decoding time calculation. (default: all).
-m MIN retain transcript with more than minimum RPFs. (default: 30).
--tis TIS The number of codons after TIS will be discarded.. (default: 15 AA).
--tts TTS The number of codons before TTS will be discarded.. (default: 5 AA).
--scale {zscore,minmax}
normalize the codon decoding time. (default: minmax).
--stop rmove the stop codon. (default: False).
Required arguments:
--rpf RPF the name of input RPFs file in TXT format.
--rna RNA the name of input reads file in TXT format.
-l LIST the gene name list in TXT format. (default: whole).
-o OUTPUT the prefix of output file.
```
2. Calculate codon-level decoding time in Ribo-seq data
```bash
$ cd ./13.codon_decoding_time/
#################################################
# calculate the codon decoding time of E/P/A site
for sites in E P A
do
rpf_CDT \
-l ../../../1.reference/norm/gene.norm.txt \
--rna ../../../3.rna-seq/5.riboparser/05.merge/RNA_merged.txt \
--rpf ../05.merge/RIBO_merged.txt \
--stop \
-m 50 \
-f 0 \
-s $sites \
--tis 10 \
--tts 5 \
-o "$sites"_site &> "$sites"_site.log
done
cd ..
```
3. Results of `rpf_CDT`
```bash
A_site_cdt_corrplot.pdf # Correlation heatmap of codon decoding time at the A-site
A_site_cdt_corrplot.png # Correlation heatmap of codon decoding time at the A-site
A_site_cdt_corr.txt # Correlation of codon decoding time at the A-site
A_site_cdt_heatplot.pdf # Heatmap of codon decoding time at the A-site
A_site_cdt_heatplot.png # Heatmap of codon decoding time at the A-site
A_site_cdt.txt # Codon decoding time for all codon at the A-site in the gene CDS region
A_site.log # Log file of program execution
```
### 5.14 Calculate selection time
Codon usage is not equal across genomes, with some codons being more frequently used than others, thought to improve translational efficiency.
However, previous studies reveal that translational efficiency is optimized by a mechanism involving proportional codon usage based on tRNA concentrations.
These results provide new insights into protein translation, explain unequal codon usage, and highlight natural selection for translational efficiency. Codon selection time can be used to quantify this process.
1. Explanation of `rpf_CST`
```bash
$ rpf_CST -h
Calculate the codon decoding time.
Step1: Checking the input Arguments.
usage: rpf_CST [-h] --rpf RPF --rna RNA [-l LIST] -o OUTPUT [-s {E,P,A}] [-f {0,1,2,all}] [-m MIN] [-t TIMES] [--tis TIS] [--tts TTS] [--scale {zscore,minmax}] [--stop]
This script is used to draw the codon decoding time plot.
options:
-h, --help show this help message and exit
-s {E,P,A} set the E/P/A-site for codon decoding time calculation. (default: P).
-f {0,1,2,all} set the reading frame for codon decoding time calculation. (default: all).
-m MIN retain transcript with more than minimum RPFs. (default: 30).
-t TIMES Specify the number of iteration times required for computation. (default: 10).
--tis TIS The number of codons after TIS will be discarded. (default: 0 AA).
--tts TTS The number of codons before TTS will be discarded. (default: 0 AA).
--scale {zscore,minmax}
normalize the codon selection time. (default: minmax).
--stop rmove the stop codon. (default: False).
Required arguments:
--rpf RPF the name of input RPFs file in TXT format.
--rna RNA the name of input reads file in TXT format.
-l LIST the gene name list in TXT format. (default: whole).
-o OUTPUT the prefix of output file.
```
2. Calculate codon-level selection time in Ribo-seq data
```bash
$ cd ./14.codon_selection_time/
#################################################
# calculate the codon selection time of E/P/A site
for sites in E P A
do
rpf_CST \
-l ../../../1.reference/norm/gene.norm.txt \
--rna ../../../3.rna-seq/5.riboparser/05.merge/RNA_merged.txt \
--rpf ../05.merge/RIBO_merged.txt \
--stop \
-m 50 \
-f 0 \
-s $sites \
--tis 10 \
--tts 5 \
-o "$sites"_site &> "$sites"_site.log
done
cd ..
```
3. Results of `rpf_CST`
```bash
A_site_codon_selection_time.txt # Codon selection time for all codon at the A-site in the gene CDS region
A_site_cst_corrplot.pdf # Correlation heatmap of codon selection time at the A-site
A_site_cst_corrplot.png # Correlation heatmap of codon selection time at the A-site
A_site_cst_corr.txt # Correlation of codon selection time at the A-site
A_site_cst_heatplot.pdf # Heatmap of codon selection time at the A-site
A_site_cst_heatplot.png # Heatmap of codon selection time at the A-site
A_site_iterative_codon_selection_time.txt # Codon selection time for all codon at the A-site in the gene CDS region
A_site.log # Log file of program execution
```
### 5.15 Calculate Coefficient of Variation
While codon-level analysis can indicates alterations in translation elongation, gene-level validation is complicated by variable ribosome profiling coverage due to gene expression dynamics.
Low-coverage genes may artifactually appear to exhibit more translational pausing due to noise inversely scaling with coverage.
To resolve this, Peter et. al developed a gene-level analytical method that explicitly models noise dependence on coverage. They fit the following two-parameter model to the data to accommodate a variety of statistical behaviors for counting noise:
$$log_2(CV) = \frac{1}{2} log_2 (\frac{β}{μ} + α)$$
where CV is the coefficient of variation in the ribosome profile of a given gene,
μ is mean coverage (RPF reads percodon), and α and β are fitting parameters.
Importantly, when α = 0 and β = 1, Equation results from a Poisson distribution,
whereas α > 0 and β = 1 indicates a negative binomial distribution.
1. Explanation of `rpf_CoV`
```bash
$ rpf_CoV -h
Calculate CoV in the CDS region.
Step1: Checking the input Arguments.
usage: rpf_CoV [-h] -r RPF [-g GROUP] [-l LIST] -o OUTPUT [-f {0,1,2,all}] [-m MIN] [-n] [--tis TIS] [--tts TTS] [--fig]
This script is used to calculate CoV in the CDS region.
options:
-h, --help show this help message and exit
-f {0,1,2,all} set the reading frame for occupancy calculation. (default: all).
-m MIN retain transcript with more than minimum RPFs. (default: 5).
-n normalize the RPFs count to RPM. (default: False).
--tis TIS The number of codons after TIS will be discarded. (default: 15).
--tts TTS The number of codons before TES will be discarded. (default: 5).
--fig show the figure. (default: False).
Required arguments:
-r RPF the name of input RPFs file in TXT format.
-g GROUP specify the list of sample group. (default: None)
-l LIST the gene name list in TXT format. (default: whole).
-o OUTPUT the prefix of output file. (prefix + _Cov.txt)
The group file needs to contain at least two column:
+----------+---------+
| name | group |
+==========+=========+
| wt1 | wt |
| wt2 | wt |
| treat1 | treat |
| treat2 | treat |
| ko1 | ko |
| ko2 | ko |
+----------+---------+
```
2. Calculate gene coefficient of variation in Ribo-seq data
```bash
$ cd ./15.coefficient_of_variation/
#################################################
# Here we can configure the design file to calculate differences between different groups.
$ cat design.txt
Name Group
WT_ribo_YPD1 WT_ribo_YPD
WT_ribo_YPD2 WT_ribo_YPD
WT_ribo_YPD3 WT_ribo_YPD
ncs2d_ribo_YPD1 ncs2d_ribo_YPD
ncs2d_ribo_YPD2 ncs2d_ribo_YPD
ncs2d_ribo_YPD3 ncs2d_ribo_YPD
elp6d_ribo_YPD1 elp6d_ribo_YPD
elp6d_ribo_YPD2 elp6d_ribo_YPD
elp6d_ribo_YPD3 elp6d_ribo_YPD
ncs2d_elp6d_ribo_YPD1 ncs2d_elp6d_ribo_YPD
ncs2d_elp6d_ribo_YPD2 ncs2d_elp6d_ribo_YPD
ncs2d_elp6d_ribo_YPD3 ncs2d_elp6d_ribo_YPD
#################################################
# calculate the coefficient of variation
rpf_CoV \
-l ../../../1.reference/norm/gene.norm.txt \
-r ../05.merge/RIBO_merged.txt \
-f 0 \
-m 30 \
--tis 10 \
--tts 5 \
--fig \
-g design.txt \
-o RIBO &> RIBO.log
cd ..
```
3. Results of `rpf_CoV`
```bash
gene_compared_CoV.txt # Compared gene coefficient of variation
gene_CoV.txt # Gene coefficient of variation
gene.log # Log file of program execution
gene_WT_ribo_YPD_vs_ncs2d_ribo_YPD_CoV_fitplot.pdf # Fitted lineplot of gene coefficient of variation
gene_WT_ribo_YPD_vs_ncs2d_ribo_YPD_CoV_fitplot.png # Fitted lineplot of gene coefficient of variation
```
### 5.16 Meta-codon analysis
To demonstrate the pausing patterns of different codons, a meta plot is utilized here to display the RPF density within a 20-nucleotide window surrounding the specified codon.
The program also supports the visualization of different frames and incorporates a smoothing function to mitigate spike signals caused by data instability.
1. Explanation of `rpf_Meta_Codon`
```bash
$ rpf_Meta_Codon -h
Draw the meta-codon plot.
Step1: Checking the input Arguments.
usage: rpf_Meta_Codon [-h] [-l LIST] -r RPF [-c CODON] -o OUTPUT [-f {0,1,2}] [-a AROUND] [-m MIN] [--tis TIS] [--tts TTS] [-n] [-u] [-s] [--smooth SMOOTH] [--thread THREAD] [--fig]
This script is used to draw the meta codon plot.
options:
-h, --help show this help message and exit
-f {0,1,2} set the reading frame for occupancy calculation. (default: all).
-a AROUND retrieve length of codon upstream and downstream. (default: 20).
-m MIN retain transcript with more than minimum RPFs. (default: 50).
--tis TIS The number of codons after TIS will be discarded.. (default: 0 AA).
--tts TTS The number of codons before TTS will be discarded.. (default: 0 AA).
-n normalize the RPFs count to RPM. (default: False).
-u delete the cross repetition codon in different window. (default: False).
-s scale the window density with gene density. (default: False).
--smooth SMOOTH smooth the window density [eg, 3,1]. (default: None).
--thread THREAD the number of threads (default: 1).
--fig output the figure. (default: False).
Required arguments:
-l LIST the gene name list in TXT format. (default: whole).
-r RPF the name of input RPFs file in TXT format.
-c CODON the codon list in TXT format.
-o OUTPUT the prefix of output file.
```
2. Calculate meta-codon density in Ribo-seq data
```bash
$ cd ./16.meta_codon/
#################################################
# Here we can configure the codon list.
$ cat codon_list.txt
AAA
AAC
AAG
AAT
AAGAAG
ATGATG
CCCGGG
...
#################################################
# codon meta analysis
rpf_Meta_Codon \
-r ../05.merge/RIBO_merged.txt \
-m 50 -f 0 \
-c codon_list.txt \
-a 15 -u -n \
-o RIBO &> RIBO.log
cd ..
```
3. Results of `rpf_Meta_Codon`
```bash
RIBO.log # Log file of program execution
RIBO_AAA_97591_8146_meta_density.txt # RPFs density of AAA codon
RIBO_AAA_97591_8146_meta_sequence.txt # Upstream and downstream sequence around AAA codon
RIBO_AAA.pdf # Metaplot of AAA codon
RIBO_AAA.png # Metaplot of AAA codon
```
## 6. Other toolkits
### 6.1 Data shuffling
Some analysis processes require randomly assigned data for control,
so a step is added here to reshuffling the RPFs density file.
1. Explanation of `rpf_Shuffle`
```bash
$ rpf_Shuffle -h
Shuffle the RPFs data.
Step1: Checking the input Arguments.
usage: rpf_Shuffle [-h] -r RPF -o OUTPUT [-l LIST] [-s SEED] [-i]
This script is used to shuffle the RPFs data.
options:
-h, --help show this help message and exit
-l LIST the list of input genes for transcript id.
-s SEED the random seed for shuffle. (default: 0).
-i shuffle the RPFs data for each samples. (default: False).
Required arguments:
-r RPF the name of input RPFs density file in TXT format.
-o OUTPUT the prefix of output file. (prefix + _shuffle.txt)
```
2. Shuffle gene density values in Ribo-seq data.
```bash
$ cd
$ cd ./sce/4.ribo-seq/5.riboparser/17.shuffle/
#################################################
# codon meta analysis
rpf_Shuffle \
-l ../../../1.reference/norm/gene.norm.txt \
-r ../05.merge/RIBO_merged.txt \
-s 0 \
-i \
-o RIBO &> RIBO.log
```
3. Shuffle gene density values in RNA-seq data
```bash
$ cd
$ cd ./sce/3.rna-seq/5.riboparser/10.shuffle/
#################################################
# retrieve and format the gene density
rpf_Shuffle \
-l ../../../1.reference/norm/gene.norm.txt \
-r ../05.merge/RNA_merged.txt \
-s 0 \
-i \
-o RNA &> RNA.log
```
4. Results of `rpf_Shuffle`
```bash
# results of ribo-seq
RIBO.log # Log file of program execution
RIBO_shuffle.txt # Shuffled RPFs density file
```
### 6.2 Retrieve and format the gene density
In many cases, it is necessary to perform some additional operations on the gene set in the RPFs density file,
such as filtering, RPM standardization, long and width data format conversion, etc.
A specialized tool is provided here for these operations.
1. Explanation of `rpf_Retrieve`
```bash
$ rpf_Retrieve -h
Retrieve the RPFs with gene list.
Step1: Checking the input Arguments.
usage: rpf_Retrieve [-h] -r RPF [-o OUTPUT] [-l LIST] [-m MIN] [-n] [-f] [-s]
This script is used to retrieve density files.
options:
-h, --help show this help message and exit
-l LIST the list of input genes for transcript id.
-m MIN retain transcript with more than minimum RPFs (default: 0).
-n normalize the RPFs count to RPM (default: False).
-f melt three column data of each sample to one column (default: False).
-s split gene rpf to each TXT file (default: False).
Required arguments:
-r RPF the name of input RPFs density file in TXT format.
-o OUTPUT prefix of output file name (default: filename + '_retrieve.txt'.
```
2. Extract and format gene density from Ribo-seq data
```bash
$ cd
$ cd ./sce/4.ribo-seq/5.riboparser/18.retrieve/
#################################################
# retrieve and format the gene density with gene covered more than 50 reads
rpf_Retrieve \
-l ../../../1.reference/norm/gene.norm.txt \
-r ../05.merge/RIBO_merged.txt \
-m 0 \
-f \
-n \
-o RIBO &> RIBO.log
cd ..
```
3. Extract and format gene density from RNA-seq data
```bash
$ cd
$ cd ./sce/3.rna-seq/5.riboparser/11.retrieve/
#################################################
# retrieve and format the gene density with gene covered more than 50 reads
rpf_Retrieve \
-l ../../../1.reference/norm/gene.norm.txt \
-r ../05.merge/RNA_merged.txt \
-m 50 \
-f \
-n \
-o RNA &> RNA.log
cd ..
```
4. Results of `rpf_Retrieve`
```bash
# results of ribo-seq
RIBO.log
RIBO_retrieve.txt
```
### 6.3 Filter the frame shifting genes
A frameshift in translation occurs when the ribosome shifts by one or more nucleotides in the mRNA sequence, causing a misreading of the codons.
This results in a completely altered amino acid sequence downstream of the shift,
often leading to premature termination or a nonfunctional protein.
Frameshifts can occur naturally due to mutations or during translation errors, and they can significantly impact protein function.
Whenever a stable frameshift occurs, we can detect ribosome occupancy in different reading frames from the Ribo-seq data.
1. Explanation of `rpf_Shift`
```bash
$ rpf_Shift -h
Draw the frame shifting plot.
Step1: Checking the input Arguments.
usage: rpf_Shift.py [-h] -r RPF -o OUTPUT [-t TRANSCRIPT] [-p PERIOD] [-m MIN] [--tis TIS] [--tts TTS]
This script is used to draw the frame shift plot.
options:
-h, --help show this help message and exit
-p PERIOD the minimum in-frame value for frame shifting screen, range [0 - 1]. (default: 0.45).
-m MIN retain transcript with more than minimum RPFs. (default: 50).
--tis TIS the number of codons after TIS will be discarded.. (default: 0 AA).
--tts TTS the number of codons before TTS will be discarded.. (default: 0 AA).
Required arguments:
-r RPF the name of input RPFs file in TXT format.
-o OUTPUT the prefix of output file.
-t TRANSCRIPT the name of input transcript filein TXT format.
```
2. Filter the frame shifting genes
```bash
$ cd
$ cd ./sce/4.ribo-seq/5.riboparser/19.frame_shift/
#################################################
# filter the frame shifting genes
rpf_Shift \
-t ../../../1.reference/norm/gene.norm.txt \
-r ../05.merge/RIBO_merged.txt \
--tis 5 --tts 5 \
-m 50 \
-p 45 \
-f \
-n \
-o RIBO &> RIBO.log
cd ..
```
4. Results of `rpf_Shift`
```bash
# results of ribo-seq
RIBO.log
RIBO_gene_frame_shift_count_plot.pdf
RIBO_gene_frame_shift_count_plot.png
RIBO_gene_frame_shift_count.txt
RIBO_gene_periodicity.txt
RIBO_SRR1944912_gene_frame_shift.txt
```
## 7. Contribution
Thanks for all the open source tools used in the process.
Thanks to Nedialkova DD and Leidel SA for providing the excellent dataset.
Contribute to our open-source project by submitting questions and code.
Contact `rensc0718@163.com` for more information.
## 8. License
GPL License.
Raw data
{
"_id": null,
"home_page": "https://github.com/renscq/RiboParser",
"name": "RiboParser",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.12",
"maintainer_email": null,
"keywords": null,
"author": "Ren Shuchao",
"author_email": "rensc0718@163.com",
"download_url": "https://files.pythonhosted.org/packages/6b/6a/38b3429c02d9b2cb744c55ab7371dddd0831e3a5f282a153a870a884e807/riboparser-0.1.7.4.tar.gz",
"platform": "linux-x86_64",
"description": "<!--\n * @Author: 'rensc' 'rensc0718@163.com'\n * @Date: 2024-10-15 11:44:58\n * @LastEditors: 'rensc' 'rensc0718@163.com'\n * @LastEditTime: 2025-02-18 03:50\n * @FilePath: \\RiboParser\\README.md \n * \n-->\n\n# RiboParser\n\nTo streamline understanding and application, we will analyze publicly accessible project data, breaking down each analytical step to illustrate the complete workflow.\n\nThis process encompasses both general analysis steps and specialized analysis and visualization techniques, facilitated by `RiboParser` and `RiboShiny`.\n\nThe specific steps involved are:\n\n1. Software installation\n2. Reference file creation\n3. Raw data download\n4. Raw data cleaning\n5. Data alignment\n6. Sequencing quality analysis\n7. Gene-level analysis\n8. Codon-level analysis\n\nThe results of this data analysis can be further analyzed and visualized in `RiboShiny`.\n\n\n## 1. Software configuration\n\n### 1.1 create the environment with `conda`\n\n```bash\nconda create -n ribo\nconda activate ribo\n```\n\n### 1.2 Install software dependencies using `conda`\n\n```bash\nconda install cutadapt -c bioconda\nconda install bowtie -c bioconda\nconda install samtools -c bioconda\nconda install star -c bioconda\nconda install bedtools -c bioconda\nconda install subread -c bioconda\nconda install rsem -c bioconda\nconda install gffread -c bioconda\nconda install sra-tools -c bioconda\nconda install ucsc-genepredtogtf -c bioconda\nconda install ucsc-gtftogenepred -c bioconda\nconda install ucsc-gff3togenepred -c bioconda\nconda install ucsc-bedgraphtobigwig -c bioconda\nconda install ucsc-bedsort -c bioconda\nconda install pigz -c conda-forge\n```\n\nor install the packages in a single command.\n\n```bash\nconda install cutadapt bowtie samtools star bedtools subread rsem gffread sra-tools \\\n ucsc-genepredtogtf ucsc-gtftogenepred ucsc-gff3togenepred ucsc-bedgraphtobigwig ucsc-bedsort \\\n -c bioconda\n\nconda install pigz -c conda-forge\n```\n\n\n### 1.3 `pip` install `RiboParser`\n\nWhen the server is connected to the network, we can use `pip` to install software directly. \n\n```bash\npip install riboparser\n```\n\nAlternatively, we can download the version from GitHub, re-setup and then install it.\n\n```bash\ncd RiboParser\n\npython3 setup.py sdist bdist_wheel\npip install dist/RiboParser-0.1.6.1-py3-none-any.whl\n```\n\n### 1.4 run the test\nTest software for dependency, installation, and operation issues.\n\n```bash\nrpf_Check -h\nrpf_CST -h\n```\n\n## 2. Prepare reference files\n\n### 2.1 An example of the complete project directory is shown below\n\nThe complete data analysis includes reference preparation, rawdata, RNA-seq data analysis, and Ribo-seq data analysis.\n\n```bash\n$ cd && cd ./sce/\n$ tree -d\n\n.\n\u251c\u2500\u2500 1.reference\n\u2502\u00a0\u00a0 \u251c\u2500\u2500 genome\n\u2502\u00a0\u00a0 \u251c\u2500\u2500 mrna\n\u2502\u00a0\u00a0 \u251c\u2500\u2500 ncrna\n\u2502\u00a0\u00a0 \u251c\u2500\u2500 norm\n\u2502\u00a0\u00a0 \u251c\u2500\u2500 rrna\n\u2502\u00a0\u00a0 \u251c\u2500\u2500 rsem-index\n\u2502\u00a0\u00a0 \u251c\u2500\u2500 star-index\n\u2502\u00a0\u00a0 \u2514\u2500\u2500 trna\n\u251c\u2500\u2500 2.rawdata\n\u2502\u00a0\u00a0 \u251c\u2500\u2500 ribo-seq\n\u2502\u00a0\u00a0 \u2514\u2500\u2500 rna-seq\n\u251c\u2500\u2500 3.rna-seq\n\u2502\u00a0\u00a0 \u251c\u2500\u2500 1.cleandata\n\u2502\u00a0\u00a0 \u251c\u2500\u2500 2.bowtie\n\u2502\u00a0\u00a0 \u251c\u2500\u2500 3.star\n\u2502\u00a0\u00a0 \u251c\u2500\u2500 4.quantification\n\u2502\u00a0\u00a0 \u2514\u2500\u2500 5.riboparser\n\u2502\u00a0\u00a0 \u251c\u2500\u2500 01.qc\n\u2502\u00a0\u00a0 \u251c\u2500\u2500 02.digestion\n\u2502\u00a0\u00a0 \u251c\u2500\u2500 03.offset\n\u2502\u00a0\u00a0 \u251c\u2500\u2500 04.density\n\u2502\u00a0\u00a0 \u251c\u2500\u2500 05.merge\n\u2502\u00a0\u00a0 \u251c\u2500\u2500 06.periodicity\n\u2502\u00a0\u00a0 \u251c\u2500\u2500 07.metaplot\n\u2502\u00a0\u00a0 \u251c\u2500\u2500 08.coverage\n\u2502\u00a0\u00a0 \u251c\u2500\u2500 09.correlation\n\u2502\u00a0\u00a0 \u251c\u2500\u2500 10.shuffle\n\u2502\u00a0\u00a0 \u2514\u2500\u2500 11.retrieve\n\u2514\u2500\u2500 4.ribo-seq\n \u251c\u2500\u2500 1.cleandata\n \u251c\u2500\u2500 2.bowtie\n \u251c\u2500\u2500 3.star\n \u251c\u2500\u2500 4.quantification\n \u2514\u2500\u2500 5.riboparser\n \u251c\u2500\u2500 01.qc\n \u251c\u2500\u2500 02.digestion\n \u251c\u2500\u2500 03.offset\n \u251c\u2500\u2500 04.density\n \u251c\u2500\u2500 05.merge\n \u251c\u2500\u2500 06.periodicity\n \u251c\u2500\u2500 07.metaplot\n \u251c\u2500\u2500 08.coverage\n \u251c\u2500\u2500 09.correlation\n \u251c\u2500\u2500 10.quantification\n \u251c\u2500\u2500 11.pausing_score\n \u251c\u2500\u2500 12.codon_occupancy\n \u251c\u2500\u2500 13.codon_decoding_time\n \u251c\u2500\u2500 14.codon_selection_time\n \u251c\u2500\u2500 15.coefficient_of_variation\n \u251c\u2500\u2500 16.meta_codon\n \u251c\u2500\u2500 17.shuffle\n \u251c\u2500\u2500 18.retrieve\n \u2514\u2500\u2500 19.frame_shift\n```\n\n### 2.2 Prepare the reference genome index\n#### 2.2.1 Create the directory\n\nCreate folders to hold different types of reference sequence files.\n\n```bash\n$ mkdir -p ./sce/1.reference/\n$ cd ./sce/1.reference/\n$ mkdir cdna genome gtf mrna ncrna rrna trna norm rsem-index\n```\n\n#### 2.2.2 Download reference files from NCBI\n\nUse the most common data analysis file format, the genome sequence in fasta format, and the reference file in GTF or GFF3 format.\n\n```bash\n# genome sequence\n$ wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/146/045/GCF_000146045.2_R64/GCF_000146045.2_R64_genomic.fna.gz\n\n# GTF or GFF3\n$ wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/146/045/GCF_000146045.2_R64/GCF_000146045.2_R64_genomic.gtf.gz\n$ wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/146/045/GCF_000146045.2_R64/GCF_000146045.2_R64_genomic.gff.gz\n\n# cDNA sequence\n$ wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/146/045/GCF_000146045.2_R64/GCF_000146045.2_R64_rna.fna.gz\n\n# feature table\n$ wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/146/045/GCF_000146045.2_R64/GCF_000146045.2_R64_feature_table.txt.gz\n\n# decompression\n$ gunzip *.gz\n\n$ gffread -g GCF_000146045.2_R64_genomic.fna GCF_000146045.2_R64_genomic.gff -F -w cdna.fa\n```\n\n#### 2.2.3 Create the `genome` index using `bowtie`\n\n```bash\n$ bowtie-build ../GCF_000146045.2_R64_genomic.fna ./genome/genome --threads 12 &>> ./genome/genome_build.log\n```\n\n#### 2.2.4 Create an `mRNA` index using `bowtie`\n\nA custom script is used here to extract the corresponding sequence information \nfrom the `fasta` file based on the sequence name.\n\n```bash\n$ retrieve_seq -h\n\nusage: retrieve_seq [-h] [-v] -i INPUT -n NAME [-u UNMAPPED] -o OUTPUT\n\nThis script is used to retrieve the fasta sequence by name.\n\noptions:\n -h, --help show this help message and exit\n -v, --version show programs version number and exit\n -i INPUT input the fasta file\n -n NAME gene ids in txt format\n -u UNMAPPED output the unmapped gene ids\n -o OUTPUT prefix of output file name (default results_peaks.txt)\n```\n\n\n```bash\n# filter the mrna sequence\n$ grep -i 'gbkey=mRNA' ./cdna.fa | cut -d ' ' -f 1 | cut -c 2- > ./mrna/mrna.ids\n$ retrieve_seq -i ./cdna.fa -n ./mrna/mrna.ids -o ./mrna/mrna.fa &>> ./mrna/mrna_build.log\n# build the mrna index\n$ bowtie-build ./mrna/mrna.fa ./mrna/mrna --threads 12 &>> ./mrna/mrna_build.log\n```\n\n#### 2.2.5 Create an `rRNA` index using `bowtie`\n```bash\n# filter the rrna sequence\n$ grep -i 'gbkey=rRNA' ./cdna.fa | cut -d ' ' -f 1 | cut -c 2- > ./rrna/rrna.ids\n$ retrieve_seq -i ./cdna.fa -n ./rrna/rrna.ids -o ./rrna/rrna.fa &>> ./rrna/rrna_build.log\n# build the rrna index\n$ bowtie-build ./rrna/rrna.fa ./rrna/rrna --threads 12 &>> ./rrna/rrna_build.log\n```\n\n#### 2.2.6 Create an `tRNA` index using `bowtie`\n```bash\n# filter the trna sequence\n$ grep -i 'gbkey=tRNA' ./cdna.fa | cut -d ' ' -f 1 | cut -c 2- > ./trna/trna.ids\n$ retrieve_seq -i ./cdna.fa -n ./trna/trna.ids -o ./trna/trna.fa &>> ./trna/trna_build.log\n# build the trna index\n$ bowtie-build ./trna/trna.fa ./trna/trna --threads 12 &>> ./trna/trna_build.log\n```\n\n\n#### 2.2.7 Create an `ncRNA` index using `bowtie`\n```bash\n# filter the ncrna sequence\n$ grep -iE 'gbkey=ncRNA|gbkey=lnc_RNA|gbkey=miRNA|gbkey=snoRNA|gbkey=snRNA|gbkey=misc_RNA' ./cdna.fa | cut -d ' ' -f 1 | cut -c 2- > ./ncrna/ncrna.ids\n$ retrieve_seq -i ./cdna.fa -n ./ncrna/ncrna.ids -o ./ncrna/ncrna.fa &>> ./ncrna/ncrna_build.log\n# build the ncrna index\n$ bowtie-build ./ncrna/ncrna.fa ./ncrna/ncrna --threads 12 &>> ./ncrna/ncrna_build.log\n```\n\n#### 2.2.8 Standardized `gtf` or `gff3` files\n\n- Explanation of `rpf_Reference` \n\n```bash\n$ rpf_Reference -h\n\nusage: rpf_Reference [-h] -g GENOME -t GTF -o OUTPUT [-u UTR] [-c] [-l] [-w]\n\nThis script is used to build the references for the RiboParser.\n\noptions:\n -h, --help show this help message and exit\n -u UTR add the pseudo UTR to the leaderless transcripts (default: 0 nt).\n -c only retain the protein coding transcripts (default: False).\n -l only retain the longest protein coding transcripts, it is recommended to select \n the longest transcript for subsequent analysis. (default: False).\n -w output whole message (default: False).\n\nRequired arguments:\n -g GENOME the input file name of genome sequence\n -t GTF the input file name of gtf file\n -o OUTPUT the prefix of output file. (prefix + _norm.gtf)\n```\n\n- create the references from GTF and Genome fasta\n\n```bash\n$ rpf_Reference \\\n -g ../GCF_000146045.2_R64_genomic.fna \\\n -t ../GCF_000146045.2_R64_genomic.gff \\\n -u 30 -o ./norm/gene &>> ./norm/norm_build.log\n```\n\n#### 2.2.9 Create a `genome` index using `STAR`\n\n```bash\n$ STAR \\\n --genomeSAindexNbases 11 \\\n --runThreadN 12 \\\n --runMode genomeGenerate \\\n --genomeDir ./star-index \\\n --genomeFastaFiles GCF_000146045.2_R64_genomic.fna \\\n --sjdbGTFfile ./norm/gene.norm.gtf\n```\n\n#### 2.2.10 Create a `transcriptome` index using `rsem`\n\n```bash\n$ rsem-prepare-reference \\\n -p 12 \\\n --gtf ../norm/gene.norm.gtf ../GCF_000146045.2_R64_genomic.fna ./rsem-index/rsem\n```\n\n\n## 3. Data preprocessing and alignment\n\nIn order to introduce the analysis process and usage of `RiboParser`, RNA-seq and Ribo-seq data from dataset `GSE67387` are used as examples here.\n\n```shell\n# dataset\nhttps://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE67387\n\n# reference\nNedialkova DD, Leidel SA. Optimization of Codon Translation Rates via tRNA Modifications Maintains Proteome Integrity. Cell 2015 Jun 18;161(7):1606-18. \nPMID: 26052047\n```\n\n\n### 3.1 Fundamental analysis of RNA-seq data in `GSE67387` dataset\n\n#### 3.1.1 Download RNA-seq raw data\n\nUse `prefetch` in `sra-tools` to download the raw SRA-format data and extract it into `fastq` format files.\n\n```bash\n$ mkdir -p ./sce/2.rawdata/rna-seq/\n$ cd ./sce/2.rawdata/rna-seq/\n\n#################################################\n# download rna-seq\n$ prefetch -o SRR1944925.sra SRR1944925\n$ prefetch -o SRR1944926.sra SRR1944926\n$ prefetch -o SRR1944927.sra SRR1944927\n$ prefetch -o SRR1944928.sra SRR1944928\n$ prefetch -o SRR1944929.sra SRR1944929\n$ prefetch -o SRR1944930.sra SRR1944930\n$ prefetch -o SRR1944931.sra SRR1944931\n$ prefetch -o SRR1944932.sra SRR1944932\n$ prefetch -o SRR1944933.sra SRR1944933\n$ prefetch -o SRR1944934.sra SRR1944934\n$ prefetch -o SRR1944935.sra SRR1944935\n\n# decompression\nfor sra in *.sra\ndo\n\nfastq-dump $sra\npigz *fastq\n\ndone\n```\n\n#### 3.1.2 RNA-seq data cleaning\n\nBecause the data from the gse project is cleaned, it does not include adapter and index sequences. So the following is just to show the general steps, do not need to run.\n\n1. RNA-seq data cleaning\n\n```bash\n$ cd\n$ mkdir -p ./sce/3.rna-seq/1.cleandata/\n$ cd ./sce/3.rna-seq/1.cleandata/\n\n#################################################\n# run the cutadapt\nfor fq in ../../2.rawdata/rna-seq/*fastq.gz\ndo\ncutadapt --match-read-wildcards \\\n -a AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGC \\\n -m 25 -O 6 -j 12 \\\n -o `\\basename $fq fastq.gz`clean.fastq.gz $fq &> $fq\".log\"\n\ndone\n```\n\n#### 3.1.3 Align clean data to different types of reference files\n\nTo assess library quality and eliminate the impact of reads originating from different non-coding RNAs (ncRNAs) on subsequent analysis, we employed `bowtie` to classify the reads form sequencing data.\n\nUnder normal circumstances, especially for RNA-seq libraries constructed using the oligo(dT) method, most reads originate from mRNA. Therefore, for RNA-seq analysis, this step is generally not necessary. It is suitable for use in libraries constructed by the rRNA-depletion method.\n\n1. Align the RNA-seq data to references\n\n```bash\n$ cd\n$ mkdir -p ./sce/3.rna-seq/2.bowtie/\n$ cd ./sce/3.rna-seq/2.bowtie/\n\n#################################################\n# set database\nrrna='../../sce/1.reference/rrna/rrna'\ntrna='../../sce/1.reference/trna/trna'\nncrna='../../sce/1.reference/ncrna/ncrna'\nmrna='../../sce/1.reference/mrna/mrna'\nchrom='../../sce/1.reference/genome/genome'\n\nthreads=12\nmismatch=1\n\n# alignment reads to reference\nfor fq in ../1.cleandata/*fastq.gz\ndo\nfqname=`\\basename $fq .fastq.gz`\n\n## rrna\nbowtie -p $threads -v $mismatch --un=\"$fqname\".norrna.fq --al=\"$fqname\".rrna.fq \\\n -x $rrna $fq -S \"$fqname\".rrna.sam 2>> \"$fqname\".log\n\n## trna\nbowtie -p $threads -v $mismatch --un=\"$fqname\".notrna.fq --al=\"$fqname\".trna.fq \\\n -x $trna \"$fqname\".norrna.fq -S \"$fqname\".trna.sam 2>> \"$fqname\".log\n\n## ncrna\nbowtie -p $threads -v $mismatch --un=\"$fqname\".noncrna.fq --al=\"$fqname\".ncrna.fq \\\n -x $ncrna \"$fqname\".notrna.fq -S \"$fqname\".ncrna.sam 2>> \"$fqname\".log\n\n## mrna\nbowtie -p $threads -v $mismatch --un=\"$fqname\".nomrna.fq --al=\"$fqname\".mrna.fq \\\n -x $mrna \"$fqname\".noncrna.fq -S \"$fqname\".mrna.sam 2>> \"$fqname\".log\n\n## genome\nbowtie -p $threads -v $mismatch --un=\"$fqname\".nogenome.fq --al=\"$fqname\".genome.fq \\\n -x $chrom \"$fqname\".nomrna.fq -S \"$fqname\".genome.sam 2>> \"$fqname\".log\n\n## compress fastq\npigz *fq\n\n## compress sam\nfor sam in *.sam\ndo\n\nsamtools view -h -F 4 $sam | samtools sort -@ $threads -o `\\basename $sam sam`bam\nrm $sam\n\ndone\n\ndone\n```\n\n\n2. Statistical alignment results for all references.\n\n- Explanation of `merge_bwt_log`\n\n```bash\n$ merge_bwt_log -h\n\nStep1: Checking the input Arguments.\nusage: merge_bwt_log [-h] -l LIST [LIST ...] -o OUTPUT [-n NAME]\n\nThis script is used to statstic the mapped reads from log files\n\noptions:\n -h, --help show this help message and exit\n -n NAME, --name NAME set the name of each database (default: rRNA,tRNA,ncRNA,mRNA,Genome).\n\nRequired arguments:\n -l LIST [LIST ...], --list LIST [LIST ...]\n List for bowtie mapping log files (e.g., '*log').\n -o OUTPUT prefix of output file name.\n```\n\n- Statistical alignment results\n\n```bash\n#################################################\n# merge all log files\nmerge_bwt_log -n rRNA,tRNA,ncRNA,mRNA,Genome -l *log -o RNA_seq &>> merge_bowtie.log\n```\n\n#### 3.1.4 Aligning mRNA reads using `STAR`\n\nFollowing the removal of ncRNA reads, the remaining clean reads were realigned to the yeast genome using `STAR`.\n\n1. Aligning mRNA reads (RNA-seq) using `STAR`\n```bash\n$ cd\n$ mkdir -p ./sce/3.rna-seq/3.star/\n$ cd ./sce/3.rna-seq/3.star/\n\n#################################################\n# set the option and database\ngenome='../../1.reference/star-index/'\nthreads=12\n\n#################################################\n# map the all rna-seq reads to genome and transcriptome region\nfor fastq in ../2.bowtie/*.noncrna.fq.gz\ndo\n\n## get file name\noutput=$(basename $fastq .noncrna.fq.gz)\n\n#################################################\n## run the alignment\nSTAR --runThreadN $threads \\\n --readFilesCommand zcat \\\n --genomeDir $genome \\\n --readFilesIn $fastq \\\n --outFileNamePrefix $output \\\n --outSAMtype BAM Unsorted \\\n --outFilterType BySJout \\\n --quantMode TranscriptomeSAM GeneCounts \\\n --outReadsUnmapped Fastx \\\n --outSAMattributes All \\\n --alignEndsType Local \\\n --outFilterMultimapNmax 3 \\\n --outFilterMismatchNmax 1 \\\n --alignIntronMax 10000 \\\n --outFilterMatchNmin 20\n# --outWigType wiggle --outWigNorm RPM\n\npigz *mate1\n\n#################################################\n## sort the bam file\nsamtools sort -@ $threads $output\"Aligned.out.bam\" -o $output\"Aligned.sortedByCoord.out.bam\"\nsamtools index -@ $threads $output\"Aligned.sortedByCoord.out.bam\"\nrm $output\"Aligned.out.bam\"\n\ndone\n```\n\n#### 3.1.5 Estimating gene expression levels with either `RSEM` or `featureCounts`\n\nBoth `RSEM` and `featureCounts` can be employed to quantify gene expression levels. For the purpose of this analysis, we will utilize `RSEM` as a representative tool.\n\n1. Estimating transcript abundance using RNA-seq data\n\n```bash\n$ cd\n$ mkdir -p ./sce/3.rna-seq/4.quantification/\n$ cd ./sce/3.rna-seq/4.quantification/\n\n#################################################\n# quantify the gene expression\nfor bam in ../3.star/*Aligned.toTranscriptome.out.bam\ndo\nrsem-calculate-expression -p 10 \\\n --no-bam-output --alignments \\\n -q $bam ../../1.reference/rsem-index/rsem `\\basename $bam Aligned.toTranscriptome.out.bam`\n# rsem-calculate-expression -p 10 \\\n# --paired-end --no-bam-output --alignments \\\n# -q $bam ../../1.reference/rsem-index/rsem `\\basename $bam Aligned.toTranscriptome.out.bam`\n\ndone\n```\n\n2. Integrating RNA-seq quantification values for all samples\n\n- Explanation of `merge_rsem`\n\n```bash\n$ merge_rsem -h\n\nusage: merge_rsem [-h] -l LIST [LIST ...] -o OUTPUT [-c {expected_count,TPM,FPKM}]\n\nThis script is used to merge specified columns from result files\n\noptions:\n -h, --help show this help message and exit\n -c {expected_count,TPM,FPKM}, --column {expected_count,TPM,FPKM}\n Column name to merge (e.g., 'expected_count').\n\nRequired arguments:\n -l LIST [LIST ...], --list LIST [LIST ...]\n List for result files (e.g., '*results').\n -o OUTPUT output file name.\n```\n\n- Integrating RNA-seq quantification values\n\n```bash\n#################################################\n# merge the gene expression\nmerge_rsem -c expected_count -l *.genes.results -o gene.expected_count.txt &>> merge_rsem.log\nmerge_rsem -c TPM -l *.genes.results -o gene.TPM.txt &>> merge_rsem.log\nmerge_rsem -c FPKM -l *.genes.results -o gene.FPKM.txt &>> merge_rsem.log\n\n#################################################\n# merge the isoforms expression\nmerge_rsem -c expected_count -l *.isoforms.results -o isoforms.expected_count.txt &>> merge_rsem.log\nmerge_rsem -c TPM -l *.isoforms.results -o isoforms.TPM.txt &>> merge_rsem.log\nmerge_rsem -c FPKM -l *.isoforms.results -o isoforms.FPKM.txt &>> merge_rsem.log\n```\n\n\n### 3.2 Fundamental analysis of Ribo-seq data in `GSE67387` dataset\n\n#### 3.2.1 Download Ribo-seq raw data\n\nUse `prefetch` in `sra-tools` to download the raw SRA-format data and extract it into `fastq` format files.\n\n```bash\n$ cd\n$ mkdir -p ./sce/2.rawdata/ribo-seq/\n$ cd ./sce/2.rawdata/ribo-seq/\n\n#################################################\n# download ribo-seq\nprefetch -o SRR1944912.sra SRR1944912\nprefetch -o SRR1944913.sra SRR1944913\nprefetch -o SRR1944914.sra SRR1944914\nprefetch -o SRR1944915.sra SRR1944915\nprefetch -o SRR1944916.sra SRR1944916\nprefetch -o SRR1944917.sra SRR1944917\nprefetch -o SRR1944918.sra SRR1944918\nprefetch -o SRR1944919.sra SRR1944919\nprefetch -o SRR1944920.sra SRR1944920\nprefetch -o SRR1944921.sra SRR1944921\nprefetch -o SRR1944922.sra SRR1944922\nprefetch -o SRR1944923.sra SRR1944923\n\n# decompression\nfor sra in *.sra\ndo\n\nfastq-dump $sra\npigz *fastq\n\ndone\n```\n\n#### 3.2.2 Ribo-seq data cleaning\n\n```bash\n$ cd\n$ mkdir -p ./sce/4.ribo-seq/1.cleandata/\n$ cd ./sce/4.ribo-seq/1.cleandata/\n\n#################################################\n# run the cutadapt\nfor fq in ../../2.rawdata/ribo-seq/*fastq.gz\ndo\ncutadapt --match-read-wildcards \\\n -a AAAAAAAA \\\n -m 25 -O 6 -j 10 \\\n -o `\\basename $fq fastq.gz`clean.fastq.gz $fq &> $fq\".log\"\ndone\n```\n\n\n#### 3.2.3 Align clean data to different types of reference files\n\nTo assess library quality and eliminate the impact of reads originating from different non-coding RNAs (ncRNAs) on subsequent analysis, we employed `bowtie` to classify the reads form sequencing data.\n\n1. Aligning Ribo-seq data\n```bash\n$ cd\n$ mkdir -p ./sce/4.ribo-seq/2.bowtie/\n$ cd ./sce/4.ribo-seq/2.bowtie/\n\n#################################################\n# set database\nrrna='../../sce/1.reference/rrna/rrna'\ntrna='../../sce/1.reference/trna/trna'\nncrna='../../sce/1.reference/ncrna/ncrna'\nmrna='../../sce/1.reference/mrna/mrna'\nchrom='../../sce/1.reference/genome/genome'\n\nthreads=12\nmismatch=1\n\n# alignment reads to reference\nfor fq in ../1.cleandata/*fastq.gz\ndo\nfqname=`\\basename $fq .fastq.gz`\n\n## rrna\nbowtie -p $threads -v $mismatch --un=\"$fqname\".norrna.fq --al=\"$fqname\".rrna.fq \\\n -x $rrna $fq -S \"$fqname\".rrna.sam 2>> \"$fqname\".log\n\n## trna\nbowtie -p $threads -v $mismatch --un=\"$fqname\".notrna.fq --al=\"$fqname\".trna.fq \\\n -x $trna \"$fqname\".norrna.fq -S \"$fqname\".trna.sam 2>> \"$fqname\".log\n\n## ncrna\nbowtie -p $threads -v $mismatch --un=\"$fqname\".noncrna.fq --al=\"$fqname\".ncrna.fq \\\n -x $ncrna \"$fqname\".notrna.fq -S \"$fqname\".ncrna.sam 2>> \"$fqname\".log\n\n## mrna\nbowtie -p $threads -v $mismatch --un=\"$fqname\".nomrna.fq --al=\"$fqname\".mrna.fq \\\n -x $mrna \"$fqname\".noncrna.fq -S \"$fqname\".mrna.sam 2>> \"$fqname\".log\n\n## genome\nbowtie -p $threads -v $mismatch --un=\"$fqname\".nogenome.fq --al=\"$fqname\".genome.fq \\\n -x $chrom \"$fqname\".nomrna.fq -S \"$fqname\".genome.sam 2>> \"$fqname\".log\n\n## compress fastq\npigz *fq\n\n## compress sam\nfor sam in *.sam\ndo\n\nsamtools view -h -F 4 $sam | samtools sort -@ $threads -o `\\basename $sam sam`bam\nrm $sam\n\ndone\n\ndone\n```\n\n2. Statistical alignment results for all databases.\n```bash\n#################################################\n# merge all log files\nmerge_bwt_log -n rRNA,tRNA,ncRNA,mRNA,Genome -l *log -o sce &>> merge_bowtie.log\n```\n\n\n#### 3.2.4 Aligning mRNA reads (Ribo-seq) using `STAR`\n\nFollowing the removal of ncRNA reads, the remaining clean reads were realigned to the yeast genome using `STAR`.\n\n```bash\n$ cd\n$ mkdir -p ./sce/4.ribo-seq/3.star/\n$ cd ./sce/4.ribo-seq/3.star/\n\n#################################################\n# set the option and database\ngenome='../../1.reference/star-index/'\nthreads=12\n\n#################################################\n# map the all rna-seq reads to genome and transcriptome region\nfor fastq in ../2.bowtie/*.noncrna.fq.gz\ndo\n\n## get file name\noutput=$(basename $fastq .noncrna.fq.gz)\n\n#################################################\n## run the alignment\nSTAR --runThreadN $threads \\\n --readFilesCommand zcat \\\n --genomeDir $genome \\\n --readFilesIn $fastq \\\n --outFileNamePrefix $output \\\n --outSAMtype BAM Unsorted \\\n --outFilterType BySJout \\\n --quantMode TranscriptomeSAM GeneCounts \\\n --outReadsUnmapped Fastx \\\n --outSAMattributes All \\\n --alignEndsType Local \\\n --outFilterMultimapNmax 3 \\\n --outFilterMismatchNmax 1 \\\n --alignIntronMax 10000 \\\n --outFilterMatchNmin 20\n# --outWigType wiggle --outWigNorm RPM\n\npigz *mate1\n\n#################################################\n## sort the bam file\nsamtools sort -@ $threads $output\"Aligned.out.bam\" -o $output\"Aligned.sortedByCoord.out.bam\"\nsamtools index -@ $threads $output\"Aligned.sortedByCoord.out.bam\"\nrm $output\"Aligned.out.bam\"\n\ndone\n```\n\n\n#### 3.2.5 Estimating gene expression levels with either `RSEM` or `featureCounts`\n\nBoth `RSEM` and `featureCounts` can be employed to quantify gene expression levels. For the purpose of this analysis, we will utilize `RSEM` as a representative tool.\n\n1. Estimating transcript abundance using Ribo-seq data\n\n```bash\n$ cd\n$ mkdir -p ./sce/4.ribo-seq/4.quantification/\n$ cd ./sce/4.ribo-seq/4.quantification/\n\n#################################################\n# quantify the isoforms expression\nfor bam in ../3.star/*Aligned.toTranscriptome.out.bam\ndo\n\nrsem-calculate-expression -p 12 \\\n --no-bam-output --alignments \\\n -q $bam ../../1.reference/rsem-index/rsem `\\basename $bam Aligned.toTranscriptome.out.bam`\n\ndone\n```\n\n2. Integrating Ribo-seq quantification values for all samples\n\n```bash\n#################################################\n# merge the gene expression\nmerge_rsem -c expected_count -l *.genes.results -o gene.expected_count.txt &>> merge_rsem.log\nmerge_rsem -c TPM -l *.genes.results -o gene.TPM.txt &>> merge_rsem.log\nmerge_rsem -c FPKM -l *.genes.results -o gene.FPKM.txt &>> merge_rsem.log\n\n#################################################\n# merge the isoforms expression\nmerge_rsem -c expected_count -l *.isoforms.results -o isoforms.expected_count.txt &>> merge_rsem.log\nmerge_rsem -c TPM -l *.isoforms.results -o isoforms.TPM.txt &>> merge_rsem.log\nmerge_rsem -c FPKM -l *.isoforms.results -o isoforms.FPKM.txt &>> merge_rsem.log\n```\n\n\n## 4. Perform RNA-seq data analysis of `GSE67387` with `RiboParser`\n\n### 4.0 Prepare the directory to store the results\n\n```bash\n$ cd\n$ mkdir -p ./3.rna-seq/5.riboparser/\n$ cd ./3.rna-seq/5.riboparser/\n$ mkdir 01.qc 02.digestion 03.offset 04.density 05.merge \\\n 06.periodicity 07.metaplot 08.coverage 09.correlation 10.shuffle\n```\n\n### 4.1 Quality check of sequencing data\n\n1. Quality check of RNA-seq data\n\n```bash\n$ cd ./01.qc\n\n#################################################\n# check the ribo-seq quality\nfor bam in ../../3.star/*Aligned.toTranscriptome.out.bam\ndo\nprefix_name=$(basename $bam Aligned.toTranscriptome.out.bam)\n\nrpf_Check -b $bam -s --thread 10 \\\n -t ../../../1.reference/norm/gene.norm.txt \\\n -o $prefix_name &> $prefix_name\".log\"\n\ndone\n```\n\n2. Integrating RNA-seq quality check results for all samples\n```bash\n\n#################################################\n# merge the rna-seq quality results\nmerge_length -l *length_distribution.txt -o sce\nmerge_saturation -l *gene_saturation.txt -o sce\n\ncd ..\n```\n\n### 4.2 Enzymatic bias in NGS library preparation\n1. Bias in restriction enzyme digestion and ligation in sequencing data\n\n```bash\n$ cd ./02.digestion/\n\n#################################################\n# check the reads digestion\nfor bam in ../01.qc/*.bam\ndo\nprefix_name=$(basename $bam .bam)\n\nrpf_Digest -b $bam -m 25 -M 50 --scale \\\n -s ../../../1.reference/norm/gene.norm.rna.fa \\\n -t ../../../1.reference/norm/gene.norm.txt \\\n -o $prefix_name &> $prefix_name\".log\"\n\ndone\n```\n\n2. Integrating reads digestion results for all samples\n\n```bash\n#################################################\n# merge the rpf digestion\nmerge_digestion -l ./02.digestion/*pwm.txt -o sce\n\ncd ..\n```\n\n### 4.3 Use `RiboParser` to create the `offset` table\n\n1. Create the `offset` table for RNA-seq\n\n`Offset` prediction is unnecessary for RNA-seq analysis. A constant `offset` of 12 can be assigned to all entries in the table.\n\n```bash\n$ cd ./03.offset/\n\n#################################################\n# set the offset table\nfor bam in ../01.qc/*.bam\ndo\n\nprefix_name=$(basename $bam .bam)\nrna_Offset -m 27 -M 50 -e 12 -o $prefix_name &> $prefix_name\".log\"\n\ndone\n```\n\n### 4.4 Convert the `BAM` file to reads density\n\nTransform the read counts in a `BAM` file into density values and save them in a `TXT` format file.\n\n1. Transform RNA-seq data output\n\n```bash\n$ cd ./04.density/\n\n#################################################\n# convert the reads to density\nfor bam in ../01.qc/*.bam\ndo\nprefix_name=$(basename $bam .bam)\n\nrna_Density -b $bam -m 27 -M 33 -l --thread 10 \\\n -p ../03.offset/$prefix_name\"_offset.txt\" \\\n -s ../../../1.reference/norm/gene.norm.rna.fa \\\n -t ../../../1.reference/norm/gene.norm.txt \\\n -o $prefix_name &> $prefix_name\".log\"\n\ndone\n\ncd ..\n```\n\n\n### 4.5 Integrating density file for all samples\n\nData from various batches and samples can be integrated for unified analysis. If not integrated, individual analysis for each sample is required, increasing the number of operational steps.\n\n1. Integrating RNA-seq density results for all samples\n```bash\n$ cd ./05.merge/\n\n#################################################\n# create the samples file: RNA.file.list\nmerge_dst_list -l ../04.density/*_rna.txt -o RNA.file.list\n\ncat RNA.file.list\n\nName File Type\nwt_rna_YPD1 /home/sce/3.rna-seq/5.riboparser/04.density/SRR1944924_rna.txt RNA\nwt_rna_YPD2 /home/sce/3.rna-seq/5.riboparser/04.density/SRR1944925_rna.txt RNA\nwt_rna_YPD3 /home/sce/3.rna-seq/5.riboparser/04.density/SRR1944926_rna.txt RNA\nncs2d_rna_YPD1 /home/sce/3.rna-seq/5.riboparser/04.density/SRR1944927rna.txt RNA\nncs2d_rna_YPD2 /home/sce/3.rna-seq/5.riboparser/04.density/SRR1944928_rna.txt RNA\nncs2d_rna_YPD3 /home/sce/3.rna-seq/5.riboparser/04.density/SRR1944929_rna.txt RNA\nelp6d_rna_YPD1 /home/sce/3.rna-seq/5.riboparser/04.density/SRR1944930_rna.txt RNA\nelp6d_rna_YPD2 /home/sce/3.rna-seq/5.riboparser/04.density/SRR1944931_rna.txt RNA\nelp6d_rna_YPD3 /home/sce/3.rna-seq/5.riboparser/04.density/SRR1944932_rna.txt RNA\nncs2d_elp6d_rna_YPD1 /home/sce/3.rna-seq/5.riboparser/04.density/SRR1944933_rna.txt RNA\nncs2d_elp6d_rna_YPD2 /home/sce/3.rna-seq/5.riboparser/04.density/SRR1944934_rna.txt RNA\nncs2d_elp6d_rna_YPD3 /home/sce/3.rna-seq/5.riboparser/04.density/SRR1944935_rna.txt RNA\n\n#################################################\n# merge all the RNA-seq files\nrpf_Merge -l RNA.file.list -o RNA &> RNA.log\n\ncd ..\n```\n\n### 4.6 Calculate tri-nucleotide periodicity\n\n1. Check the tri-nucleotide periodicity of RNA-seq\n```bash\n$ cd ./06.periodicity/\n\n#################################################\n# check the periodicity\nrpf_Periodicity \\\n -r ../05.merge/RNA_merged.txt \\\n -m 30 --tis 0 --tts 0 -o RNA &> RNA.log\n```\n\n### 4.7 `Meta-gene` analysis\n\nInvestigation of reads density in the vicinity of start and stop codons using `meta-gene` analysis.\n\n1. `Meta-gene` analysis of RNA-seq\n\n```bash\n$ cd ./07.metaplot/\n\n#################################################\n# metagene analysis\nrpf_Metaplot \\\n -t ../../../1.reference/norm/gene.norm.txt \\\n -r ../05.merge/RNA_merged.txt \\\n -m 50 --mode bar -o RNA &> RNA.log\n```\n\n### 4.8 Gene coverage\n\nExamine the distribution of read density along the gene body.\n\n1. Check gene density of RNA-seq\n\n```bash\n$ cd ./08.coverage/\n\n#################################################\n# check the reads density along with the gene body\nrpf_Coverage \\\n -t ../../../1.reference/norm/gene.norm.txt \\\n -r ../05.merge/RNA_merged.txt \\\n -m 50 --outlier \\\n -b 10,100,10 \\\n -n --heat \\\n -o RNA &> RNA.log\n```\n\n### 4.9 Check the repeatability of samples\n\n1. Check the repeatability of RNA-seq\n\n```bash\n$ cd ./09.correlation/\n\n#################################################\n# calculate the samples replication of RNA-seq\nrpf_Corr \\\n -r ../05.merge/RNA_merged.txt \\\n -o RNA &> RNA.log\n```\n\n\n## 5. Perform RNA-seq data analysis of `GSE67387` with `RiboParser`\n\n### 5.0 Prepare the directory to store the results\n\n```bash\n$ cd\n$ mkdir -p ./4.ribo-seq/5.riboparser/\n$ cd ./4.ribo-seq/5.riboparser/\n$ mkdir 01.qc 02.digestion 03.offset 04.density 05.merge \\\n 06.periodicity 07.metaplot 08.coverage 09.correlation 10.quantification \\\n 11.pausing_score 12.codon_occupancy 13.codon_decoding_time 14.codon_selection_time \\\n 15.coefficient_of_variation 16.meta_codon 17.shuffle 18.retrieve\n```\n\n### 5.1 Quality check of sequencing data\n\nTo ensure the reliability of downstream analyses, rigorous quality control (QC) of ribosome profiling data must include systematic evaluation of peak detection and gene detection rate.\n\n`Peak Detection`:\\\nCalculate the dominant RPF length (expected range: 26\u201334 nt) and its proportion within the total reads (QC pass threshold: \u226570% in expected range).\n\n`Gene Detection Rate`:\\\nSampling from 5% to 95% of the data, the number of covered genes and their expression levels are calculated. When the slope of the gene count curve approaches 0, it typically indicates that the sequencing data is near saturation.\n\n1. Explanation of `rpf_Check`\n\n```bash\n$ rpf_Check -h\n\nCheck the RPFs mapping condition.\n\nStep1: Checking the input Arguments.\n\nusage: rpf_Check [-h] -t TRANSCRIPT -b BAM -o OUTPUT [--thread THREAD] [-g {0,1}] [-a {star,hisat2,bowtie2}] [-r] [-l] [-s]\n\nThis script is used to summary the BAM condition.\n\noptions:\n -h, --help show this help message and exit.\n --thread THREAD the number of threads (default: 1). Suitable for large bam files > 1G.\n It will take a lot of memory.\n -g {0,1} filter the number of reads mapped loci (default: 0). [0]: all reads will be\n used; [1]: reads with unique mapped loci will be used\n -a {star,hisat2,bowtie2}\n screen the reads mapped loci from BAM file generate with different reads alignment methods (default: star).\n -r reads aligned to negative strand will also be counted. (default: False).\n -l only keep the longest transcripts (default: False).\n -s whether to calculate RPF saturation. (default: False). This step will take \n a lot of time and memory.\n\nRequired arguments:\n -t TRANSCRIPT the input file name of gene annotation.\n -b BAM the input file name of bam.\n -o OUTPUT the prefix of output file.\n```\n\n2. Quality check of ribo-seq data\n\n```bash\n$ cd ./01.qc/\n\n#################################################\n# check the ribo-seq quality\nfor bam in ../../3.star/*Aligned.toTranscriptome.out.bam\ndo\nprefix_name=$(basename $bam Aligned.toTranscriptome.out.bam)\n\nrpf_Check -b $bam -s --thread 10 \\\n -t ../../../1.reference/norm/gene.norm.txt \\\n -o $prefix_name &> $prefix_name\".log\"\n\ndone\n```\n\n3. Results of `rpf_Check`\n\n```bash\n# results of sample SRR1944912\nSRR1944912.bam # Filtered and sorted BAM file\nSRR1944912.bam.bai # Index file for the BAM file\nSRR1944912_gene_saturation.pdf # Barplot for gene saturation analysis\nSRR1944912_gene_saturation.png # Barplot for gene saturation analysis\nSRR1944912_gene_saturation.txt # Statistical results of gene saturation analysis\nSRR1944912_length_distribution.pdf # Line graph of reads length distribution\nSRR1944912_length_distribution.png # Line graph of reads length distribution\nSRR1944912_length_distribution.txt # Statistical results of reads length distribution\nSRR1944912.log # Log file of program execution\nSRR1944912_reads_saturation.pdf # Boxplot for reads saturation analysis\nSRR1944912_reads_saturation.png # Boxplot for reads saturation analysis\nSRR1944912_reads_saturation.txt # Statistical results of reads saturation analysis\n```\n\n\n4. Explanation of `merge_length` and `merge_saturation`\n\n```bash\nStep1: Checking the input Arguments.\nusage: merge_length [-h] -l LIST [LIST ...] -o OUTPUT\n\nThis script is used to merge the length distribution files\n\noptions:\n -h, --help show this help message and exit\n\nRequired arguments:\n -l LIST [LIST ...], --list LIST [LIST ...]\n List for length distribution files (e.g., '*length_distribution.txt').\n -o OUTPUT prefix of output file name.\n```\n\n5. Integrating Ribo-seq quality check results for all samples\n\n```bash\n#################################################\n# merge the ribo-seq quality results\nmerge_length -l *length_distribution.txt -o RIBO\nmerge_saturation -l *gene_saturation.txt -o RIBO\n\ncd ..\n```\n\n\n### 5.2 Enzymatic bias in NGS library preparation\n\nRibonuclease digestion and ligation steps in Ribo-seq can introduce biases that affect the representation of RNA fragments.\n\n`Ribonuclease Digestion Bias`:\\\nRibonucleases may preferentially cleave certain RNA sequences or structures, leading to uneven fragmentation and skewed representation of RNA species in the data.\n\n`Ligation Bias`:\\\nLigation efficiency can vary by sequence context, RNA fragment length, or secondary structure. This may result in overrepresentation of certain fragments due to more favorable ligation sites.\n\nTo reduce these biases, it's important to analyze digestion and ligation efficiency, optimize protocols, and include controls to ensure data accuracy and reliability.\n\n1. Explanation of `rpf_Digest`\n\n```bash\n$ rpf_Digest -h\n\nStep1: Checking the input Arguments.\n\nusage: rpf_Digest [-h] -t TRANSCRIPT -s SEQUENCE -b BAM -o OUTPUT [-l] [--scale] [-m MIN] [-M MAX]\n\nThis script is used to Detect the digestion sites.\n\noptions:\n -h, --help show this help message and exit\n -l only retain the transcript with longest CDS of each gene (default: False).Recommended : True\n --scale scale the motif matrix (default: False).\n -m MIN the minimum reads length to keep (default: 20 nt).\n -M MAX the maximum reads length to keep (default: 100 nt).\n\nRequired arguments:\n -t TRANSCRIPT the name of input transcript file in TXT format.\n -s SEQUENCE the name of input transcript sequence file in FA format.\n -b BAM the name of mapping file in BAM format.\n -o OUTPUT the name of output file. (prefix + _digestion_sites.txt)\n```\n\n2. Bias in restriction enzyme digestion and ligation in sequencing data\n\n```bash\n$ cd ./02.digestion/\n\n#################################################\n# check the reads digestion\nfor bam in ../01.qc/*.bam\ndo\nprefix_name=$(basename $bam .bam)\n\nrpf_Digest -b $bam -m 27 -M 33 --scale \\\n -s ../../../1.reference/norm/gene.norm.rna.fa \\\n -t ../../../1.reference/norm/gene.norm.txt \\\n -o $prefix_name &> $prefix_name\".log\"\n\ndone\n```\n\n3. Results of `rpf_Digest`\n\n```bash\n# results of SRR1944912\nSRR1944912_3end_counts.txt # Statistical values of different bases at the 3'-end of reads\nSRR1944912_3end_pwm.txt # PWM matrix of bases at the 3'-end of reads\nSRR1944912_3end_seqlogo2.pdf # Seqlogo of bases at the 3'-end of reads\nSRR1944912_5end_counts.txt # Statistical values of different bases at the 5'-end of reads\nSRR1944912_5end_pwm.txt # PWM matrix of bases at the 5'-end of reads\nSRR1944912_5end_seqlogo2.pdf # Seqlogo of bases at the 5'-end of reads\nSRR1944912_digestion_sites.txt # Open reading frame aligned at the ends of reads\nSRR1944912.log # Log file of program execution\nSRR1944912_scaled_digestion_sites_plot.pdf # Heatmap of open reading frame aligned at the ends of reads\n```\n\n\n4. Explanation of `merge_digestion`\n\n```bash\n$ merge_digestion -h\n\nStep1: Checking the input Arguments.\nusage: merge_digestion [-h] -l LIST [LIST ...] -o OUTPUT\n\nThis script is used to merge the reads digestion files\n\noptions:\n -h, --help show this help message and exit\n\nRequired arguments:\n -l LIST [LIST ...], --list LIST [LIST ...]\n List for digestion files (e.g., '*_5end_pwm.txt').\n -o OUTPUT prefix of output file name.\n```\n\n5. Integrating reads digestion results for all samples\n```bash\n#################################################\n# merge the rpf digestion\nmerge_digestion -l *pwm.txt -o RIBO\n\ncd ..\n```\n\n\n### 5.3 Use `RiboParser` to predict the optimal offset\n\nThe ability of Ribo-seq to analyze translation at codon resolution relies on accurately identifying the codons located at ribosomal `A-`, `P-`, and `E-`sites for each RPF.\nThe offset represents the distance of the 5\u2019 end of the RPF to the first nucleotide of the `P-site` codon. \nTwo commonly used methods are the ribosome structure-based model (`RSBM`) and the start/stop-based model (`SSCBM`).\n\n1. Explanation of `rpf_Offset`\n\n```bash\n$ rpf_Offset -h\n\nStep1: Checking the input Arguments.\n\nusage: rpf_Offset [-h] -t TRANSCRIPT -b BAM -o OUTPUT [--mode {SSCBM,RSBM}] [-a {both,tis,tts}] [-l] [-m MIN] [-M MAX] [-p EXP_PEAK] [-s SHIFT] [--silence] [-d]\n\nThis script is used to detect the P-site offset.\n\noptions:\n -h, --help show this help message and exit\n --mode {SSCBM,RSBM} specify the mode of offset detect [SSCBM, RSBM]. (default: SSCBM).\n -a {both,tis,tts} specify the alignment of reads for offset detect [both, tis, tts]. (default: both).\n -l only retain the transcript with longest CDS of each gene (default: False).\n -m MIN the minimum reads length to keep (default: 27 nt).\n -M MAX the maximum reads length to keep (default: 33 nt).\n -p EXP_PEAK expected RPFs length fitted to ribosome structure [~30 nt] (default: 30 nt).\n -s SHIFT psite shift for different RPFs length. (default: 2 nt).\n --silence discard the warning information. (default: True).\n -d output the details of offset (default: False).\n\nRequired arguments:\n -t TRANSCRIPT the name of input transcript filein TXT format.\n -b BAM the name of mapping file in BAM format.\n -o OUTPUT the prefix of output file. (prefix + _offset.txt)\n```\n\n2. Predict the optimal offset of Ribo-seq\n\n```bash\n$ cd ../03.offset/\n\n#################################################\n# predict the offset table\nfor bam in ../01.qc/*.bam\ndo\nprefix_name=$(basename $bam .bam)\n\nrpf_Offset -b $bam -m 27 -M 33 -p 30 -d \\\n --mode SSCBM \\\n -t ../../../1.reference/norm/gene.norm.txt \\\n -o $prefix_name &> $prefix_name\".log\"\n\ndone\n```\n\n3. Results of `rpf_Offset`\n\n```bash\n# results of SRR1944912\nSRR1944912.log # Log file of program execution\nSRR1944912_RSBM_offset.pdf # Heatmap of RSBM model calculation results\nSRR1944912_RSBM_offset.png # Heatmap of RSBM model calculation results\nSRR1944912_RSBM_offset.txt # Calculation results of the RSBM model\nSRR1944912_SSCBM_offset.pdf # Heatmap of SSCBM model calculation results\nSRR1944912_SSCBM_offset.png # Heatmap of SSCBM model calculation results\nSRR1944912_SSCBM_offset_scale.pdf # Row-normalized heatmap of SSCBM model calculation results\nSRR1944912_SSCBM_offset_scale.png # Row-normalized heatmap of SSCBM model calculation results\nSRR1944912_SSCBM_offset.txt # Calculation results of the SSCBM model\nSRR1944912_tis_3end.txt # Distribution statistics of reads 3'-end at the start codon\nSRR1944912_tis_5end.txt # Distribution statistics of reads 5'-end at the start codon\nSRR1944912_tts_3end.txt # Distribution statistics of reads 3'-end at the stop codon\nSRR1944912_tts_5end.txt # Distribution statistics of reads 5'-end at the stop codon\n```\n\n4. Explanation of `merge_offset`\n\n```bash\n$ merge_offset -h\n\nStep1: Checking the input Arguments.\nusage: merge_offset [-h] -l LIST [LIST ...] -o OUTPUT\n\nThis script is used to merge the reads offset files\n\noptions:\n -h, --help show this help message and exit\n\nRequired arguments:\n -l LIST [LIST ...], --list LIST [LIST ...]\n List for RSBM/SSCBM offset files (e.g., '*RSBM_offset.txt').\n -o OUTPUT prefix of output file name (default: prefix + _offset.txt).\n```\n\n5. Integrating offset table results for all samples\n\n```bash\n#################################################\n# merge the ribo-seq offset results\nmerge_offset_detail -l *end.txt -o RIBO\nmerge_offset -l *SSCBM_offset.txt -o RIBO_SSCBM\nmerge_offset -l *RSBM_offset.txt -o RIBO_RSBM\n\ncd ..\n```\n\n\n### 5.4 Convert the `BAM` file to reads density\n\nTransform the read counts in a `BAM` file into density values and save them in a `TXT` format file.\n\n1. Explanation of `rpf_Density`\n\n```bash\n$ rpf_Density -h\n\nConvert reads to RPFs density.\n\nStep1: Checking the input Arguments.\n\nusage: rpf_Density [-h] -t TRANSCRIPT -s SEQUENCE -b BAM -p PSITE -o OUTPUT [-l] [-m MIN] [-M MAX] [--period PERIODICITY] [--silence] [--thread THREAD]\n\nThis script is used to convert Ribo-seq bam to p-site density.\n\noptions:\n -h, --help show this help message and exit\n -l only retain the transcript with longest CDS of each gene (default: False).Recommended : True\n -m MIN the minimum reads length to keep (default: 27 nt).\n -M MAX the maximum reads length to keep (default: 33 nt).\n --silence discard the warning information. (default: True).\n --thread THREAD the number of threads (default: 1). It will take a lot of memory.\n\nRequired arguments:\n -t TRANSCRIPT the name of input transcript file in TXT format.\n -s SEQUENCE the name of input transcript sequence file in FA format.\n -b BAM the name of mapping file in BAM format.\n -p PSITE the name of p-site offset file in TXT format.\n -o OUTPUT the prefix of output file. (output = prefix + _rpf.txt)\n --period PERIODICITY the minimum 3nt periodicity to keep. (default: 40).\n```\n\n2. Transform Ribo-seq data output\n\n```bash\n#################################################\n# convert the rpf to density\nfor bam in ../01.qc/*.bam\ndo\nprefix_name=$(basename $bam .bam)\n\nrpf_Density -b $bam -m 27 -M 33 --period 40 -l --thread 12 \\\n -p ../03.offset/$prefix_name\"_SSCBM_offset.txt\" \\\n -s ../../../1.reference/norm/gene.norm.rna.fa \\\n -t ../../../1.reference/norm/gene.norm.txt \\\n -o $prefix_name &> $prefix_name\".log\"\n\ndone\n\ncd ..\n```\n\n3. Results of `rpf_Density`\n\n```bash\n# results of SRR1944912\nSRR1944912.log # Log file of program execution\nSRR1944912_rpf.txt # File contains RPFs density on each gene\n```\n\n\n### 5.5 Integrating density file for all samples\n\nData from various batches and samples can be integrated for unified analysis. If not integrated, individual analysis for each sample is required, increasing the number of operational steps.\n\n1. Explanation of `merge_dst_list`\n\n```bash\n$ merge_dst_list -h\n\nStep1: Checking the input Arguments.\nusage: merge_dst_list [-h] -l LIST [LIST ...] [-o OUTPUT]\n\nThis script is used to create the list of density files\n\noptions:\n -h, --help show this help message and exit\n\nRequired arguments:\n -l LIST [LIST ...], --list LIST [LIST ...]\n List for density files (e.g., '*_rpf.txt').\n -o OUTPUT output file name (default: RIBO.file.list).\n```\n\n2. create the samples list\n\n```bash\n$ cd ./05.merge/\n\n#################################################\n# create the samples file: Ribo.file.list\nmerge_dst_list -l ../04.density/*_rpf.txt -o RIBO.file.list\n\ncat RIBO.file.list\n\nName File Type\nwt_ribo_YPD1\t/home/sce/4.ribo-seq/5.riboparser/04.density/SRR1944912_rpf.txt Ribo\nwt_ribo_YPD2\t/home/sce/4.ribo-seq/5.riboparser/04.density/SRR1944913_rpf.txt Ribo\nwt_ribo_YPD3\t/home/sce/4.ribo-seq/5.riboparser/04.density/SRR1944914_rpf.txt Ribo\nncs2d_ribo_YPD1\t/home/sce/4.ribo-seq/5.riboparser/04.density/SRR1944915_rpf.txt Ribo\nncs2d_ribo_YPD2\t/home/sce/4.ribo-seq/5.riboparser/04.density/SRR1944916_rpf.txt Ribo\nncs2d_ribo_YPD3\t/home/sce/4.ribo-seq/5.riboparser/04.density/SRR1944917_rpf.txt Ribo\nelp6d_ribo_YPD1\t/home/sce/4.ribo-seq/5.riboparser/04.density/SRR1944918_rpf.txt Ribo\nelp6d_ribo_YPD2\t/home/sce/4.ribo-seq/5.riboparser/04.density/SRR1944919_rpf.txt Ribo\nelp6d_ribo_YPD3\t/home/sce/4.ribo-seq/5.riboparser/04.density/SRR1944920_rpf.txt Ribo\nncs2d_elp6d_ribo_YPD1\t/home/sce/4.ribo-seq/5.riboparser/04.density/SRR1944921_rpf.txt Ribo\nncs2d_elp6d_ribo_YPD2\t/home/sce/4.ribo-seq/5.riboparser/04.density/SRR1944922_rpf.txt Ribo\nncs2d_elp6d_ribo_YPD3\t/home/sce/4.ribo-seq/5.riboparser/04.density/SRR1944923_rpf.txt Ribo\n```\n\n3. Explanation of `rpf_Merge`\n\n```bash\n$ rpf_Merge -h\n\nMerge RPFs files from different samples.\n\nStep1: Checking the input Arguments.\nusage: rpf_Merge [-h] -l LIST -o OUTPUT\n\nThis script is used to merge the density file.\n\noptions:\n -h, --help show this help message and exit\n\nRequired arguments:\n -l LIST the sample list in TXT format.\n -o OUTPUT the prefix of output file. (prefix + _merged.txt)\n```\n\n\n4. Integrating Ribo-seq density results for all samples\n\n```bash\n#################################################\n# merge all the Ribo-seq files\nrpf_Merge -l RIBO.file.list -o RIBO &> RIBO.log\n\ncd ..\n```\n\n5. Results of `rpf_Merge`\n\n```bash\nRIBO.log # Log file of program execution\nRIBO.file.list # RPFs density file list of samples\nRIBO_merged.txt # Merged RPFs density file\n```\n\n\n\n### 5.6 Calculate tri-nucleotide periodicity\n\nTri-nucleotide periodicity serves as a critical quality metric in Ribo-seq data analysis, fundamentally determining the biological interpretability of codon-resolution findings.\n\nHigh-quality periodicity (typically >0.6 phase coherence score) constitutes an essential prerequisite for robust codon-level analysis, whereas insufficient periodicity (<0.45) systematically introduces frame ambiguity artifacts that compromise translational measurements.\n\nLoss of reading frame synchronization leads to misassignment of ribosome `A/P/E` sites, generating false-positive pause sites from out-of-frame read aggregation.\n\n1. Explanation of `rpf_Periodicity`\n\n```bash\n$ rpf_Periodicity -h\n\nDraw the periodicity plot.\n\nStep1: Checking the input Arguments.\n\nusage: rpf_Periodicity [-h] -r RPF -o OUTPUT [-t TRANSCRIPT] [-m MIN] [--tis TIS] [--tts TTS]\n\nThis script is used to draw the periodicity plot.\n\noptions:\n -h, --help show this help message and exit\n -m MIN retain transcript with more than minimum RPFs. (default: 50).\n --tis TIS The number of codons after TIS will be discarded.. (default: 0 AA).\n --tts TTS The number of codons before TTS will be discarded.. (default: 0 AA).\n\nRequired arguments:\n -r RPF the name of input RPFs file in TXT format.\n -o OUTPUT the prefix of output file.\n -t TRANSCRIPT the name of input transcript filein TXT format.\n```\n\n2. Check the tri-nucleotide periodicity of Ribo-seq\n\n```bash\n$ cd ./06.periodicity/\n\n#################################################\n# check the periodicity\nrpf_Periodicity \\\n -r ../05.merge/RIBO_merged.txt \\\n -m 30 --tis 0 --tts 0 -o RIBO &> RIBO.log\n\ncd ..\n```\n\n3. Results of `rpf_Periodicity`\n\n```bash\nRIBO_count_periodicity_plot.pdf # Barplot of read counts showing 3-nucleotide periodicity\nRIBO_count_periodicity_plot.png # Barplot of read counts showing 3-nucleotide periodicity\nRIBO.log # Log file of program execution\nRIBO_periodicity.txt # Statistical values of 3-nucleotide periodicity of all samples\nRIBO_ratio_periodicity_plot.pdf # Barplot of read ratios showing 3-nucleotide periodicity\nRIBO_ratio_periodicity_plot.png # Barplot of read ratios showing 3-nucleotide periodicity\n```\n\n\n### 5.7 `Meta-gene` analysis\n\nOur `Meta-gene` analysis framework enables systematic investigation of translation dynamics proximal to start and stop codons.\nIt entails aggregates ribosome-protected fragment (RPF) density across -15 and +60 codon windows relative to initiation/termination sites, normalized by transcript abundance.\n\nThis reveals:\\\nPeak/trough patterns indicative of translation initiation efficiency,\\\nRead accumulation gradients reflecting termination kinetics,\\\nTri-Nucleotide Periodicity Quantification.\n\n1. Explanation of `rpf_Metaplot`\n\n```bash\n$ rpf_Metaplot -h\n\nDraw the metaplot.\n\nStep1: Checking the input Arguments.\n\nusage: rpf_Metaplot [-h] -t TRANSCRIPT -r RPF -o OUTPUT [-m MIN] [--utr5 UTR5] [--cds CDS] [--utr3 UTR3] [-n] [--mode {line,bar}]\n\nThis script is used to draw the meta plot.\n\noptions:\n -h, --help show this help message and exit\n -m MIN delete transcript with less than minimum RPFs. (default: 50).\n --utr5 UTR5 the codon number in 5-utr region (default: 20 AA).\n --cds CDS the codon number in cds region (default: 50 AA).\n --utr3 UTR3 the codon number in 3-utr region (default: 20 AA).\n -n normalize the RPFs count to RPM. (default: False).\n --mode {line,bar} specify the mode of metaplot. (default: bar).\n\nRequired arguments:\n -t TRANSCRIPT the name of input transcript filein TXT format.\n -r RPF the name of input RPFs file in TXT format.\n -o OUTPUT the prefix name of output file.\n```\n\n2. `Meta-gene` analysis of Ribo-seq\n\n```bash\n$ cd ./07.metaplot/\n\n#################################################\n# metagene analysis\nrpf_Metaplot \\\n -t ../../../1.reference/norm/gene.norm.txt \\\n -r ../05.merge/RIBO_merged.txt \\\n -m 50 --mode bar -o RIBO &> RIBO.log\n\ncd ..\n```\n\n3. Results of `rpf_Metaplot`\n\n```bash\nRIBO.log # Log file of program execution\nRIBO_tis_tts_metaplot.txt # Metagene statistical values at start and stop codons across all samples\nRIBO_SRR1944912_meta_bar_plot.pdf # Metaplot of all samples at start and stop codons\nRIBO_SRR1944912_meta_bar_plot.png # Metaplot of all samples at start and stop codons\nRIBO_SRR1944912_tis_tts_metaplot.txt # Metagene statistical values at start and stop codons for sample SRR1944912\n```\n\n\n### 5.8 Gene coverage\n\nTo systematically assess potential technical or translation biases in Ribo-seq data, we developed an analytical pipeline that quantifies genome-wide RPF coverage uniformity across gene bodies.\n\nGenes are partitioned into fixed-length intervals (default: 10% of CDS length), with raw RPF counts per `bin` recorded.\nThese values undergo library-size normalization using reads per million (RPM) to account for transcript length and sequencing depth variations.\n\nProcessed data is rendered through three complementary analytical views:\\\n`Aggregate Density Profile`: Smoothed line plot displaying mean RPF density across all genes, highlighting global coverage trends.\\\n`Gene-specific Heatmap`: Matrix visualization of normalized RPF counts (log\u2082-transformed) ordered by gene expression levels, revealing individual gene coverage patterns.\\\n`Binned Coverage Distribution`: Stacked bar plot showing the percentage of genes achieving threshold coverage (\u226550% of expected reads) in each interval.\n\n1. Explanation of `rpf_Coverage`\n\n```bash\n$ rpf_Coverage -h\n\nDraw the metagene coverage.\n\nStep1: Checking the input Arguments.\nusage: rpf_Coverage [-h] -t TRANSCRIPT -r RPF [-o OUTPUT] [-f {0,1,2,all}] [-m MIN] [-b BIN] [-n] [--thread THREAD] [--outlier] [--set {intersect,union}] [--heat] [--bar]\n\nThis script is used to draw the coverage meta plot.\n\noptions:\n -h, --help show this help message and exit\n -f {0,1,2,all} set the reading frame for occupancy calculation. (default: all).\n -m MIN retain transcript with more than minimum RPFs. (default: 50).\n -b BIN adjust the transcript to specified bins. 30 for 5'-UTRand 3'-UTR, 100 for CDS. (default: 30,100,30).\n -n normalize the RPFs count to RPM. (default: False).\n --thread THREAD the number of threads. (default: 1).\n --outlier filter the outliers (default: False).\n --set {intersect,union}\n filter the gene list with 5-UTR / CDS / 3-UTR. (default: union).\n --heat draw the coverage heatmap of whole gene. (default: False).\n --bar draw the coverage barplot of whole gene. (default: False).\n\nRequired arguments:\n -t TRANSCRIPT the name of input transcript filein TXT format.\n -r RPF the name of input RPFs file in TXT format.\n -o OUTPUT the prefix of output file.\n```\n\n2. Check gene density of Ribo-seq\n\n```bash\n$ cd ./08.coverage/\n\n#################################################\n# check the rpf density along with the gene body\nrpf_Coverage \\\n -t ../../../1.reference/norm/gene.norm.txt \\\n -r ../05.merge/RIBO_merged.txt \\\n -m 50 --outlier \\\n -b 10,100,10 \\\n -n --heat \\\n -o RIBO &> RIBO.log\n\ncd ..\n```\n\n3. Results of `rpf_Coverage`\n\n```bash\nRIBO.log # Log file of program execution\nRIBO_SRR1944912_10_150_10_coverage.txt # Statistical results of RPF density distribution across genes\nRIBO_SRR1944912_10_150_10_heat_plot.png # Heatmap of RPF density distribution across genes\nRIBO_SRR1944912_coverage_bar_plot.pdf # Percentage barplot of RPF density distribution across genes\nRIBO_SRR1944912_coverage_bar_plot.png # Percentage barplot of RPF density distribution across genes\nRIBO_SRR1944912_coverage_line_plot.pdf # Percentage lineplot of RPF density distribution across genes\nRIBO_SRR1944912_coverage_line_plot.png # Percentage lineplot of RPF density distribution across genes\n```\n\n\n### 5.9 Check the repeatability of samples\n\nOur analytical pipeline provides a hierarchical framework for assessing sample reproducibility in ribosome profiling studies, implementing dual-level correlation analyses.\n\n`Gene-Level Reproducibility`:\\\nQuantifies ribosome-protected fragments (RPFs) across entire gene bodies, followed by inter-sample Pearson correlation coefficient calculation.\nThis approach evaluates global translation consistency, particularly suitable for highly expressed genes with robust ribosome coverage.\n\n`ORF-Level Reproducibility`:\\\nPerforms nucleotide-resolution quantification of RPFs within individual open reading frames (ORFs), then computes Pearson correlations between replicate samples.\nThis finer-grained analysis detects localized translational variations while maintaining phase-awareness through in-frame read filtering.\n\n\n1. Explanation of `rpf_Corr`\n\n```bash\n$ rpf_Corr -h\n\nDraw the correlation of samples.\n\nStep1: Checking the input Arguments.\n\nusage: rpf_Corr [-h] -r RPF -o OUTPUT\n\nThis script is used to draw the correlation of rpf density.\n\noptions:\n -h, --help show this help message and exit\n\nRequired arguments:\n -r RPF the name of input RPFs file in TXT format.\n -o OUTPUT the prefix of output file. (prefix + _rpf_merged.txt)\n```\n\n2. Check the repeatability of Ribo-seq\n\n```bash\n$ cd ./09.correlation/\n\n#################################################\n# calculate the samples replication of Ribo-seq\nrpf_Corr \\\n -r ../05.merge/RIBO_merged.txt \\\n -o RIBO &> RIBO.log\n```\n\n3. Results of `rpf_Corr`\n\n```bash\nRIBO_gene_corr_f0.txt # Pearson correlation of total RPFs on gene frame 0\nRIBO_gene_corr_f1.txt # Pearson correlation of total RPFs on gene frame 1\nRIBO_gene_corr_f2.txt # Pearson correlation of total RPFs on gene frame 2\nRIBO_gene_corr_frame.txt # Pearson correlation of RPFs across all gene frames\nRIBO_gene_correlation_plot.pdf # Heatmap of Pearson correlation of RPFs across all gene frames\nRIBO_gene_correlation_plot.png # Heatmap of Pearson correlation of RPFs across all gene frames\nRIBO.log # Log file of program execution\nRIBO_rpf_correlation_plot.pdf # Heatmap of Pearson correlation of RPFs on gene frame 0\nRIBO_rpf_correlation_plot.png # Heatmap of Pearson correlation of RPFs on gene frame 0\nRIBO_rpf_corr_f0.txt # Pearson correlation between RPFs on gene frame 0\nRIBO_rpf_corr_f1.txt # Pearson correlation between RPFs on gene frame 1\nRIBO_rpf_corr_f2.txt # Pearson correlation between RPFs on gene frame 2\nRIBO_rpf_corr_frame.txt # Pearson correlation between RPFs across all gene frames\n```\n\n\n### 5.10 Quantification of gene expression and translation levels\n\nRibo-seq quantification fundamentally differs from RNA-seq approaches.\nWhile RNA-seq quantifies expression via read coverage across transcripts, Ribo-seq specifically measures ribosome occupancy density within coding sequences (CDS).\nTo mitigate artifacts from transitional translation states, standard analytical pipelines exclude the first 15 codons downstream of start codons and the last 5 codons upstream of stop codons.\n\nFor enhanced precision, rigorous filtering protocols may be implemented to retain only in-frame ribosome-protected fragments (RPFs),\nsystematically removing out-of-frame reads that likely represent stochastic noise rather than true elongation events.\n\n\n1. Explanation of `rpf_Quant`\n\n```bash\n$ rpf_Quant -h\n\nQuantify the RPFs in the different region.\n\nStep1: Checking the input Arguments.\nusage: rpf_Quant [-h] -r RPF -o OUTPUT [-f {0,1,2,all}] [--tis TIS] [--tts TTS] [--utr5] [--utr3]\n\nThis script is used to quantify RPFs in the CDS region.\n\noptions:\n -h, --help show this help message and exit\n -f {0,1,2,all} set the reading frame for occupancy calculation. (default: all).\n --tis TIS The number of codons after TIS will be discarded. (default: 0).\n --tts TTS The number of codons before TES will be discarded. (default: 0).\n --utr5 quantification of 5'-utr. (default: False).\n --utr3 quantification of 3'-utr. (default: False).\n\nRequired arguments:\n -r RPF the name of input RPFs file in TXT format.\n -o OUTPUT the prefix of output file. (default: prefix + _rpf_quant.txt)\n```\n\n2. Quantification of Ribo-seq\n\n```bash\n$ cd ./10.quantification/\n\n#################################################\n# quantify the gene expression\nrpf_Quant \\\n -r ../05.merge/RIBO_merged.txt \\\n --tis 15 \\\n --tts 5 \\\n -o RIBO &> RIBO.log \n\ncd ..\n```\n\n3. Results of `rpf_Quant`\n\n```bash\nRIBO_cds_rpm_bar_plot.pdf # Barplot of RPFs distribution in the gene CDS region\nRIBO_cds_rpm_cdf_plot.pdf # Cumulative distribution plot of RPFs in the gene CDS region\nRIBO_cds_rpm_heatmap.pdf # Expression heatmap of RPFs in the gene CDS region\nRIBO_cds_rpm_pca_plot.pdf # Principal component analysis (PCA) plot of RPFs in the gene CDS region\nRIBO_cds_rpm_pca.txt # Principal component analysis (PCA) results of RPFs in the gene CDS region\nRIBO_cds_rpf_quant.txt # Statistical results of RPFs in the gene CDS region\nRIBO_cds_rpkm_quant.txt # Statistical results of RPKM in the gene CDS region\nRIBO_cds_rpm_quant.txt # Statistical results of RPM in the gene CDS region\nRIBO_cds_tpm_quant.txt # Statistical results of TPM in the gene CDS region\nRIBO.log # Log file of program execution\nRIBO_total.txt # Total RPFs across all samples\n```\n\n\n### 5.11 Calculate codon pausing score\n\nRibosome profiling studies are able to capture ribosome pausing because when a specific codon is translated more slowly on average, the increased ribosome occupancy indicates that there are more RPFs associated with that codon genome-wide.\n\nPause scores were calculated by normalizing per-nucleotide read counts against the gene's mean read density. For codon-level analysis, read counts across all three nucleotide positions of each codon could used for calculation. Final pause scores represent the averaged values derived from all occurrences of the target codon.\n\n1. Explanation of `rpf_Pausing`\n\n```bash\n$ rpf_Pausing -h\n\nCalculate the relative codon pausing score.\n\nStep1: Checking the input Arguments.\nusage: rpf_Pausing [-h] -r RPF [-l LIST] -o OUTPUT [-s {E,P,A}] [-f {0,1,2,all}] [-b BACKGROUND] [-m MIN] [--tis TIS] [--tts TTS] [-n] [--scale {zscore,minmax}] [--stop]\n [--fig {none,png,pdf}] [--all]\n\nThis script is used to calculate the relative pausing score.\n\noptions:\n -h, --help show this help message and exit\n -s {E,P,A} set the E/P/A-site for pausing calculation. (default: P).\n -f {0,1,2,all} set the reading frame for pausing calculation. (default: all).\n -b BACKGROUND set the codon number before and after p-site as the background. (default: 2).\n -m MIN retain transcript with more than minimum RPFs. (default: 50).\n --tis TIS The number of codons after TIS will be discarded. (default: 10 AA).\n --tts TTS The number of codons before TTS will be discarded. (default: 5 AA).\n -n normalize the RPFs count to RPM. (default: False).\n --scale {zscore,minmax}\n normalize the pausing score. (default: minmax).\n --stop rmove the stop codon. (default: False).\n --fig {none,png,pdf} draw the rpf pausing score of each gene (it will takes a lot of time). (default: none).\n --all output all pausing score of each gene. (default: False).\n\nRequired arguments:\n -r RPF the name of input RPFs file in TXT format.\n -l LIST the gene name list in TXT format. (default: whole).\n -o OUTPUT the prefix of output file.\n```\n\n2. Calculate codon-level pausing scores in Ribo-seq data\n\n```bash\n$ cd ./11.pausing_score/\n\n#################################################\n# calculate the codon pausing score of E/P/A site\nfor sites in E P A\ndo\nrpf_Pausing \\\n -l ../../../1.reference/norm/gene.norm.txt \\\n -r ../05.merge/RIBO_merged.txt \\\n -b 0 --stop \\\n -m 30 \\\n -s $sites \\\n -f 0 \\\n --scale minmax \\\n -o \"$sites\"_site &> \"$sites\"_site.log\ndone\n\ncd ..\n```\n\n3. Results of `rpf_Pausing`\n\n```bash\nA_site_cds_codon_pausing_score.txt # Pausing score for each codon at the A-site in the gene CDS region\nA_site_cds_pausing_score.txt # Sum of pausing score at the A-site in the gene CDS region\nA_site.log # Log file of program execution\nA_site_sum_codon_pausing_score.txt # Sum of pausing score for all codons at the A-site in the gene CDS region\nA_site_total_pausing_heatplot.pdf # Heatmap of pausing score for all codons at the A-site in the gene CDS region\nA_site_total_pausing_heatplot.png # Heatmap of pausing score for all codons at the A-site in the gene CDS region\nA_site_valid_pausing_heatplot.pdf # Heatmap of pausing score for valid codons at the A-site in the gene CDS region\nA_site_valid_pausing_heatplot.png # Heatmap of pausing score for valid codons at the A-site in the gene CDS region\n```\n\n\n### 5.12 Calculate codon occupancy\n\nFor global codon occupancy analysis, \n`A(P/E)-site` codons were assigned using the established criteria, \nand read counts at each codon were normalized against the average per-codon read density within their respective open reading frames (ORFs).\n\n\n1. Explanation of `rpf_Occupancy`\n\n```bash\n$ rpf_Occupancy -h\n\nCalculate the codon occupancy.\n\nStep1: Checking the input Arguments.\nusage: rpf_Occupancy [-h] -r RPF [-l LIST] -o OUTPUT [-s {E,P,A}] [-f {0,1,2,all}] [-m MIN] [-n] [--tis TIS] [--tts TTS] [--scale {zscore,minmax}] [--stop] [--all]\n\nThis script is used to draw the codon occupancy plot.\n\noptions:\n -h, --help show this help message and exit\n -s {E,P,A} set the E/P/A-site for occupancy calculation. (default: P).\n -f {0,1,2,all} set the reading frame for occupancy calculation. (default: all).\n -m MIN retain transcript with more than minimum RPFs. (default: 30).\n -n normalize the RPFs count to RPM. (default: False).\n --tis TIS The number of codons after TIS will be discarded. (default: 15 AA).\n --tts TTS The number of codons before TTS will be discarded.. (default: 5 AA).\n --scale {zscore,minmax}\n normalize the occupancy. (default: minmax).\n --stop rmove the stop codon. (default: False).\n --all output all RPFs density. (default: False).\n\nRequired arguments:\n -r RPF the name of input RPFs file in TXT format.\n -l LIST the gene name list in TXT format. (default: whole).\n -o OUTPUT the prefix of output file.\n```\n\n2. Calculate codon-level occupancy in Ribo-seq data\n\n```bash\n$ cd ./12.codon_occupancy/\n\n#################################################\n# calculate the codon occupancy of E/P/A site\nfor sites in E P A\ndo\nrpf_Occupancy \\\n -l ../../../1.reference/norm/gene.norm.txt \\\n -r ../05.merge/RIBO_merged.txt \\\n -m 30 \\\n -s \"$sites\" \\\n -f 0 --stop \\\n --scale minmax \\\n -o \"$sites\"_site &> \"$sites\"_site.log\n \ndone\n\ncd ..\n```\n\n3. Results of `rpf_Occupancy`\n\n```bash\nA_site_codon_density.txt # Codon occupancy for each codon at the A-site in the gene CDS region\nA_site_codon_occupancy.txt # Codon occupancy for all codon at the A-site in the gene CDS region\nA_site.log # Log file of program execution\nA_site_occupancy_corrplot.pdf # Correlation heatmap of codon occupancy at the A-site\nA_site_occupancy_corrplot.png # Correlation heatmap of codon occupancy at the A-site\nA_site_occupancy_corr.txt # Correlation of codon occupancy at the A-site\nA_site_occupancy_heatplot.pdf # Heatmap of codon occupancy at the A-site\nA_site_occupancy_heatplot.png # Heatmap of codon occupancy at the A-site\nA_site_occupancy_relative_heatplot.pdf # Heatmap of relative codon occupancy at the A-site\nA_site_occupancy_relative_heatplot.png # Heatmap of relative codon occupancy at the A-site\nA_site_occupancy_relative_lineplot.pdf # Lineplot of relative codon occupancy at the A-site\nA_site_occupancy_relative_lineplot.png # Lineplot of relative codon occupancy at the A-site\n```\n\n\n### 5.13 Calculate decoding time\n\nIn addition to translation elongation analysis performed at the Ribo-seq level, RNA-seq can also be used for correction to account for the influence of the mRNA's inherent state on translation.\n\n\n1. Explanation of `rpf_CDT`\n\n```bash\n$ rpf_CDT -h\n\nCalculate the codon decoding time.\n\nStep1: Checking the input Arguments.\nusage: rpf_CDT [-h] --rpf RPF --rna RNA -l LIST -o OUTPUT [-s {E,P,A}] [-f {0,1,2,all}] [-m MIN] [--tis TIS] [--tts TTS] [--scale {zscore,minmax}] [--stop]\n\nThis script is used to draw the codon decoding time plot.\n\noptions:\n -h, --help show this help message and exit\n -s {E,P,A} set the E/P/A-site for codon decoding time calculation. (default: P).\n -f {0,1,2,all} set the reading frame for codon decoding time calculation. (default: all).\n -m MIN retain transcript with more than minimum RPFs. (default: 30).\n --tis TIS The number of codons after TIS will be discarded.. (default: 15 AA).\n --tts TTS The number of codons before TTS will be discarded.. (default: 5 AA).\n --scale {zscore,minmax}\n normalize the codon decoding time. (default: minmax).\n --stop rmove the stop codon. (default: False).\n\nRequired arguments:\n --rpf RPF the name of input RPFs file in TXT format.\n --rna RNA the name of input reads file in TXT format.\n -l LIST the gene name list in TXT format. (default: whole).\n -o OUTPUT the prefix of output file.\n```\n\n2. Calculate codon-level decoding time in Ribo-seq data\n\n```bash\n$ cd ./13.codon_decoding_time/\n\n#################################################\n# calculate the codon decoding time of E/P/A site\nfor sites in E P A\ndo\nrpf_CDT \\\n -l ../../../1.reference/norm/gene.norm.txt \\\n --rna ../../../3.rna-seq/5.riboparser/05.merge/RNA_merged.txt \\\n --rpf ../05.merge/RIBO_merged.txt \\\n --stop \\\n -m 50 \\\n -f 0 \\\n -s $sites \\\n --tis 10 \\\n --tts 5 \\\n -o \"$sites\"_site &> \"$sites\"_site.log\n\ndone\n\ncd ..\n```\n\n3. Results of `rpf_CDT`\n\n```bash\nA_site_cdt_corrplot.pdf # Correlation heatmap of codon decoding time at the A-site\nA_site_cdt_corrplot.png # Correlation heatmap of codon decoding time at the A-site\nA_site_cdt_corr.txt # Correlation of codon decoding time at the A-site\nA_site_cdt_heatplot.pdf # Heatmap of codon decoding time at the A-site\nA_site_cdt_heatplot.png # Heatmap of codon decoding time at the A-site\nA_site_cdt.txt # Codon decoding time for all codon at the A-site in the gene CDS region\nA_site.log # Log file of program execution\n```\n\n\n### 5.14 Calculate selection time\n\nCodon usage is not equal across genomes, with some codons being more frequently used than others, thought to improve translational efficiency.\nHowever, previous studies reveal that translational efficiency is optimized by a mechanism involving proportional codon usage based on tRNA concentrations.\nThese results provide new insights into protein translation, explain unequal codon usage, and highlight natural selection for translational efficiency. Codon selection time can be used to quantify this process.\n\n1. Explanation of `rpf_CST`\n\n```bash\n$ rpf_CST -h\n\nCalculate the codon decoding time.\n\nStep1: Checking the input Arguments.\nusage: rpf_CST [-h] --rpf RPF --rna RNA [-l LIST] -o OUTPUT [-s {E,P,A}] [-f {0,1,2,all}] [-m MIN] [-t TIMES] [--tis TIS] [--tts TTS] [--scale {zscore,minmax}] [--stop]\n\nThis script is used to draw the codon decoding time plot.\n\noptions:\n -h, --help show this help message and exit\n -s {E,P,A} set the E/P/A-site for codon decoding time calculation. (default: P).\n -f {0,1,2,all} set the reading frame for codon decoding time calculation. (default: all).\n -m MIN retain transcript with more than minimum RPFs. (default: 30).\n -t TIMES Specify the number of iteration times required for computation. (default: 10).\n --tis TIS The number of codons after TIS will be discarded. (default: 0 AA).\n --tts TTS The number of codons before TTS will be discarded. (default: 0 AA).\n --scale {zscore,minmax}\n normalize the codon selection time. (default: minmax).\n --stop rmove the stop codon. (default: False).\n\nRequired arguments:\n --rpf RPF the name of input RPFs file in TXT format.\n --rna RNA the name of input reads file in TXT format.\n -l LIST the gene name list in TXT format. (default: whole).\n -o OUTPUT the prefix of output file.\n```\n\n2. Calculate codon-level selection time in Ribo-seq data\n\n```bash\n$ cd ./14.codon_selection_time/\n\n#################################################\n# calculate the codon selection time of E/P/A site\nfor sites in E P A\ndo\nrpf_CST \\\n -l ../../../1.reference/norm/gene.norm.txt \\\n --rna ../../../3.rna-seq/5.riboparser/05.merge/RNA_merged.txt \\\n --rpf ../05.merge/RIBO_merged.txt \\\n --stop \\\n -m 50 \\\n -f 0 \\\n -s $sites \\\n --tis 10 \\\n --tts 5 \\\n -o \"$sites\"_site &> \"$sites\"_site.log\n\ndone\n\ncd ..\n```\n\n3. Results of `rpf_CST`\n\n```bash\nA_site_codon_selection_time.txt # Codon selection time for all codon at the A-site in the gene CDS region\nA_site_cst_corrplot.pdf # Correlation heatmap of codon selection time at the A-site\nA_site_cst_corrplot.png # Correlation heatmap of codon selection time at the A-site\nA_site_cst_corr.txt # Correlation of codon selection time at the A-site\nA_site_cst_heatplot.pdf # Heatmap of codon selection time at the A-site\nA_site_cst_heatplot.png # Heatmap of codon selection time at the A-site\nA_site_iterative_codon_selection_time.txt # Codon selection time for all codon at the A-site in the gene CDS region\nA_site.log # Log file of program execution\n```\n\n\n### 5.15 Calculate Coefficient of Variation\n\nWhile codon-level analysis can indicates alterations in translation elongation, gene-level validation is complicated by variable ribosome profiling coverage due to gene expression dynamics.\nLow-coverage genes may artifactually appear to exhibit more translational pausing due to noise inversely scaling with coverage. \n\nTo resolve this, Peter et. al developed a gene-level analytical method that explicitly models noise dependence on coverage. They fit the following two-parameter model to the data to accommodate a variety of statistical behaviors for counting noise:\n\n$$log_2(CV) = \\frac{1}{2} log_2 (\\frac{\u03b2}{\u03bc} + \u03b1)$$\n\nwhere CV is the coefficient of variation in the ribosome profile of a given gene, \n\u03bc is mean coverage (RPF reads percodon), and \u03b1 and \u03b2 are fitting parameters. \nImportantly, when \u03b1 = 0 and \u03b2 = 1, Equation results from a Poisson distribution, \nwhereas \u03b1 > 0 and \u03b2 = 1 indicates a negative binomial distribution.\n\n1. Explanation of `rpf_CoV`\n\n```bash\n$ rpf_CoV -h\n\nCalculate CoV in the CDS region.\n\nStep1: Checking the input Arguments.\nusage: rpf_CoV [-h] -r RPF [-g GROUP] [-l LIST] -o OUTPUT [-f {0,1,2,all}] [-m MIN] [-n] [--tis TIS] [--tts TTS] [--fig]\n\nThis script is used to calculate CoV in the CDS region.\n\noptions:\n -h, --help show this help message and exit\n -f {0,1,2,all} set the reading frame for occupancy calculation. (default: all).\n -m MIN retain transcript with more than minimum RPFs. (default: 5).\n -n normalize the RPFs count to RPM. (default: False).\n --tis TIS The number of codons after TIS will be discarded. (default: 15).\n --tts TTS The number of codons before TES will be discarded. (default: 5).\n --fig show the figure. (default: False).\n\nRequired arguments:\n -r RPF the name of input RPFs file in TXT format.\n -g GROUP specify the list of sample group. (default: None)\n -l LIST the gene name list in TXT format. (default: whole).\n -o OUTPUT the prefix of output file. (prefix + _Cov.txt)\n\nThe group file needs to contain at least two column:\n+----------+---------+\n| name | group |\n+==========+=========+\n| wt1 | wt |\n| wt2 | wt |\n| treat1 | treat |\n| treat2 | treat |\n| ko1 | ko |\n| ko2 | ko |\n+----------+---------+\n```\n\n2. Calculate gene coefficient of variation in Ribo-seq data\n\n```bash\n$ cd ./15.coefficient_of_variation/\n\n#################################################\n# Here we can configure the design file to calculate differences between different groups.\n$ cat design.txt\nName\tGroup\nWT_ribo_YPD1\tWT_ribo_YPD\nWT_ribo_YPD2\tWT_ribo_YPD\nWT_ribo_YPD3\tWT_ribo_YPD\nncs2d_ribo_YPD1\tncs2d_ribo_YPD\nncs2d_ribo_YPD2\tncs2d_ribo_YPD\nncs2d_ribo_YPD3\tncs2d_ribo_YPD\nelp6d_ribo_YPD1\telp6d_ribo_YPD\nelp6d_ribo_YPD2\telp6d_ribo_YPD\nelp6d_ribo_YPD3\telp6d_ribo_YPD\nncs2d_elp6d_ribo_YPD1\tncs2d_elp6d_ribo_YPD\nncs2d_elp6d_ribo_YPD2\tncs2d_elp6d_ribo_YPD\nncs2d_elp6d_ribo_YPD3\tncs2d_elp6d_ribo_YPD\n\n#################################################\n# calculate the coefficient of variation\nrpf_CoV \\\n -l ../../../1.reference/norm/gene.norm.txt \\\n -r ../05.merge/RIBO_merged.txt \\\n -f 0 \\\n -m 30 \\\n --tis 10 \\\n --tts 5 \\\n --fig \\\n -g design.txt \\\n -o RIBO &> RIBO.log\n\ncd ..\n```\n\n3. Results of `rpf_CoV`\n\n```bash\ngene_compared_CoV.txt # Compared gene coefficient of variation\ngene_CoV.txt # Gene coefficient of variation\ngene.log # Log file of program execution\ngene_WT_ribo_YPD_vs_ncs2d_ribo_YPD_CoV_fitplot.pdf # Fitted lineplot of gene coefficient of variation\ngene_WT_ribo_YPD_vs_ncs2d_ribo_YPD_CoV_fitplot.png # Fitted lineplot of gene coefficient of variation\n```\n\n\n### 5.16 Meta-codon analysis\n\nTo demonstrate the pausing patterns of different codons, a meta plot is utilized here to display the RPF density within a 20-nucleotide window surrounding the specified codon. \nThe program also supports the visualization of different frames and incorporates a smoothing function to mitigate spike signals caused by data instability.\n\n1. Explanation of `rpf_Meta_Codon`\n\n```bash\n$ rpf_Meta_Codon -h\n\nDraw the meta-codon plot.\n\nStep1: Checking the input Arguments.\nusage: rpf_Meta_Codon [-h] [-l LIST] -r RPF [-c CODON] -o OUTPUT [-f {0,1,2}] [-a AROUND] [-m MIN] [--tis TIS] [--tts TTS] [-n] [-u] [-s] [--smooth SMOOTH] [--thread THREAD] [--fig]\n\nThis script is used to draw the meta codon plot.\n\noptions:\n -h, --help show this help message and exit\n -f {0,1,2} set the reading frame for occupancy calculation. (default: all).\n -a AROUND retrieve length of codon upstream and downstream. (default: 20).\n -m MIN retain transcript with more than minimum RPFs. (default: 50).\n --tis TIS The number of codons after TIS will be discarded.. (default: 0 AA).\n --tts TTS The number of codons before TTS will be discarded.. (default: 0 AA).\n -n normalize the RPFs count to RPM. (default: False).\n -u delete the cross repetition codon in different window. (default: False).\n -s scale the window density with gene density. (default: False).\n --smooth SMOOTH smooth the window density [eg, 3,1]. (default: None).\n --thread THREAD the number of threads (default: 1).\n --fig output the figure. (default: False).\n\nRequired arguments:\n -l LIST the gene name list in TXT format. (default: whole).\n -r RPF the name of input RPFs file in TXT format.\n -c CODON the codon list in TXT format.\n -o OUTPUT the prefix of output file.\n```\n\n2. Calculate meta-codon density in Ribo-seq data\n\n```bash\n$ cd ./16.meta_codon/\n\n#################################################\n# Here we can configure the codon list.\n$ cat codon_list.txt\nAAA\nAAC\nAAG\nAAT\nAAGAAG\nATGATG\nCCCGGG\n...\n\n#################################################\n# codon meta analysis\nrpf_Meta_Codon \\\n -r ../05.merge/RIBO_merged.txt \\\n -m 50 -f 0 \\\n -c codon_list.txt \\\n -a 15 -u -n \\\n -o RIBO &> RIBO.log\n\ncd ..\n```\n\n3. Results of `rpf_Meta_Codon`\n\n```bash\nRIBO.log # Log file of program execution\nRIBO_AAA_97591_8146_meta_density.txt # RPFs density of AAA codon\nRIBO_AAA_97591_8146_meta_sequence.txt # Upstream and downstream sequence around AAA codon\nRIBO_AAA.pdf # Metaplot of AAA codon\nRIBO_AAA.png # Metaplot of AAA codon\n```\n\n\n## 6. Other toolkits\n### 6.1 Data shuffling\n\nSome analysis processes require randomly assigned data for control,\n so a step is added here to reshuffling the RPFs density file.\n\n1. Explanation of `rpf_Shuffle`\n\n```bash\n$ rpf_Shuffle -h\n\nShuffle the RPFs data.\n\nStep1: Checking the input Arguments.\n\nusage: rpf_Shuffle [-h] -r RPF -o OUTPUT [-l LIST] [-s SEED] [-i]\n\nThis script is used to shuffle the RPFs data.\n\noptions:\n -h, --help show this help message and exit\n -l LIST the list of input genes for transcript id.\n -s SEED the random seed for shuffle. (default: 0).\n -i shuffle the RPFs data for each samples. (default: False).\n\nRequired arguments:\n -r RPF the name of input RPFs density file in TXT format.\n -o OUTPUT the prefix of output file. (prefix + _shuffle.txt)\n```\n\n2. Shuffle gene density values in Ribo-seq data.\n\n```bash\n$ cd\n$ cd ./sce/4.ribo-seq/5.riboparser/17.shuffle/\n\n#################################################\n# codon meta analysis\nrpf_Shuffle \\\n -l ../../../1.reference/norm/gene.norm.txt \\\n -r ../05.merge/RIBO_merged.txt \\\n -s 0 \\\n -i \\\n -o RIBO &> RIBO.log\n```\n\n3. Shuffle gene density values in RNA-seq data\n\n```bash\n$ cd\n$ cd ./sce/3.rna-seq/5.riboparser/10.shuffle/\n\n#################################################\n# retrieve and format the gene density\nrpf_Shuffle \\\n -l ../../../1.reference/norm/gene.norm.txt \\\n -r ../05.merge/RNA_merged.txt \\\n -s 0 \\\n -i \\\n -o RNA &> RNA.log\n```\n\n4. Results of `rpf_Shuffle`\n\n```bash\n# results of ribo-seq\nRIBO.log # Log file of program execution\nRIBO_shuffle.txt # Shuffled RPFs density file\n```\n\n\n### 6.2 Retrieve and format the gene density\n\nIn many cases, it is necessary to perform some additional operations on the gene set in the RPFs density file, \nsuch as filtering, RPM standardization, long and width data format conversion, etc. \nA specialized tool is provided here for these operations.\n\n1. Explanation of `rpf_Retrieve`\n\n```bash\n$ rpf_Retrieve -h\n\nRetrieve the RPFs with gene list.\n\nStep1: Checking the input Arguments.\n\nusage: rpf_Retrieve [-h] -r RPF [-o OUTPUT] [-l LIST] [-m MIN] [-n] [-f] [-s]\n\nThis script is used to retrieve density files.\n\noptions:\n -h, --help show this help message and exit\n -l LIST the list of input genes for transcript id.\n -m MIN retain transcript with more than minimum RPFs (default: 0).\n -n normalize the RPFs count to RPM (default: False).\n -f melt three column data of each sample to one column (default: False).\n -s split gene rpf to each TXT file (default: False).\n\nRequired arguments:\n -r RPF the name of input RPFs density file in TXT format.\n -o OUTPUT prefix of output file name (default: filename + '_retrieve.txt'.\n```\n\n2. Extract and format gene density from Ribo-seq data\n\n```bash\n$ cd\n$ cd ./sce/4.ribo-seq/5.riboparser/18.retrieve/\n\n#################################################\n# retrieve and format the gene density with gene covered more than 50 reads\nrpf_Retrieve \\\n -l ../../../1.reference/norm/gene.norm.txt \\\n -r ../05.merge/RIBO_merged.txt \\\n -m 0 \\\n -f \\\n -n \\\n -o RIBO &> RIBO.log\n\ncd ..\n```\n\n3. Extract and format gene density from RNA-seq data\n\n```bash\n$ cd\n$ cd ./sce/3.rna-seq/5.riboparser/11.retrieve/\n\n#################################################\n# retrieve and format the gene density with gene covered more than 50 reads\nrpf_Retrieve \\\n -l ../../../1.reference/norm/gene.norm.txt \\\n -r ../05.merge/RNA_merged.txt \\\n -m 50 \\\n -f \\\n -n \\\n -o RNA &> RNA.log\n\ncd ..\n```\n\n4. Results of `rpf_Retrieve`\n\n```bash\n# results of ribo-seq\nRIBO.log\nRIBO_retrieve.txt\n```\n\n### 6.3 Filter the frame shifting genes\n\nA frameshift in translation occurs when the ribosome shifts by one or more nucleotides in the mRNA sequence, causing a misreading of the codons.\nThis results in a completely altered amino acid sequence downstream of the shift,\noften leading to premature termination or a nonfunctional protein.\nFrameshifts can occur naturally due to mutations or during translation errors, and they can significantly impact protein function.\nWhenever a stable frameshift occurs, we can detect ribosome occupancy in different reading frames from the Ribo-seq data.\n\n1. Explanation of `rpf_Shift`\n\n```bash\n$ rpf_Shift -h\n\nDraw the frame shifting plot.\n\nStep1: Checking the input Arguments.\n\nusage: rpf_Shift.py [-h] -r RPF -o OUTPUT [-t TRANSCRIPT] [-p PERIOD] [-m MIN] [--tis TIS] [--tts TTS]\n\nThis script is used to draw the frame shift plot.\n\noptions:\n -h, --help show this help message and exit\n -p PERIOD the minimum in-frame value for frame shifting screen, range [0 - 1]. (default: 0.45).\n -m MIN retain transcript with more than minimum RPFs. (default: 50).\n --tis TIS the number of codons after TIS will be discarded.. (default: 0 AA).\n --tts TTS the number of codons before TTS will be discarded.. (default: 0 AA).\n\nRequired arguments:\n -r RPF the name of input RPFs file in TXT format.\n -o OUTPUT the prefix of output file.\n -t TRANSCRIPT the name of input transcript filein TXT format.\n\n```\n\n2. Filter the frame shifting genes\n\n```bash\n$ cd\n$ cd ./sce/4.ribo-seq/5.riboparser/19.frame_shift/\n\n#################################################\n# filter the frame shifting genes\nrpf_Shift \\\n -t ../../../1.reference/norm/gene.norm.txt \\\n -r ../05.merge/RIBO_merged.txt \\\n --tis 5 --tts 5 \\\n -m 50 \\\n -p 45 \\\n -f \\\n -n \\\n -o RIBO &> RIBO.log\n\ncd ..\n```\n\n4. Results of `rpf_Shift`\n\n```bash\n# results of ribo-seq\nRIBO.log\nRIBO_gene_frame_shift_count_plot.pdf\nRIBO_gene_frame_shift_count_plot.png\nRIBO_gene_frame_shift_count.txt\nRIBO_gene_periodicity.txt\nRIBO_SRR1944912_gene_frame_shift.txt\n\n```\n\n## 7. Contribution\n\nThanks for all the open source tools used in the process.\n\nThanks to Nedialkova DD and Leidel SA for providing the excellent dataset.\n\nContribute to our open-source project by submitting questions and code.\n\nContact `rensc0718@163.com` for more information.\n\n## 8. License\n\nGPL License.\n",
"bugtrack_url": null,
"license": "GPL",
"summary": "A pipeline for ribosome profiling data analysis",
"version": "0.1.7.4",
"project_urls": {
"Homepage": "https://github.com/renscq/RiboParser"
},
"split_keywords": [],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "97d2368592e1a7da48687bee0d2214c457c2ee4af453d7f38b5f7a668c88881f",
"md5": "5367800914b837922c310fc41bc6c0ce",
"sha256": "91c98720c38a949b175bc218b791377aebcd23cc39c6136fce24d3b806973cbd"
},
"downloads": -1,
"filename": "RiboParser-0.1.7.4-py3-none-any.whl",
"has_sig": false,
"md5_digest": "5367800914b837922c310fc41bc6c0ce",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.12",
"size": 223843,
"upload_time": "2025-02-19T13:42:06",
"upload_time_iso_8601": "2025-02-19T13:42:06.717418Z",
"url": "https://files.pythonhosted.org/packages/97/d2/368592e1a7da48687bee0d2214c457c2ee4af453d7f38b5f7a668c88881f/RiboParser-0.1.7.4-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "6b6a38b3429c02d9b2cb744c55ab7371dddd0831e3a5f282a153a870a884e807",
"md5": "7155c6e33df5d84a88349b62bcb69bdc",
"sha256": "5b9a39b055c7fb5c8aa3e1ae0ff2452d5c8895e2c5424dd36441d4c98c301cbe"
},
"downloads": -1,
"filename": "riboparser-0.1.7.4.tar.gz",
"has_sig": false,
"md5_digest": "7155c6e33df5d84a88349b62bcb69bdc",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.12",
"size": 189778,
"upload_time": "2025-02-19T13:42:08",
"upload_time_iso_8601": "2025-02-19T13:42:08.878316Z",
"url": "https://files.pythonhosted.org/packages/6b/6a/38b3429c02d9b2cb744c55ab7371dddd0831e3a5f282a153a870a884e807/riboparser-0.1.7.4.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2025-02-19 13:42:08",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "renscq",
"github_project": "RiboParser",
"travis_ci": false,
"coveralls": false,
"github_actions": false,
"requirements": [
{
"name": "numpy",
"specs": [
[
"~=",
"1.26.4"
]
]
},
{
"name": "pandas",
"specs": [
[
"~=",
"2.2.2"
]
]
},
{
"name": "pyarrow",
"specs": [
[
"~=",
"16.1.0"
]
]
},
{
"name": "polars",
"specs": [
[
"~=",
"0.20.31"
]
]
},
{
"name": "interval",
"specs": [
[
"~=",
"1.0.0"
]
]
},
{
"name": "joblib",
"specs": [
[
"~=",
"1.4.2"
]
]
},
{
"name": "biopython",
"specs": [
[
"~=",
"1.78"
]
]
},
{
"name": "pysam",
"specs": [
[
"~=",
"0.22.1"
]
]
},
{
"name": "scipy",
"specs": [
[
"~=",
"1.12.0"
]
]
},
{
"name": "scikit-learn",
"specs": [
[
"~=",
"1.4.2"
]
]
},
{
"name": "statsmodels",
"specs": [
[
"~=",
"0.14.2"
]
]
},
{
"name": "matplotlib",
"specs": [
[
"~=",
"3.8.4"
]
]
},
{
"name": "matplotlib-venn",
"specs": [
[
"~=",
"1.1.1"
]
]
},
{
"name": "seaborn",
"specs": [
[
"~=",
"0.13.2"
]
]
},
{
"name": "plotly",
"specs": [
[
"~=",
"5.22.0"
]
]
},
{
"name": "seqlogo",
"specs": [
[
"~=",
"5.29.9"
]
]
},
{
"name": "kaleido",
"specs": [
[
"~=",
"0.2.1"
]
]
}
],
"lcname": "riboparser"
}