# ![image](https://github.com/yangyn533/3UTRBERT/blob/main/3UTRBERT.png)
## ⌛️ Data availability
[3UTRBERT_dataset](https://figshare.com/articles/dataset/3UTRBERT_dataset_availability/22845644)
## ⌛️ Download pre-trained 3UTRBERT model
[3UTRBERT-3mer](https://figshare.com/articles/software/Pre-trained_3mer_model/22847354)
[3UTRBERT-4mer](https://figshare.com/articles/software/Pre-trained_4mer_model/22851119)
[3UTRBERT-5mer](https://figshare.com/articles/software/Pre-trained_5mer_model/22851191)
[3UTRBERT-6mer](https://figshare.com/articles/software/Pre-trained_6mer_model/22851272)
## 📘 Environment Setup
#### 1.1 Create and activate a new virtual environment
```
conda create -n 3UTRBERT python=3.6.13
conda activate 3UTRBERT
```
#### 1.2 Install the package and other requirements
```
conda install pytorch==1.10.1 torchvision==0.11.2 torchaudio==0.10.1 cudatoolkit=10.2 -c pytorch
git clone https://github.com/yangyn533/3UTRBERT
cd 3UTRBERT
python3 -m pip install --editable .
python3 -m pip install -r requirements.txt
```
If the above commands do not run correctly, the following commands can be used to install the missing packages manually. Run them only **after running the above commands.**
```
pip install seaborn
pip install transformers
pip install pyfaidx
pip install python-decouple
pip install sacremoses
pip install boto3
pip install sentencepiece
pip install Bio
pip install pyahocorasick
```
## ⌛️ Process data
The input file must be in `.fasta` format, with the label of each sequence included in its sequence ID (an example file can be found in the example_data folder).
Running the following code splits the input fasta file into train, dev and test sets and tokenizes each sequence into 3-mer tokens. Example data is located in the example/data folder. `train.tsv` is used for training, `dev.tsv` for validation and `test.tsv` for testing performance.
```
python preprocess.py \
--data_dir <PATH_TO_YOUR_DATA> \
--output_dir <PATH_TO_YOUR_OUTPUT_DIRECTORY> \
--kmer 3
```
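As a rough illustration of what 3-mer tokenization means, the sketch below splits a sequence into overlapping k-mers with a stride of 1. This is an assumption about the tokenization scheme, not a copy of `preprocess.py`; the function name `seq_to_kmers` is hypothetical.

```python
def seq_to_kmers(sequence: str, k: int = 3) -> str:
    """Split a sequence into overlapping k-mers (stride 1), space-separated.

    Assumption: overlapping k-mers, as commonly used by k-mer BERT models.
    This is an illustrative helper, not part of the 3UTRBERT package.
    """
    sequence = sequence.strip().upper()
    if k < 1 or len(sequence) < k:
        raise ValueError("sequence must be at least k characters long")
    # A sequence of length n yields n - k + 1 overlapping k-mers.
    return " ".join(sequence[i:i + k] for i in range(len(sequence) - k + 1))


print(seq_to_kmers("AUGGCU", k=3))  # AUG UGG GGC GCU
```

A sequence of length n produces n - k + 1 tokens under this scheme, which is worth keeping in mind when choosing `--max_seq_length` later.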
## ⌛️ Train
`train.py` is used to fine-tune the model. The input data are `train.tsv` and `dev.tsv`. Make sure `train.tsv` and `dev.tsv` are in the same directory, and pass the path to this directory as the `--data_dir` argument (do not include the file name itself). `--model_name_or_path` must be the path to your pre-trained model. `--output_dir` is the location where the fine-tuned model is stored.
```
python train.py \
--data_dir <PATH_TO_YOUR_DATA> \
--output_dir <PATH_TO_YOUR_OUTPUT_DIRECTORY> \
--model_type 3utrprom \
--tokenizer_name rna3 \
--model_name_or_path <PATH_TO_YOUR_MODEL> \
--do_train \
--per_gpu_train_batch_size 32 \
--per_gpu_eval_batch_size 32 \
--learning_rate 5e-5 \
--logging_steps 100 \
--save_steps 1000 \
--num_train_epochs 3 \
--evaluate_during_training \
--max_seq_length 100 \
--warmup_percent 0.1 \
--hidden_dropout_prob 0.1 \
--overwrite_output \
--weight_decay 0.01 \
--seed 6
```
Please change the tokenizer name { rna3, rna4, rna5, rna6 } when changing the kmer choice.
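One way to keep the tokenizer name in sync with the k-mer size is to derive it in a small wrapper. The sketch below is a hypothetical helper (the names `tokenizer_name_for` and `build_train_command` are not part of the repository); it only uses flags shown in the command above and leaves out the remaining hyperparameters for brevity.

```python
SUPPORTED_KMERS = (3, 4, 5, 6)  # tokenizer names listed above: rna3 .. rna6


def tokenizer_name_for(kmer: int) -> str:
    """Return the tokenizer name that matches a k-mer size."""
    if kmer not in SUPPORTED_KMERS:
        raise ValueError(f"kmer must be one of {SUPPORTED_KMERS}, got {kmer}")
    return f"rna{kmer}"


def build_train_command(kmer: int, data_dir: str, model_path: str, output_dir: str) -> list:
    """Assemble a train.py argument list (hyperparameter flags omitted here)."""
    return [
        "python", "train.py",
        "--data_dir", data_dir,
        "--output_dir", output_dir,
        "--model_type", "3utrprom",
        "--tokenizer_name", tokenizer_name_for(kmer),
        "--model_name_or_path", model_path,
        "--do_train",
    ]


print(" ".join(build_train_command(4, "data/", "models/4mer", "out/")))
```

The resulting list can be passed to `subprocess.run` once the remaining flags from the command above are appended.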
## ⌛️ Predict
`predict.py` produces prediction results from the fine-tuned model. The input data is `test.tsv`. Make sure `train.tsv`, `dev.tsv` and `test.tsv` are in the same directory, and pass the path to this directory as the `--data_dir` argument (do not include the file name itself). `--model_name_or_path` must be the path to your fine-tuned model. The main output files of `predict.py` are `pred_results.npy` and `pred_results_scores.npy`. `pred_results.npy` stores the predicted probability for each sequence. `pred_results_scores.npy` stores the metrics used to evaluate the model.
```
python predict.py \
--data_dir <PATH_TO_YOUR_DATA> \
--output_dir <PATH_TO_YOUR_OUTPUT_DIRECTORY> \
--do_predict \
--tokenizer_name rna3 \
--model_type 3utrprom \
--model_name_or_path <PATH_TO_YOUR_MODEL> \
--max_seq_length 100 \
--per_gpu_eval_batch_size 32
```
Please change the tokenizer name { rna3, rna4, rna5, rna6 } when changing the kmer choice.
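The predicted probabilities can be inspected with NumPy. The sketch below assumes that `pred_results.npy` holds a plain numeric array with one probability per sequence; the helper name `load_predicted_labels` and the 0.5 threshold are illustrative choices, not part of `predict.py`.

```python
import numpy as np


def load_predicted_labels(pred_file: str, threshold: float = 0.5):
    """Load per-sequence probabilities and convert them to 0/1 labels.

    Assumption: the file stores a numeric array with one probability
    per sequence. The threshold is an illustrative default.
    """
    probs = np.asarray(np.load(pred_file), dtype=float).ravel()
    labels = (probs >= threshold).astype(int)
    return probs, labels
```

For example, `probs, labels = load_predicted_labels("<PATH_TO_YOUR_OUTPUT_DIRECTORY>/pred_results.npy")` would return the raw probabilities alongside thresholded labels, in the same order as the sequences in `test.tsv` if the file preserves that order.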
## 📊 Single resolution importance analysis
The following code extracts the attention scores and visualizes them.
```
python single_resolution_importance.py \
--kmer 3 \
--model_path <PATH_TO_YOUR_MODEL> \
--start_layer 11 \
--end_layer 11 \
--metric mean \
--sequence <SEQUENCE_USED> \
--save_path <PATH_TO_YOUR_OUTPUT_DIRECTORY>
```
Please make sure that the input sequence does not exceed the max-length limit.
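A quick length check before running the script might look like the sketch below. It assumes overlapping k-mer tokens and two special tokens (`[CLS]` and `[SEP]`, as in standard BERT inputs); the actual limit depends on the model, so `max_tokens` is left as a parameter you must supply.

```python
def token_count(sequence: str, kmer: int, special_tokens: int = 2) -> int:
    """Estimate the number of input tokens for a sequence.

    Assumptions: overlapping k-mers (len - k + 1 tokens) plus two
    special tokens. Both are illustrative, not read from the model.
    """
    n_kmers = max(len(sequence) - kmer + 1, 0)
    return n_kmers + special_tokens


def fits_max_length(sequence: str, kmer: int, max_tokens: int) -> bool:
    """Return True if the tokenized sequence fits within max_tokens."""
    return token_count(sequence, kmer) <= max_tokens


seq = "A" * 100
print(token_count(seq, kmer=3))  # 100 (98 k-mers + 2 special tokens)
```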
## 📊 Mutation analysis
Before running the shell script, make sure the parameters in the shell script are set.
```
source mutation_heatmap.sh
```
The following commands come from `mutation_heatmap.sh`.
`KMER` indicates the k-mer size used.
`ORIGINAL_SEQ_PATH` should be the path to the folder where your sequence file is located (do not include the file name itself).
`MUTATE_SEQ_PATH` should be the folder where you want to store the mutated sequence file (do not include the file name itself).
`WT_SEQ` should be the same sequence as in the original sequence file.
Please store the original sequence in a `.tsv` file called `test.tsv`.
An example `test.tsv` file is in the example `mutation_analysis/original_seq` folder. Only one sequence is allowed in the file. The label for the sequence can be 0 or 1.
```
export KMER=3
export MODEL_PATH=<PATH_TO_YOUR_MODEL>
export ORIGINAL_SEQ_PATH=<PATH_TO_YOUR_ORIGINAL_SEQUENCE_FILE>
export MUTATE_SEQ_PATH=<PATH_TO_YOUR_MUTATED_SEQUENCE_FILE>
export PREDICTION_PATH=<PATH_TO_STORE_PREDICTION>
export WT_SEQ=<THE_SEQUENCE_USED_FOR_MUTATION>
export OUTPUT_PATH=<PATH_TO_YOUR_OUTPUT_DIRECTORY>
# mutate sequence
python mutate.py --seq_file $ORIGINAL_SEQ_PATH/test.tsv --save_file_dir $MUTATE_SEQ_PATH --k $KMER
# predict on sequence
mkdir $PREDICTION_PATH/original_pred
mkdir $PREDICTION_PATH/mutate_pred
python predict.py \
--data_dir $ORIGINAL_SEQ_PATH \
--output_dir $PREDICTION_PATH/original_pred \
--do_predict \
--tokenizer_name rna3 \
--model_type 3utrprom \
--model_name_or_path $MODEL_PATH \
--max_seq_length 100 \
--per_gpu_eval_batch_size 32
python predict.py \
--data_dir $MUTATE_SEQ_PATH \
--output_dir $PREDICTION_PATH/mutate_pred \
--do_predict \
--tokenizer_name rna3 \
--model_type 3utrprom \
--model_name_or_path $MODEL_PATH \
--max_seq_length 100 \
--per_gpu_eval_batch_size 32
# calculate scores
python calculate_diff_scores.py \
--orig_seq_file $ORIGINAL_SEQ_PATH/test.tsv \
--orig_pred_file $PREDICTION_PATH/original_pred/pred_results.npy \
--mut_seq_file $MUTATE_SEQ_PATH/test.tsv \
--mut_pred_file $PREDICTION_PATH/mutate_pred/pred_results.npy \
--save_file_dir $OUTPUT_PATH
# draw heatmap
python heatmap.py \
--score_file $OUTPUT_PATH \
--save_file_dir $OUTPUT_PATH \
--wt_seq $WT_SEQ
```
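Conceptually, the pipeline above mutates the wild-type sequence, predicts on both versions, and compares the scores. The sketch below illustrates that idea with single-base substitutions and a simple probability difference. It is an assumption about the general approach, not a reproduction of `mutate.py` or `calculate_diff_scores.py`; the function names and the `ACGU` alphabet are hypothetical.

```python
ALPHABET = "ACGU"  # assumption: RNA alphabet; use "ACGT" for DNA-style input


def point_mutants(wt_seq: str, alphabet: str = ALPHABET):
    """Yield (position, new_base, mutated_sequence) for every single-base substitution."""
    for pos, ref in enumerate(wt_seq):
        for alt in alphabet:
            if alt != ref:
                yield pos, alt, wt_seq[:pos] + alt + wt_seq[pos + 1:]


def diff_scores(wt_prob: float, mutant_probs):
    """Score each mutant as (mutant probability - wild-type probability)."""
    return [p - wt_prob for p in mutant_probs]


mutants = list(point_mutants("ACG"))
print(len(mutants))  # 9 (3 positions x 3 alternative bases)
print(mutants[0])    # (0, 'C', 'CCG')
```

A negative score under this convention means the substitution lowers the predicted probability relative to the wild type, which is the kind of signal a mutation heatmap displays per position and base.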
## 📊 Feature extraction
`<PATH_TO_DATA>` is the path to the folder that contains the data (do not include the data file name). The input fasta file should be named `seq_to_extract.fasta`. Example data can be found in the example folder.
```
python extract_LS_embedding.py \
--data_path <PATH_TO_DATA> \
--output_path <PATH_TO_YOUR_OUTPUT_DIRECTORY> \
--model_path <PATH_TO_YOUR_MODEL>
```
## 🧬 Motif analysis
The motif analysis requires attention scores as input. The required attention scores can be obtained from `single_resolution_importance.py`. Store them in the directory passed as `--predict_dir`.
```
python find_motifs.py \
--data_dir <PATH_TO_YOUR_DATA> \
--predict_dir <PATH_TO_YOUR_PREDICTION_OUTPUT_DIRECTORY> \
--window_size <ADJUST_THIS> \
--min_len <ADJUST_THIS> \
--pval_cutoff <ADJUST_THIS> \
--min_n_motif <ADJUST_THIS> \
--align_all_ties \
--save_file_dir <PATH_TO_YOUR_OUTPUT_DIRECTORY> \
--verbose
```
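To give a feel for how attention scores can point to candidate motifs, the sketch below extracts contiguous regions whose attention exceeds the sequence mean and that span at least `min_len` positions. The mean cutoff is an illustrative assumption and may differ from what `find_motifs.py` actually does; the function name `high_attention_regions` is hypothetical.

```python
def high_attention_regions(scores, min_len: int = 4):
    """Return (start, end) index pairs of contiguous high-attention regions.

    A position counts as "high" when its score is above the mean of all
    scores (an assumed cutoff). Regions shorter than min_len are dropped.
    The end index is exclusive.
    """
    if not scores:
        return []
    cutoff = sum(scores) / len(scores)
    regions = []
    start = None
    for i, score in enumerate(scores):
        if score > cutoff:
            if start is None:
                start = i  # open a new region
        else:
            if start is not None and i - start >= min_len:
                regions.append((start, i))
            start = None  # close the current region, if any
    # Handle a region that runs to the end of the sequence.
    if start is not None and len(scores) - start >= min_len:
        regions.append((start, len(scores)))
    return regions


scores = [0.1, 0.1, 0.9, 0.8, 0.9, 0.7, 0.1, 0.1, 0.9, 0.1]
print(high_attention_regions(scores, min_len=4))  # [(2, 6)]
```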
Raw data
{
"_id": null,
"home_page": "https://github.com/yangyn533/3UTRBERT",
"name": "UTRBERT",
"maintainer": "",
"docs_url": null,
"requires_python": ">=3.5.0",
"maintainer_email": "",
"keywords": "NLP deep learning transformer pytorch tensorflow RNA 3utr bert",
"author": "Yuning Yang, Gen Li, Kuan Pang, Wuxinhao Cao, Xiangtao Li, and Zhaolei Zhang",
"author_email": "yyn.yang@mail.utoronto.ca",
"download_url": "https://files.pythonhosted.org/packages/29/96/02a5084f227c57348a4d5be36c9dd4c6e64a97dc5c4bac353c73ac649792/UTRBERT-1.0.1.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "MIT",
"summary": "Deciphering 3'UTR mediated gene regulation using interpretable deep representation learning",
"version": "1.0.1",
"project_urls": {
"Homepage": "https://github.com/yangyn533/3UTRBERT"
},
"split_keywords": [
"nlp",
"deep",
"learning",
"transformer",
"pytorch",
"tensorflow",
"rna",
"3utr",
"bert"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "299602a5084f227c57348a4d5be36c9dd4c6e64a97dc5c4bac353c73ac649792",
"md5": "d7a1a5d8e329a4339cb2b3f401e92147",
"sha256": "1d9a8229a2cc21107c84b3b9873530d803873447e91ed8512e8d86cd702e8d31"
},
"downloads": -1,
"filename": "UTRBERT-1.0.1.tar.gz",
"has_sig": false,
"md5_digest": "d7a1a5d8e329a4339cb2b3f401e92147",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.5.0",
"size": 416662,
"upload_time": "2023-08-29T14:15:08",
"upload_time_iso_8601": "2023-08-29T14:15:08.489118Z",
"url": "https://files.pythonhosted.org/packages/29/96/02a5084f227c57348a4d5be36c9dd4c6e64a97dc5c4bac353c73ac649792/UTRBERT-1.0.1.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2023-08-29 14:15:08",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "yangyn533",
"github_project": "3UTRBERT",
"travis_ci": false,
"coveralls": false,
"github_actions": false,
"requirements": [],
"lcname": "utrbert"
}