# 🧬 lncrna-PI - LncRNA–Protein Interaction Prediction
lncrnaPI is a command-line tool for predicting **lncRNA–Protein interactions** using **pre-trained language models (DNABERT-2 and ESM-2)** for sequence embedding and a **CatBoost classifier** for interaction probability estimation.
It supports two modes:
- **Rapid**: based on sequence composition features (fast, lightweight)
- **LLM**: based on transformer embeddings (DNABERT2 + ESM2)
The script performs **all-by-all predictions** between every lncRNA and every protein sequence in the provided FASTA files.
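
The all-by-all pairing can be pictured as a Cartesian product of the two sequence sets, so the number of output rows is the number of lncRNAs times the number of proteins. The sketch below is only an illustration of that idea, not the tool's own code; `all_by_all` and the toy IDs and sequences are hypothetical.

```python
from itertools import product

def all_by_all(lncrnas, proteins):
    """Yield one (lncRNA_ID, protein_ID, lncRNA_seq, protein_seq) tuple per pair."""
    for (l_id, l_seq), (p_id, p_seq) in product(lncrnas.items(), proteins.items()):
        yield l_id, p_id, l_seq, p_seq

# Toy inputs (hypothetical IDs and sequences)
lncrnas = {"lnc1": "ACGUACGU", "lnc2": "GGCCAUAU"}
proteins = {"P12345": "MKTAYIAK", "P67890": "MSLLTEVE", "Q00001": "MAAAKK"}

pairs = list(all_by_all(lncrnas, proteins))
print(len(pairs))  # 2 lncRNAs x 3 proteins = 6 pairs
```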
---
## 📦 Features
- Vectorized and efficient FASTA parsing
- All-by-all pairing of lncRNA and protein sequences
- Automatic **feature extraction**:
  - **Rapid mode:** nucleotide and amino acid composition (%)
  - **LLM mode:** transformer-based embeddings (DNABERT2 + ESM2)
- Automatic model selection:
  - `catboost_model_rapid.joblib` → Composition model
  - `catboost_dnabert2_esm-t30.joblib` → Embedding model
- **GPU-aware** embedding generation (with safe fallback to CPU)
- Generates probability and binary interaction predictions
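
As a rough illustration of the Rapid-mode features, the sketch below computes percent composition for a nucleotide and a protein sequence and joins them into one vector. It is a minimal sketch under assumptions: the function names, the mapping of U to T, the feature order, and the concatenation step are all hypothetical and may differ from what the shipped CatBoost model expects.

```python
from collections import Counter

NUCLEOTIDES = "ACGT"                  # assumption: U is treated as T
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

def percent_composition(seq, alphabet):
    """Return the percentage of each alphabet symbol in seq (0.0 for empty input)."""
    seq = seq.upper()
    counts = Counter(seq)
    total = len(seq)
    return [100.0 * counts[ch] / total if total else 0.0 for ch in alphabet]

def pair_features(lnc_seq, prot_seq):
    """Concatenate nucleotide and amino acid composition into one feature vector."""
    lnc = percent_composition(lnc_seq.upper().replace("U", "T"), NUCLEOTIDES)
    prot = percent_composition(prot_seq, AMINO_ACIDS)
    return lnc + prot  # assumption: simple concatenation of the two blocks

features = pair_features("ACGUACGU", "MKTAYIAK")
print(len(features))  # 4 + 20 = 24 values
print(features[:4])   # [25.0, 25.0, 25.0, 25.0]
```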
---
## 🧰 Dependencies
The tool was developed using Python 3.10. Install the following dependencies before running the script:
```bash
pip install torch==2.6.0 transformers==4.57.0 catboost==1.2.8 joblib tqdm numpy pandas
```
---
## ⚙️ Usage
### 1️⃣ **Rapid (Composition-Based) Prediction**
This mode uses simple % composition features (very fast).
```bash
python lncrnapi_cli.py -lf ./data/example_lncRNA.fasta -pf ./data/example_protein.fasta -wd ./output -model rapid
```
**Model used:**
`./model/catboost_model_rapid.joblib`

---
### 2️⃣ **LLM (Embedding-Based) Prediction**
This mode uses transformer embeddings from **DNABERT2** (for lncRNA) and **ESM2-T30** (for protein).
```bash
python lncrnapi_cli.py -lf ./data/example_lncRNA.fasta -pf ./data/example_protein.fasta -wd ./output -model llm
```
**Model used:**
`./model/catboost_dnabert2_esm-t30.joblib`

---
### **Arguments**
| Argument | Description | Required |
|-----------|--------------|-----------|
| `-lf` | Path to the FASTA file containing lncRNA sequences. | ✅ |
| `-pf` | Path to the FASTA file containing protein sequences. | ✅ |
| `-wd` | Path to the working directory. | ✅ |
| `-model` | Model to use (`rapid` or `llm`). | ✅ |
| `-t` | Threshold | ❌ |
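
For readers who want to see how such an interface could be declared, here is a minimal `argparse` sketch that mirrors the table above. It is not taken from `lncrnapi_cli.py`: the help strings are paraphrased, the `rapid`/`llm` choices are inferred from the usage examples, and the default of `0.5` for `-t` is a placeholder because the real default is not documented here.

```python
import argparse

def build_parser():
    """Build a parser with the five options listed in the arguments table."""
    parser = argparse.ArgumentParser(prog="lncrnapi_cli.py")
    parser.add_argument("-lf", required=True, help="FASTA file with lncRNA sequences")
    parser.add_argument("-pf", required=True, help="FASTA file with protein sequences")
    parser.add_argument("-wd", required=True, help="working directory")
    parser.add_argument("-model", required=True, choices=["rapid", "llm"],
                        help="model to use")
    # Assumption: 0.5 is a placeholder; the tool's actual default is not documented.
    parser.add_argument("-t", type=float, default=0.5, help="threshold (optional)")
    return parser

# Parse an example command line equivalent to the Rapid usage shown earlier
args = build_parser().parse_args(
    ["-lf", "lnc.fasta", "-pf", "prot.fasta", "-wd", "./output", "-model", "rapid"]
)
print(args.model, args.t)
```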
---
## 💾 Output
A CSV file named `output.csv` is generated in the output directory:
| lncRNA_ID | Protein_ID | Interaction_Probability | Predicted_Label |
|------------|-------------|--------------------------|------------------|
| lnc1 | P12345 | 0.87 | 1 |
| lnc1 | P67890 | 0.34 | 0 |
| ... | ... | ... | ... |
- **Interaction_Probability:** Probability predicted by CatBoost
- **Predicted_Label:** 1 → interaction, 0 → non-interaction
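
The relation between the two output columns can be sketched as a simple cutoff on the probability. The example below reads a stand-in for `output.csv` built from the example rows in the table and re-labels them with a stricter cutoff, which can be done without rerunning the model. It rests on assumptions: the file is comma-separated, and the `0.5` default and the `>=` comparison are placeholders rather than the tool's documented behaviour; `to_label` is a hypothetical helper.

```python
import csv
import io

def to_label(probability, threshold=0.5):
    """Map a probability to 1 (interaction) or 0 (non-interaction)."""
    # Assumption: the 0.5 default and the ">=" comparison are placeholders.
    return 1 if probability >= threshold else 0

# Stand-in for output.csv, using the example rows from the table
sample = io.StringIO(
    "lncRNA_ID,Protein_ID,Interaction_Probability,Predicted_Label\n"
    "lnc1,P12345,0.87,1\n"
    "lnc1,P67890,0.34,0\n"
)
rows = list(csv.DictReader(sample))

labels = [to_label(float(r["Interaction_Probability"])) for r in rows]
print(labels)  # [1, 0]

# Stricter cutoff applied to the same probabilities
strict = [to_label(float(r["Interaction_Probability"]), threshold=0.9) for r in rows]
print(strict)  # [0, 0]
```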
---
## ⚡ Hardware Acceleration
The script automatically detects and uses available hardware:
- ✅ **CUDA GPU** (NVIDIA)
- ✅ **MPS** (Apple Silicon)
- ⚠️ **CPU** (fallback)
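
A common way to implement this kind of detection with PyTorch is shown below. This is a sketch of the general pattern, not the script's actual logic; `pick_device` is a hypothetical name, and the order of preference (CUDA, then MPS, then CPU) is assumed from the list above. The returned string can be passed to `torch.device(...)`.

```python
def pick_device():
    """Return 'cuda', 'mps', or 'cpu' depending on what PyTorch reports."""
    try:
        import torch
    except ImportError:
        return "cpu"  # PyTorch not installed: nothing to accelerate
    if torch.cuda.is_available():
        return "cuda"
    mps = getattr(torch.backends, "mps", None)  # absent in older PyTorch builds
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"

device = pick_device()
print(device)
```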
---
## 🛠 Troubleshooting
| Issue | Possible Cause | Solution |
|-------|----------------|-----------|
| `Model file not found` | Wrong `--model_path` | Check the file path |
| `No sequences found in FASTA` | Invalid FASTA format | Ensure `>` headers are present |
| `safetensors` error | Missing library | Install with `pip install safetensors` |
| Slow performance | CPU usage | Use GPU-enabled environment |
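
To check an input file for the FASTA issue before running the tool, a small stand-alone parser like the one below can help. It is a minimal sketch, not the tool's parser: `read_fasta` is a hypothetical helper, it uses the full header line as the ID, and it reuses the error text from the table only for illustration.

```python
def read_fasta(text):
    """Parse FASTA text into {header: sequence}; raise if no '>' header exists."""
    records = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            current = line[1:].strip()
            records[current] = []
        elif current is not None:
            records[current].append(line)  # sequence may span several lines
    if not records:
        raise ValueError("No sequences found in FASTA")
    return {name: "".join(chunks) for name, chunks in records.items()}

good = ">lnc1\nACGU\nACGU\n>lnc2\nGGCC\n"
parsed = read_fasta(good)
print(parsed)  # {'lnc1': 'ACGUACGU', 'lnc2': 'GGCC'}

try:
    read_fasta("ACGUACGU\n")  # sequence lines without a header
    message = None
except ValueError as err:
    message = str(err)
print(message)  # No sequences found in FASTA
```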
---
## 📜 Citation
If you use this tool in your research, please cite:
> **Choudhury et al.**
> *Prediction of lncRNA-protein interacting pairs using LLM embeddings based on evolutionary information* (2025)
---
## 🧩 Repository Structure
```
├── lncrnapi_cli.py # Main CLI script
├── README.md
├── LICENSE
├── test_lncrna.fa
├── test_protein.fa
├── output.csv
└── models/
    ├── catboost_model_rapid.joblib
    └── catboost_dnabert2_esm-t30.joblib
```
---
Raw data
{
"_id": null,
"home_page": "https://github.com/raghavagps/lncrnapi",
"name": "lncrnapi",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.9",
"maintainer_email": null,
"keywords": "lncrna protein interaction prediction bioinformatics catboost transformers",
"author": "Gajendra P.S. Raghava",
"author_email": "raghava@iiitd.ac.in",
"download_url": "https://files.pythonhosted.org/packages/1e/52/f3140613c6d265404aac937902445ab54443aab51e9cfa4e2a439ddae103/lncrnapi-1.2.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": null,
"summary": "A CLI tool for predicting lncRNA\u2013Protein interactions using transformer embeddings and CatBoost",
"version": "1.2",
"project_urls": {
"Homepage": "https://github.com/raghavagps/lncrnapi"
},
"split_keywords": [
"lncrna",
"protein",
"interaction",
"prediction",
"bioinformatics",
"catboost",
"transformers"
],
"urls": [
{
"comment_text": null,
"digests": {
"blake2b_256": "1245f45a6b087aca6cac649095afafc266f99a755cb825866d2a27efd42c9fa9",
"md5": "655ca15fdba1a8c622e3d62cc2bb2a88",
"sha256": "de50404660bb4439f91ae8c50dce6cbf5b812774a7d52ccd4dd1d32876f299f4"
},
"downloads": -1,
"filename": "lncrnapi-1.2-py3-none-any.whl",
"has_sig": false,
"md5_digest": "655ca15fdba1a8c622e3d62cc2bb2a88",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.9",
"size": 1906483,
"upload_time": "2025-11-01T11:08:38",
"upload_time_iso_8601": "2025-11-01T11:08:38.340040Z",
"url": "https://files.pythonhosted.org/packages/12/45/f45a6b087aca6cac649095afafc266f99a755cb825866d2a27efd42c9fa9/lncrnapi-1.2-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "1e52f3140613c6d265404aac937902445ab54443aab51e9cfa4e2a439ddae103",
"md5": "8f86a39a4ca23436c0ce48f57bcf9fbd",
"sha256": "413669292fb0d092040ead85aa03f02da91e21510ca3b2a3420fec7c814a1953"
},
"downloads": -1,
"filename": "lncrnapi-1.2.tar.gz",
"has_sig": false,
"md5_digest": "8f86a39a4ca23436c0ce48f57bcf9fbd",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.9",
"size": 1894814,
"upload_time": "2025-11-01T11:08:42",
"upload_time_iso_8601": "2025-11-01T11:08:42.917976Z",
"url": "https://files.pythonhosted.org/packages/1e/52/f3140613c6d265404aac937902445ab54443aab51e9cfa4e2a439ddae103/lncrnapi-1.2.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2025-11-01 11:08:42",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "raghavagps",
"github_project": "lncrnapi",
"travis_ci": false,
"coveralls": false,
"github_actions": false,
"lcname": "lncrnapi"
}