lncrnapi


Name: lncrnapi
Version: 1.2
home_page: https://github.com/raghavagps/lncrnapi
Summary: A CLI tool for predicting lncRNA–Protein interactions using transformer embeddings and CatBoost
upload_time: 2025-11-01 11:08:42
maintainer: None
docs_url: None
author: Gajendra P.S. Raghava
requires_python: >=3.9
license: None
keywords: lncrna protein interaction prediction bioinformatics catboost transformers
# 🧬 lncrna-PI - LncRNA–Protein Interaction Prediction

lncrnaPI is a command-line tool for predicting **lncRNA–Protein interactions** using **pre-trained language models (DNABERT-2 and ESM-2)** for sequence embedding and a **CatBoost classifier** for interaction probability estimation.

It supports two modes:
- **Rapid**: based on sequence composition features (fast, lightweight)
- **LLM**: based on transformer embeddings (DNABERT-2 + ESM-2)

The script performs **all-by-all predictions** between every lncRNA and every protein sequence in the provided FASTA files.
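
As a rough illustration of what all-by-all pairing means, the sketch below builds every lncRNA/protein combination with `itertools.product`. The record lists and the `all_by_all` helper are hypothetical and are not taken from the tool's source.

```python
from itertools import product

# Hypothetical parsed FASTA records: (sequence_id, sequence) tuples.
lncrnas = [("lnc1", "ACGUACGU"), ("lnc2", "GGCAUU")]
proteins = [("P12345", "MKTAYIAK"), ("P67890", "MLSRAV"), ("Q11111", "MGSS")]


def all_by_all(lncrna_records, protein_records):
    """Return one (lncRNA_ID, Protein_ID) pair for every combination."""
    return [
        (l_id, p_id)
        for (l_id, _), (p_id, _) in product(lncrna_records, protein_records)
    ]


pairs = all_by_all(lncrnas, proteins)
print(len(pairs))  # 2 lncRNAs x 3 proteins = 6 pairs
```

The number of pairs is the product of the two input sizes, so large FASTA files produce correspondingly large result tables.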

---

## 📦 Features

- Vectorized and efficient FASTA parsing  
- All-by-all pairing of lncRNA and protein sequences  
- Automatic **feature extraction**:
  - **Rapid mode:** nucleotide and amino acid composition (%)
  - **LLM mode:** transformer-based embeddings (DNABERT2 + ESM2)
- Automatic model selection:
  - `catboost_model_rapid.joblib` → Composition model
  - `catboost_dnabert2_esm-t30.joblib` → Embedding model
- **GPU-aware** embedding generation (with safe fallback to CPU)
- Generates probability and binary interaction predictions
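
To make the Rapid-mode features concrete, here is a minimal sketch of percentage composition for one lncRNA/protein pair. The alphabets, the U-to-T mapping, the feature order, and the helper names are assumptions for illustration; the README does not specify how the tool encodes its features.

```python
from collections import Counter

NUCLEOTIDES = "ACGT"                  # assumed alphabet; U is mapped to T below
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def composition(sequence, alphabet):
    """Percentage of each alphabet symbol in the sequence (zeros if empty)."""
    total = len(sequence)
    if total == 0:
        return [0.0] * len(alphabet)
    counts = Counter(sequence)
    # Symbols outside the alphabet (e.g. N or X) count toward the total only.
    return [100.0 * counts[symbol] / total for symbol in alphabet]


def pair_features(lncrna_seq, protein_seq):
    """Concatenate nucleotide and amino acid composition into one vector."""
    rna = lncrna_seq.upper().replace("U", "T")
    return composition(rna, NUCLEOTIDES) + composition(protein_seq.upper(), AMINO_ACIDS)


features = pair_features("ACGUACGU", "MKTAYIAK")
print(len(features))  # 4 nucleotide values + 20 amino acid values = 24
```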

---

## 🧰 Dependencies

The tool was developed using Python 3.10. Install the following dependencies before running the script:

```bash
pip install torch==2.6.0 transformers==4.57.0 catboost==1.2.8 joblib tqdm numpy pandas
```
---

## ⚙️ Usage

### 1️⃣ **Rapid (Composition-Based) Prediction**

This mode uses simple percentage-composition features and is very fast.

```bash
python lncrnapi_cli.py -lf ./data/example_lncRNA.fasta -pf ./data/example_protein.fasta -wd ./output -model rapid
```

**Model used:**  
`./model/catboost_model_rapid.joblib`

---

### 2️⃣ **LLM (Embedding-Based) Prediction**

This mode uses transformer embeddings from **DNABERT2** (for lncRNA) and **ESM2-T30** (for protein).

```bash
python lncrnapi_cli.py -lf ./data/example_lncRNA.fasta -pf ./data/example_protein.fasta -wd ./output -model llm
```

**Model used:**  
`./model/catboost_dnabert2_esm-t30.joblib`
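
One common way to turn per-token transformer outputs into a fixed-length pair representation is to mean-pool each sequence and concatenate the two results. The sketch below shows that idea on toy numbers only. The README does not say which pooling or combination the tool actually uses, so `mean_pool` and the toy vectors are purely illustrative assumptions.

```python
def mean_pool(token_vectors):
    """Average per-token vectors into a single fixed-length vector."""
    n = len(token_vectors)
    dim = len(token_vectors[0])
    return [sum(vec[i] for vec in token_vectors) / n for i in range(dim)]


# Toy stand-ins for per-token model outputs (real embeddings are far wider).
rna_tokens = [[1.0, 3.0], [3.0, 5.0]]
protein_tokens = [[0.0, 2.0, 4.0], [2.0, 4.0, 6.0], [4.0, 6.0, 8.0]]

# Assumed combination: concatenate the lncRNA and protein vectors.
pair_vector = mean_pool(rna_tokens) + mean_pool(protein_tokens)
print(pair_vector)  # [2.0, 4.0, 2.0, 4.0, 6.0]
```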

---

### **Arguments**

| Argument | Description | Required |
|-----------|--------------|-----------|
| `-lf` | Path to the FASTA file containing lncRNA sequences. | ✅ |
| `-pf` | Path to the FASTA file containing protein sequences. | ✅ |
| `-wd` | Path to the working directory. | ✅ |
| `-model` | Model to use: `rapid` or `llm`. | ✅ |
| `-t` | Probability threshold for the binary prediction. | ❌ |

---
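
The argument table can be mirrored with a small `argparse` parser, shown here as a sketch rather than the tool's actual code. The `build_parser` helper, the help strings, and especially the `0.5` default for `-t` are assumptions; the README does not state the real default.

```python
import argparse


def build_parser():
    """Illustrative parser mirroring the documented arguments."""
    parser = argparse.ArgumentParser(
        description="lncRNA-protein interaction prediction (illustrative parser)"
    )
    parser.add_argument("-lf", required=True, help="FASTA file with lncRNA sequences")
    parser.add_argument("-pf", required=True, help="FASTA file with protein sequences")
    parser.add_argument("-wd", required=True, help="working directory")
    parser.add_argument("-model", required=True, choices=["rapid", "llm"],
                        help="prediction mode")
    # Assumed default; the real CLI may use a different value.
    parser.add_argument("-t", type=float, default=0.5, help="probability threshold")
    return parser


args = build_parser().parse_args(
    ["-lf", "lnc.fasta", "-pf", "prot.fasta", "-wd", "./output", "-model", "rapid"]
)
print(args.model, args.t)
```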

## 💾 Output

A CSV file named `output.csv` is generated in the working directory given by `-wd`:

| lncRNA_ID | Protein_ID | Interaction_Probability | Predicted_Label |
|------------|-------------|--------------------------|------------------|
| lnc1 | P12345 | 0.87 | 1 |
| lnc1 | P67890 | 0.34 | 0 |
| ... | ... | ... | ... |

- **Interaction_Probability:** Probability predicted by CatBoost  
- **Predicted_Label:** 1 → interaction, 0 → non-interaction
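
For downstream filtering, a results file with the columns above can be read with the standard `csv` module. This is a minimal sketch that assumes a comma-separated file with exactly these headers. The in-memory `sample`, the `predicted_interactions` helper, and the `0.5` cutoff with a `>=` comparison are illustrative choices, not documented behaviour of the tool.

```python
import csv
import io

# In-memory stand-in for output.csv, using the documented column names.
sample = io.StringIO(
    "lncRNA_ID,Protein_ID,Interaction_Probability,Predicted_Label\n"
    "lnc1,P12345,0.87,1\n"
    "lnc1,P67890,0.34,0\n"
)


def predicted_interactions(handle, threshold=0.5):
    """Return (lncRNA_ID, Protein_ID, probability) rows at or above the cutoff."""
    hits = []
    for row in csv.DictReader(handle):
        probability = float(row["Interaction_Probability"])
        if probability >= threshold:  # assumed comparison; the tool may differ
            hits.append((row["lncRNA_ID"], row["Protein_ID"], probability))
    return hits


hits = predicted_interactions(sample)
print(hits)  # [('lnc1', 'P12345', 0.87)]
```

To apply the same filter to a real run, pass an open file handle for `output.csv` instead of `sample`.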

---

## ⚡ Hardware Acceleration

The script automatically detects and uses available hardware:

- ✅ **CUDA GPU** (NVIDIA)
- ✅ **MPS** (Apple Silicon)
- ⚠️ **CPU** (fallback)
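
Device detection of this kind is usually written with PyTorch's availability checks. The sketch below assumes a CUDA, then MPS, then CPU preference order, which the README implies but does not confirm. `pick_device` is a hypothetical helper, and it falls back to CPU if PyTorch is not installed.

```python
def pick_device():
    """Return 'cuda', 'mps', or 'cpu' depending on what PyTorch reports."""
    try:
        import torch
    except ImportError:
        return "cpu"  # no PyTorch available, so nothing to accelerate
    if torch.cuda.is_available():
        return "cuda"
    # torch.backends.mps exists only in newer PyTorch builds.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


device = pick_device()
print(device)
```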

---

## 🛠 Troubleshooting

| Issue | Possible Cause | Solution |
|-------|----------------|-----------|
| `Model file not found` | Wrong `--model_path` | Check the file path |
| `No sequences found in FASTA` | Invalid FASTA format | Ensure `>` headers are present |
| `safetensors` error | Missing library | Install with `pip install safetensors` |
| Slow performance | CPU usage | Use GPU-enabled environment |
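
When the `No sequences found in FASTA` message appears, a quick independent check of the input can help. The following is a minimal FASTA reader written for this purpose. It is not the tool's parser, and `read_fasta` is a hypothetical helper that takes the first whitespace-delimited token of each header as the ID.

```python
def read_fasta(lines):
    """Parse FASTA lines into (sequence_id, sequence) tuples."""
    records = []
    header, chunks = None, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            # Use the first token after '>' as the ID (empty if none given).
            header, chunks = (line[1:].split() or [""])[0], []
        elif header is not None:
            chunks.append(line)
        # Sequence lines before any '>' header are ignored.
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


valid = read_fasta([">lnc1 example", "ACGU", "ACGU", ">lnc2", "GG"])
missing_header = read_fasta(["ACGUACGU"])
print(len(valid), len(missing_header))  # 2 0
```

A file that yields zero records here is missing its `>` header lines.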

---

## 📜 Citation

If you use this tool in your research, please cite:

> **Choudhury et al.**  
> *Prediction of lncRNA-protein interacting pairs using LLM embeddings based on evolutionary information* (2025)

---

## 🧩 Repository Structure

```
├── lncrnapi_cli.py       # Main CLI script
├── README.md
├── LICENSE
├── test_lncrna.fa
├── test_protein.fa
├── output.csv
└── models/
    ├── catboost_model_rapid.joblib
    └── catboost_dnabert2_esm-t30.joblib
```

---


            

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/raghavagps/lncrnapi",
    "name": "lncrnapi",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.9",
    "maintainer_email": null,
    "keywords": "lncrna protein interaction prediction bioinformatics catboost transformers",
    "author": "Gajendra P.S. Raghava",
    "author_email": "raghava@iiitd.ac.in",
    "download_url": "https://files.pythonhosted.org/packages/1e/52/f3140613c6d265404aac937902445ab54443aab51e9cfa4e2a439ddae103/lncrnapi-1.2.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": null,
    "summary": "A CLI tool for predicting lncRNA\u2013Protein interactions using transformer embeddings and CatBoost",
    "version": "1.2",
    "project_urls": {
        "Homepage": "https://github.com/raghavagps/lncrnapi"
    },
    "split_keywords": [
        "lncrna",
        "protein",
        "interaction",
        "prediction",
        "bioinformatics",
        "catboost",
        "transformers"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "1245f45a6b087aca6cac649095afafc266f99a755cb825866d2a27efd42c9fa9",
                "md5": "655ca15fdba1a8c622e3d62cc2bb2a88",
                "sha256": "de50404660bb4439f91ae8c50dce6cbf5b812774a7d52ccd4dd1d32876f299f4"
            },
            "downloads": -1,
            "filename": "lncrnapi-1.2-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "655ca15fdba1a8c622e3d62cc2bb2a88",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.9",
            "size": 1906483,
            "upload_time": "2025-11-01T11:08:38",
            "upload_time_iso_8601": "2025-11-01T11:08:38.340040Z",
            "url": "https://files.pythonhosted.org/packages/12/45/f45a6b087aca6cac649095afafc266f99a755cb825866d2a27efd42c9fa9/lncrnapi-1.2-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "1e52f3140613c6d265404aac937902445ab54443aab51e9cfa4e2a439ddae103",
                "md5": "8f86a39a4ca23436c0ce48f57bcf9fbd",
                "sha256": "413669292fb0d092040ead85aa03f02da91e21510ca3b2a3420fec7c814a1953"
            },
            "downloads": -1,
            "filename": "lncrnapi-1.2.tar.gz",
            "has_sig": false,
            "md5_digest": "8f86a39a4ca23436c0ce48f57bcf9fbd",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.9",
            "size": 1894814,
            "upload_time": "2025-11-01T11:08:42",
            "upload_time_iso_8601": "2025-11-01T11:08:42.917976Z",
            "url": "https://files.pythonhosted.org/packages/1e/52/f3140613c6d265404aac937902445ab54443aab51e9cfa4e2a439ddae103/lncrnapi-1.2.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-11-01 11:08:42",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "raghavagps",
    "github_project": "lncrnapi",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "lcname": "lncrnapi"
}
        