# iimi: identifying infection with machine intelligence
`iimi` is a Python package for plant virus diagnostics using high-throughput genome sequencing data. It provides tools for converting BAM files into coverage profiles, processing and visualizing genomic data while accounting for unreliable regions, and training machine learning models to predict viral infections.
## Installation
```bash
pip install iimi
```
## Usage
```python
import iimi
```
## Data Processing and Coverage Profile Generation
The indexed and sorted BAM file(s) are first converted into coverage profiles, which are then converted into a feature-extracted data frame. The coverage profiles are used to visualize the mapping information; the feature-extracted data frame is used for model training and prediction.
```python
import pandas as pd

# convert BAM files to coverage profiles
bam_files = ["path/to/sample1.sorted.bam", "path/to/sample2.sorted.bam"]
iimi.convert_bam_to_rle(bam_files)

# convert coverage profiles to a feature-extracted DataFrame
rle_data = {
    "sample1": {"seg1": [1, 2, 3, 0, 0, 4], "seg2": [0, 0, 0, 1, 1, 2]},
    "sample2": {"seg3": [2, 3, 4, 5, 0, 1]},
}
# optional nucleotide information for segments not covered by the bundled data
additional_info = pd.DataFrame({
    "virus_name": ["Virus4"],
    "iso_id": ["Iso4"],
    "seg_id": ["seg4"],
    "A_percent": [40],
    "C_percent": [20],
    "T_percent": [20],
    "GC_percent": [20],
    "seg_len": [800],
})
iimi.convert_rle_to_df(rle_data, additional_nucleotide_info=additional_info)
```
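The function name `convert_bam_to_rle` suggests that coverage is stored run-length encoded. As a rough illustration of that idea, and not of `iimi`'s actual internals, the sketch below encodes a per-base coverage vector as (value, run length) pairs and derives a few summary statistics from it. The helpers `run_length_encode`, `run_length_decode`, and `summarize`, as well as the feature names, are hypothetical.

```python
from itertools import groupby


def run_length_encode(coverage):
    """Collapse a per-base coverage vector into (value, run_length) pairs."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(coverage)]


def run_length_decode(runs):
    """Expand (value, run_length) pairs back into a per-base coverage vector."""
    return [value for value, length in runs for _ in range(length)]


def summarize(coverage):
    """Compute a few illustrative features (hypothetical, not iimi's feature set)."""
    n = len(coverage)
    return {
        "mean_cov": sum(coverage) / n,
        "max_cov": max(coverage),
        "frac_covered": sum(1 for c in coverage if c > 0) / n,
    }


coverage = [0, 0, 0, 5, 5, 7, 7, 7, 0]
runs = run_length_encode(coverage)
print(runs)  # [(0, 3), (5, 2), (7, 3), (0, 1)]
print(summarize(run_length_decode(runs)))
```

Run-length encoding keeps long stretches of identical depth (typically zero coverage) compact, which matters when many segments are stored per sample.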
## Handling Unreliable Regions
Unreliable regions are described by two profiles: high nucleotide content regions and a mappability profile. Identifying these regions helps eliminate false peaks.
### High Nucleotide Content Regions
The high nucleotide content profile lists the areas of a virus genome that have high GC content and/or a high percentage of A nucleotides.
```python
virus_info = {
    "seg1": "ATGCGATCGATCGATCGTACGATCGATCGATCGATCGTACGATCG",
    "seg2": "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA",
}

# identify regions with high GC content
iimi.create_high_nucleotide_content(
    gc=0.4, a=0.0, window=10, virus_info=virus_info
)
# identify regions with high A content
iimi.create_high_nucleotide_content(
    gc=0.0, a=0.8, window=10, virus_info=virus_info
)
```
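To make the windowed thresholding concrete, here is a minimal standalone sketch of how such regions could be found. It is not the package's implementation: `high_content_regions` is a hypothetical helper, the window slides one base at a time, regions are reported as 1-based inclusive coordinates, and passing `None` for a threshold disables that check (a convention of this sketch only).

```python
def high_content_regions(seq, gc=0.6, a=0.45, window=10):
    """Return merged 1-based (start, end) regions whose windows exceed a threshold.

    A window is flagged when its GC fraction is above `gc` or its A fraction
    is above `a`. Overlapping or adjacent flagged windows are merged.
    """
    seq = seq.upper()
    regions = []
    for start in range(0, len(seq) - window + 1):
        chunk = seq[start:start + window]
        gc_frac = (chunk.count("G") + chunk.count("C")) / window
        a_frac = chunk.count("A") / window
        flagged = (gc is not None and gc_frac > gc) or (a is not None and a_frac > a)
        if not flagged:
            continue
        begin, end = start + 1, start + window
        if regions and begin <= regions[-1][1] + 1:
            # extend the previous region instead of opening a new one
            regions[-1] = (regions[-1][0], end)
        else:
            regions.append((begin, end))
    return regions


seq = "T" * 10 + "A" * 10 + "T" * 10
print(high_content_regions(seq, gc=None, a=0.45, window=10))  # [(6, 25)]
```

The reported region is wider than the A-rich stretch itself because every window containing more than 45% A is flagged in full.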
### Mappability Profile
The mappability profile lists the areas of a virus genome that can be mapped to other viruses or to the host genome. This tool uses *Arabidopsis thaliana* as the host genome.
```python
# generate mappability profile from host or virus BAM files
iimi.create_mappability_profile(
    path_to_bam_files="path/to/bam/files",
    virus_info=virus_info,
    window=10,
)
```
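Conceptually, a mappability profile can be thought of as the union of all positions on a virus segment that host or other-virus reads align to. The sketch below illustrates only that merging step, under the assumption that the alignments have already been extracted from the BAM files as 1-based inclusive `(start, end)` pairs; `merge_intervals` and the `alignments` data are hypothetical and not part of `iimi`.

```python
def merge_intervals(intervals):
    """Merge overlapping or adjacent 1-based inclusive (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1] + 1:
            # overlaps or touches the previous region: extend it
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


# made-up alignment positions of host reads on two virus segments
alignments = {
    "seg1": [(50, 124), (1, 75), (300, 374)],
    "seg2": [],
}
profile = {seg: merge_intervals(hits) for seg, hits in alignments.items()}
print(profile)  # {'seg1': [(1, 124), (300, 374)], 'seg2': []}
```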
## Machine Learning Models to Predict Viral Infections
### Using Pre-trained Models
To use a provided model, pass your data as `newdata` and choose a `method`: `xgb`, `en`, or `rf`, which select the pre-trained XGBoost, elastic net, and random forest models, respectively. The prediction is `TRUE` if the virus infected the sample and `FALSE` if it did not.
```python
# predict using the pre-trained random forest model;
# df is the feature-extracted DataFrame produced by convert_rle_to_df()
iimi.predict_iimi(newdata=df, method="rf")
```
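When known diagnostics are available for the samples, predictions can be checked against them. The following is a minimal sketch, assuming that predictions and known labels are plain boolean sequences in the same order; `confusion_counts` and the example values are hypothetical and not part of `iimi`.

```python
def confusion_counts(predicted, actual):
    """Count true/false positives and negatives for paired boolean labels."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for p, a in zip(predicted, actual):
        if p and a:
            counts["tp"] += 1
        elif p and not a:
            counts["fp"] += 1
        elif not p and a:
            counts["fn"] += 1
        else:
            counts["tn"] += 1
    return counts


predicted = [True, True, False, False, True]   # made-up model output
actual = [True, False, False, True, True]      # made-up known diagnostics
counts = confusion_counts(predicted, actual)
precision = counts["tp"] / (counts["tp"] + counts["fp"])
recall = counts["tp"] / (counts["tp"] + counts["fn"])
print(counts)  # {'tp': 2, 'fp': 1, 'tn': 1, 'fn': 1}
print(round(precision, 3), round(recall, 3))
```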
### Training a Custom Model
The `train_iimi()` function trains a machine learning model on the feature-extracted data frame of plant samples (`train_x`) and the known target labels (`train_y`). It also supports three models: `xgb`, `en`, and `rf`.
```python
# train random forest model
rf_model = iimi.train_iimi(train_x, train_y, method="rf", ntree=100, mtry=2)
# train XGBoost model
xgb_model = iimi.train_iimi(train_x, train_y, method="xgb", nrounds=100)
# train elastic net model
en_model = iimi.train_iimi(train_x, train_y, method="en", k=5)
```
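The `k=5` argument in the elastic net call presumably sets the number of cross-validation folds; that reading is an assumption, not something this README states. Under that assumption, the sketch below shows how sample indices could be partitioned into `k` folds. `k_fold_indices` is a hypothetical helper, and the package may shuffle or stratify differently.

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train_idx, test_idx) index lists for k contiguous folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        # the first `extra` folds take one additional sample
        size = base + (1 if fold < extra else 0)
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


folds = list(k_fold_indices(12, k=5))
for train_idx, test_idx in folds:
    print(len(train_idx), test_idx)
```

Each sample lands in exactly one test fold, so every observation is used for validation once.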
## Visualizing Coverage Profiles
`plot_cov()` plots the coverage profile of a mapped plant sample together with the percentage of A nucleotides and the GC content, computed over a sliding k-mer window with a default step of 75 bases.
```python
covs = {
    "sample1": {
        "seg1": [20, 30, 50, 60, 80],
        "seg2": [15, 25, 45, 55, 75],
    }
}
virus_info = {
    "seg1": "ACGT" * 250,
    "seg2": "TGCA" * 250,
}

# plot coverage of segments without unreliable regions
iimi.plot_cov(
    covs,
    legend_status=True,
    nucleotide_status=True,
    virus_info=virus_info,
    unreliable_regions=None,
)
```
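The nucleotide tracks drawn alongside the coverage can be approximated with a few lines of plain Python. This is a sketch under stated assumptions, not `plot_cov()`'s actual computation: `windowed_percentages` is a hypothetical helper, and the window size is chosen freely here because the text above only specifies the default step.

```python
def windowed_percentages(seq, window=75, step=75):
    """Return (start, pct_A, pct_GC) for each window; start is 1-based."""
    seq = seq.upper()
    rows = []
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        pct_a = 100 * chunk.count("A") / window
        pct_gc = 100 * (chunk.count("G") + chunk.count("C")) / window
        rows.append((start + 1, pct_a, pct_gc))
    return rows


rows = windowed_percentages("ACGT" * 250, window=100, step=100)
print(len(rows))  # 10
print(rows[0])    # (1, 25.0, 50.0)
```

Plotting `pct_A` and `pct_GC` against `start` gives a track that can be compared by eye with coverage peaks at the same positions.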
## Sample Data and Models Provided
- `iimi/data/example_cov.pkl`: Coverage profiles of three plant samples, stored as a list
- `iimi/data/example_diag.pdl`: Known diagnostics results of virus segments. A matrix containing the known truth about the diagnostics result (using virus database version 1.4.0) for each plant sample in the example data
- `iimi/data/nucleotide_info.pkl`: Nucleotide information of virus segments. A data set containing the GC content and other information about the virus segments from the official Virtool virus database (version 1.4.0)
- `iimi/data/unreliable_regions.pkl`: The unreliable regions of the virus segments. A data frame of unmappable regions and of regions whose GC% or A% exceeds 60% or 45%, respectively
- `iimi/data/trained_rf.pkl`: A model trained using the default Random Forest settings
- `iimi/data/trained_xgb.model`: A model trained using the default XGBoost settings
- `iimi/data/trained_en.pkl`: A model trained using the default Elastic Net settings
## References
- H. Ning, I. Boyes, I. Numanagić, M. Rott, L. Xing, and X. Zhang, “Diagnostics of viral infections using high-throughput genome sequencing data,” Briefings in Bioinformatics, vol. 25, no. 6, Sep. 2024, doi: https://doi.org/10.1093/bib/bbae501.
- G. Sukhorukov, M. Khalili, O. Gascuel, T. Candresse, A. Marais-Colombel, and M. Nikolski, “VirHunter: A Deep Learning-Based Method for Detection of Novel RNA Viruses in Plant Sequencing Data,” Frontiers in Bioinformatics, vol. 2, May 2022, doi: https://doi.org/10.3389/fbinf.2022.867111.
Raw data
{
"_id": null,
"home_page": "https://github.com/jaspreetks/iimi",
"name": "iimi",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.7",
"maintainer_email": null,
"keywords": "bioinformatics, plant-virus, machine-learning, diagnostics",
"author": "Jaspreet S",
"author_email": null,
"download_url": "https://files.pythonhosted.org/packages/98/ed/82bb69e46cd751288abb7bfe4b2cfbdaf7b276f542acf77a4143d7e19a85/iimi-0.2.7.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "MIT",
"summary": "identifying plant infection with machine intelligence.",
"version": "0.2.7",
"project_urls": {
"Homepage": "https://github.com/jaspreetks/iimi"
},
"split_keywords": [
"bioinformatics",
" plant-virus",
" machine-learning",
" diagnostics"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "b17fa48daef2afaa63e32fe9564c384d1eec5956ac74b2a5679d725187e0f58e",
"md5": "152112955801559abe666d0b5dac05f5",
"sha256": "66e6fbfb14eff8dfb20ab01996b7bca4cdfb4e50c6e39978e6dc44458fa12ccf"
},
"downloads": -1,
"filename": "iimi-0.2.7-py3-none-any.whl",
"has_sig": false,
"md5_digest": "152112955801559abe666d0b5dac05f5",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.7",
"size": 450864,
"upload_time": "2024-12-13T07:46:19",
"upload_time_iso_8601": "2024-12-13T07:46:19.753035Z",
"url": "https://files.pythonhosted.org/packages/b1/7f/a48daef2afaa63e32fe9564c384d1eec5956ac74b2a5679d725187e0f58e/iimi-0.2.7-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "98ed82bb69e46cd751288abb7bfe4b2cfbdaf7b276f542acf77a4143d7e19a85",
"md5": "e654a02ac3fc7bdb990506730c147965",
"sha256": "dc75cc728b400a6a2186ab424601f44a54ab17ad4de13841c830f5d15991049c"
},
"downloads": -1,
"filename": "iimi-0.2.7.tar.gz",
"has_sig": false,
"md5_digest": "e654a02ac3fc7bdb990506730c147965",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.7",
"size": 447025,
"upload_time": "2024-12-13T07:46:21",
"upload_time_iso_8601": "2024-12-13T07:46:21.012928Z",
"url": "https://files.pythonhosted.org/packages/98/ed/82bb69e46cd751288abb7bfe4b2cfbdaf7b276f542acf77a4143d7e19a85/iimi-0.2.7.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-12-13 07:46:21",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "jaspreetks",
"github_project": "iimi",
"github_not_found": true,
"lcname": "iimi"
}