# BioCatalyzer
BioCatalyzer is a Python tool that predicts enzymatic metabolism products using a rule-based approach.
BioCatalyzer is implemented as a Command Line Interface that takes as input a set of compounds represented as SMILES
strings and outputs a set of predicted metabolic products and associated enzymes.
These metabolic products can then be matched with experimental MS data using the same tool.
## Installation
Installing from the PyPI package repository:
`pip install biocatalyzer`
Installing from GitHub:
1. clone the repository: `git clone https://github.com/jcorreia11/BioCatalyzer.git`
2. run: `python setup.py install`
## Command Line Interface
```bash
biocatalyzer_cli <PATH_TO_COMPOUNDS> <OUTPUT_DIRECTORY> [--neutralize=<BOOL>] [--reaction_rules=<FILE_PATH>] [--organisms=<FILE_PATH>] [--patterns_to_remove=<FILE_PATH>] [--molecules_to_remove=<FILE_PATH>] [--min_atom_count=<INT>] [--match_ms_data=<BOOL>] [--ms_data_path=<FILE_PATH>] [--tolerance=<FLOAT>] [--n_jobs=<INT>]
```
| Argument | Example | Description | Default |
|--------------------------------------|---------------------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------|
| compounds <PATH_TO_COMPOUNDS> | `file.tsv` or `"smiles1;smiles2;smiles3;etc"` | The path to the file containing the compounds to use as reactants, or a ;-separated string of SMILES.<sup>1</sup> | |
| output_directory <OUTPUT_DIRECTORY> | `output/directory/` | The path to the directory where the results are saved. | |
| neutralize | `True` or `False` | Whether to neutralize the compounds before predicting the products. In this case the new products will also be neutralized. | `False` |
| reaction_rules | `file.tsv` or `None` | The path to the file containing the reaction rules to use.<sup>2</sup> | [all_reaction_rules_forward_no_smarts_duplicates_sample.tsv](src/biocatalyzer/data/reactionrules/all_reaction_rules_forward_no_smarts_duplicates_sample.tsv) |
| organisms | `file.tsv` or `"org_id1;org_id2;org_id3;etc"` or `None` | The path to the file containing the organisms to use, or a ;-separated string of organism identifiers. Only the reaction rules associated with enzymes encoded by genes from these organisms are selected.<sup>3</sup> | All reaction rules are used. |
| patterns_to_remove | `patterns.tsv` or `None` | The path to the file containing the patterns to remove from the products. <sup>4</sup> | [patterns.tsv](src/biocatalyzer/data/patterns_to_remove/patterns.tsv) |
| molecules_to_remove | `molecules.tsv` or `None` | The path to the file containing the molecules to remove from the products. <sup>5</sup> | [byproducts.tsv](src/biocatalyzer/data/byproducts_to_remove/byproducts.tsv) |
| min_atom_count | `4` | The minimum number of heavy atoms a product must have. | `5` |
| match_ms_data | `True` or `False` | Whether to match the predicted products to the MS data. | `False` |
| ms_data_path | `ms_data.tsv` | The path to the file containing the MS data. <sup>6</sup> | `None` |
| tolerance | `0.02` | The mass tolerance to use when matching masses. | `0.02` |
| n_jobs | `6` | The number of jobs to run in parallel (-1 uses all). | `1` |
### Compounds
See [drugs.csv](src/biocatalyzer/data/compounds/drugs.csv)<sup>1</sup> for an example.
The file must be tab-separated and contain the following columns:
- `smiles` - the SMILES representation of the compounds;
- `compound_id` - the compounds identifiers.
Alternatively, the compounds can be passed as a ;-separated string of SMILES representations.
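
As an illustration of the two input forms described above, the following sketch writes a tab-separated compounds table with the `smiles` and `compound_id` columns and builds the equivalent ;-separated string. The helper names (`write_compounds`, `to_smiles_argument`) and the example compounds are hypothetical and are not part of BioCatalyzer; the column order simply follows the list above.

```python
import csv
import io


def write_compounds(handle, compounds):
    """Write (compound_id, smiles) pairs as a tab-separated table.

    Hypothetical helper: only the column names come from the README.
    """
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["smiles", "compound_id"])
    for compound_id, smiles in compounds:
        writer.writerow([smiles, compound_id])


def to_smiles_argument(compounds):
    """Join the SMILES strings with ';' for use as the compounds argument."""
    return ";".join(smiles for _, smiles in compounds)


# Example compounds chosen for illustration only.
compounds = [
    ("aspirin", "CC(=O)OC1=CC=CC=C1C(=O)O"),
    ("caffeine", "CN1C=NC2=C1C(=O)N(C(=O)N2C)C"),
]

# An in-memory buffer stands in for a real file such as `file.tsv`.
buffer = io.StringIO()
write_compounds(buffer, compounds)
text = buffer.getvalue()
smiles_argument = to_smiles_argument(compounds)
print(text)
print(smiles_argument)
```

To produce a real input file, the same function can be called with `open("file.tsv", "w", newline="")` in place of the buffer.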
### Output directory
The output path must be a directory. The results will be saved in the following files:
- `new_compounds.tsv` - the predicted products;
- `matches.tsv` (if `match_ms_data` is set to `True`) - the matches between the predicted products and the MS data.
### Neutralize
If set to `True`, the compounds will be neutralized before predicting the products. In this case the new products will
also be neutralized.
### Reaction Rules
See [all_reaction_rules_forward_no_smarts_duplicates_sample.tsv](src/biocatalyzer/data/reactionrules/all_reaction_rules_forward_no_smarts_duplicates_sample.tsv)<sup>2</sup>
for an example.
The file must be tab-separated and contain the following columns:
- `InternalID` - The ID of the Reaction Rule.
- `Reactants` - The Reactants of the Reaction Rule. Coreactants must be defined by their ID as in the Coreactants file.
  The compound to match must be identified by the string 'Any'. The format must be: `coreactant1_id;Any;coreactant2_id`.
  The order in which the reactants and the compound to match are defined is relevant and specific to the Reaction Rule.
  If the Reaction Rule is mono-component (i.e. it does not contain any additional coreactant), the format must be: `Any`.
- `SMARTS` - The SMARTS representation of the Reaction Rule.
- `EC_Numbers` - The EC Numbers associated with the Reaction Rule.
- `Organisms` - The Organisms associated with the Reaction Rule.
By default our set of reaction rules is used.
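
To make the `Reactants` format concrete, here is a minimal sketch that splits such a field and locates the position of the compound to match. The function `parse_reactants` is hypothetical and is not BioCatalyzer's own parser; requiring exactly one `Any` entry is an assumption made for this illustration.

```python
def parse_reactants(reactants):
    """Split a `Reactants` field into the position of 'Any' and the coreactant IDs.

    Hypothetical helper. Assumes exactly one 'Any' placeholder per rule.
    """
    parts = reactants.split(";")
    if parts.count("Any") != 1:
        raise ValueError(f"expected exactly one 'Any' in {reactants!r}")
    # The position matters because reactant order is specific to each rule.
    position = parts.index("Any")
    coreactants = [part for part in parts if part != "Any"]
    return position, coreactants


# Multi-component rule: the compound to match sits between two coreactants.
print(parse_reactants("coreactant1_id;Any;coreactant2_id"))
# Mono-component rule: no coreactants at all.
print(parse_reactants("Any"))
```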
### Organisms
All organism identifiers listed at
[https://www.genome.jp/kegg/catalog/org_list.html](https://www.genome.jp/kegg/catalog/org_list.html) are allowed.
Example:
[hsa](https://www.genome.jp/kegg-bin/show_organism?org=hsa) is for *Homo sapiens* (human).
[eco](https://www.genome.jp/kegg-bin/show_organism?org=eco) is for *Escherichia coli K-12 MG1655*.
[sce](https://www.genome.jp/kegg-bin/show_organism?org=sce) is for *Saccharomyces cerevisiae* (budding yeast).
To use your own organisms, see
[organisms_to_use.tsv](src/biocatalyzer/data/organisms/organisms_to_use.tsv)<sup>3</sup> for an example.
The file must be tab-separated and contain a column named `org_id` with the organisms' identifiers (KEGG identifiers).
Alternatively, the organisms can be passed as a ;-separated string of organism identifiers.
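
The two ways of passing organisms are interchangeable. As a sketch, the snippet below reads the `org_id` column from a tab-separated table and joins the identifiers into the ;-separated form. `read_organism_ids` is a hypothetical helper, not a BioCatalyzer function, and the in-memory table stands in for a real file.

```python
import csv
import io


def read_organism_ids(handle):
    """Return the non-empty values of the `org_id` column of a tab-separated table."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [row["org_id"].strip() for row in reader if row["org_id"].strip()]


# In-memory stand-in for an organisms file with a single `org_id` column.
example_file = io.StringIO("org_id\nhsa\neco\nsce\n")
organism_ids = read_organism_ids(example_file)

# Equivalent string form for the `--organisms` option.
organisms_argument = ";".join(organism_ids)
print(organisms_argument)  # hsa;eco;sce
```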
### Patterns to remove
To use your own patterns to remove, see
[patterns.tsv](src/biocatalyzer/data/patterns_to_remove/patterns.tsv)<sup>4</sup> for an example.
The file must be tab-separated and contain a column named `smarts` with the SMARTS representation of the patterns to remove.
### Molecules to remove
To use your own molecules to remove, see
[byproducts.tsv](src/biocatalyzer/data/byproducts_to_remove/byproducts.tsv)<sup>5</sup> for an example.
The file must be tab-separated and contain a column named `smiles` with the SMILES representation of the molecules to remove.
### Match MS data
If set to `True`, the predicted products will be matched to the MS data.
In this case the `ms_data_path` must be set.
### MS data path
See [ms_data.tsv](src/biocatalyzer/data/ms_data_example/ms_data_paper.tsv)<sup>6</sup> for an example.
The file must be tab-separated and contain the following columns:
- `ParentCompound` - the parent/original compound identifiers.
- `ParentCompoundSmiles` - the SMILES representation of the compounds (optional).
- `Mass` - the mass of the molecule.
### Mass Tolerance
The mass tolerance (`float`) to use when matching masses, set with the `tolerance` argument. Masses between `mass - tolerance` and `mass + tolerance` are considered a match.
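
The matching window can be sketched as a simple absolute-difference check. This is an illustration of the rule stated above, not the tool's actual implementation: the function name `is_match` is hypothetical, and treating the window bounds as inclusive is an assumption.

```python
def is_match(observed_mass, predicted_mass, tolerance=0.02):
    """Return True when the two masses differ by at most `tolerance`.

    Hypothetical helper. Inclusive bounds are an assumption.
    """
    return abs(observed_mass - predicted_mass) <= tolerance


# Masses chosen for illustration only.
print(is_match(180.042, 180.050))                # difference of 0.008, inside the default window
print(is_match(180.100, 180.050))                # difference of 0.05, outside the default window
print(is_match(180.100, 180.050, tolerance=0.1)) # inside a wider window
```

A wider tolerance returns more candidate matches at the cost of more false positives, so the value is best chosen from the mass accuracy of the instrument that produced the MS data.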
### Number of jobs
The number of jobs to run in parallel. If `-1` is passed, all available cores will be used.
### Usage example
```bash
biocatalyzer_cli file.tsv output_dir/ --neutralize=True --reaction_rules=reaction_rules.tsv --organisms="hsa;eco;sce" --patterns_to_remove=patterns.tsv --molecules_to_remove=byproducts.tsv --match_ms_data=True --ms_data_path=ms_data.tsv --tolerance=0.1 --n_jobs=-1
```
For predicting compound metabolism only:
```bash
biocatalyzer_cli file.tsv output_dir/ --neutralize=True --reaction_rules=reaction_rules.tsv --organisms="hsa;eco;sce" --patterns_to_remove=patterns.tsv --molecules_to_remove=byproducts.tsv --n_jobs=-1
```
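
When the CLI is driven from a Python script, for example to process several compound files in a loop, the command can be assembled as an argument list. The sketch below only builds and prints the command; `build_command` is a hypothetical helper, and the option names are the ones listed in the table above.

```python
import shlex


def build_command(compounds, output_dir, **options):
    """Assemble a `biocatalyzer_cli` call as an argument list.

    Hypothetical helper: each keyword becomes a `--name=value` option.
    """
    command = ["biocatalyzer_cli", compounds, output_dir]
    for name, value in options.items():
        command.append(f"--{name}={value}")
    return command


command = build_command(
    "file.tsv",
    "output_dir/",
    neutralize=True,
    organisms="hsa;eco;sce",
    n_jobs=-1,
)

# shlex.join quotes arguments containing ';' so the line is safe to paste into a shell.
print(shlex.join(command))
```

Passing the list to `subprocess.run(command, check=True)` would run the tool without going through a shell, so the `;` characters in the organisms string need no extra quoting.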
## Individual CLIs
Both parts of this CLI, the generation of new compounds (`bioreactor_cli`) and the matching with the MS data
(`matcher_cli`), can be run individually.
For the `bioreactor_cli` see [readme_bioreactor_cli.md](readme_bioreactor_cli.md).
For the `matcher_cli` see [readme_matcher_cli.md](readme_matcher_cli.md).
## Cite
Manuscript under preparation!
## Credits and License
Developed at Centre of Biological Engineering, University of Minho and EMBL Heidelberg (Zimmermann-Kogadeeva Group).
This project has received funding from the Portuguese FCT and EMBL CPP Scientific Visitors Fellowships.
Released under an MIT License. <!-- # TODO: check if licence is in accordance with packages/data used -->