# biochemical-data-connectors
`biochemical-data-connectors` is a Python package for extracting chemical, biochemical, and bioactivity data from public databases like ChEMBL, PubChem, BindingDB, IUPHAR/BPS Guide to PHARMACOLOGY, and the Open Reaction Database (ORD).
## Overview
`biochemical-data-connectors` provides a simple and consistent interface for querying major cheminformatics and bioinformatics databases for bioactive compounds and chemical reactions. It is designed to be a modular and reusable tool for researchers and developers in computational chemistry and drug discovery, enabling the rapid curation of high-quality datasets for machine learning and analysis.
### Key Features
1. **Bioactive Compounds**
    * **Unified Interface**: A single, easy-to-use abstract base class for fetching bioactives for a given target.
    * **Multiple Data Sources**: Includes concrete connectors for major public databases:
        1. ChEMBL (`ChemblBioactivesExtractor`)
        2. PubChem (`PubChemBioactivesExtractor`)
        3. BindingDB (`BindingDbBioactivesConnector`)
        4. IUPHAR/BPS Guide to PHARMACOLOGY (`IUPHARBioactivesConnector`)
    * **Powerful Filtering**: Filter compounds by bioactivity type (e.g., Kd, IC50) and potency value.
    * **Efficient Fetching**: Uses concurrency to fetch data from APIs efficiently.
2. **Chemical Reactions**
    * **Local ORD Processing**: Includes a connector (`OpenReactionDatabaseConnector`) to efficiently process a local copy of the Open Reaction Database.
    * **Reaction Role Correction**: Uses RDKit to automatically correct and reassign reactant/product roles from the source data, improving data quality.
    * **Robust SMILES Extraction**: Canonicalizes and validates SMILES strings for both reactants and products to ensure high-quality, standardized output.
    * **Memory-Efficient Processing**: Employs a generator-based extraction method, allowing for iteration over massive reaction datasets with a low memory footprint.
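
The generator-based idea behind the last bullet can be illustrated with a small, dependency-free sketch. It is not the package's implementation: `iter_reactions`, `write_demo_file`, and `count_with_products` are hypothetical names, and gzipped JSON-lines files stand in for the ORD's own dataset files purely so the example runs with the standard library. The point is only that records are yielded one at a time, so memory use stays flat no matter how many files are read.

```python
import gzip
import json
import tempfile
from pathlib import Path
from typing import Iterator


def iter_reactions(data_dir: Path) -> Iterator[dict]:
    """Lazily yield one record at a time from every *.jsonl.gz file.

    Hypothetical helper: the file format is an assumption made for this
    sketch, not the format the package actually reads.
    """
    for path in sorted(data_dir.glob("*.jsonl.gz")):
        with gzip.open(path, "rt", encoding="utf-8") as handle:
            for line in handle:
                if line.strip():  # skip blank lines
                    yield json.loads(line)


def write_demo_file(path: Path, records: list[dict]) -> None:
    """Write records as gzipped JSON lines (demo data only)."""
    with gzip.open(path, "wt", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record) + "\n")


def count_with_products(data_dir: Path) -> int:
    """Consume the generator without holding all records in memory."""
    return sum(1 for rxn in iter_reactions(data_dir) if rxn.get("products"))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        demo_dir = Path(tmp)
        write_demo_file(demo_dir / "a.jsonl.gz", [
            {"reactants": ["CCO", "CC(=O)O"], "products": ["CC(=O)OCC"]},
            {"reactants": ["C=C"], "products": []},
        ])
        write_demo_file(demo_dir / "b.jsonl.gz", [
            {"reactants": ["c1ccccc1", "BrBr"], "products": ["Brc1ccccc1"]},
        ])
        print(count_with_products(demo_dir))
```

Because `iter_reactions` is a generator, a caller can filter, transform, or write out each reaction as it arrives instead of first building a list of the whole dataset.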
## Installation
Install the package from PyPI with pip (Python 3.10 or later is required):
```
pip install biochemical-data-connectors
```
## Quick Start
The example below retrieves all compounds from ChEMBL with a measured Kd of at most 1000 nM against the EGFR protein (UniProt ID: `P00533`).
```
import logging
from biochemical_data_connectors import ChEMBLConnector

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# 1. Instantiate the connector for the desired database
chembl_connector = ChEMBLConnector(
    bioactivity_measure='Kd',
    bioactivity_threshold=1000.0,  # in nM
    logger=logger
)

# 2. Specify the target's UniProt ID
target_uniprot_id = "P00533"  # EGFR

# 3. Get the bioactive compounds
print(f"Fetching bioactive compounds for {target_uniprot_id} from ChEMBL...")
smiles_list = chembl_connector.get_bioactive_compounds(target_uniprot_id)

# 4. Print the results
if smiles_list:
    print(f"\nFound {len(smiles_list)} compounds.")
    print("First 5 compounds:")
    for smiles in smiles_list[:5]:
        print(smiles)
else:
    print("No compounds found matching the criteria.")
```
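
Since the connectors are described as sharing one interface, results from several sources could in principle be combined with very little code. The sketch below is an illustration under assumptions, not part of the package: it assumes `get_bioactive_compounds` returns a list of SMILES strings (as the variable name `smiles_list` suggests), and `StaticConnector` and `merge_bioactives` are hypothetical names. The stand-in connector returns canned data so the example runs without network access.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Protocol


class BioactivesConnector(Protocol):
    """Anything exposing the method used in the Quick Start."""

    def get_bioactive_compounds(self, target_uniprot_id: str) -> list[str]: ...


class StaticConnector:
    """Hypothetical stand-in that returns canned SMILES instead of calling an API."""

    def __init__(self, smiles_by_target: dict[str, list[str]]) -> None:
        self._smiles_by_target = smiles_by_target

    def get_bioactive_compounds(self, target_uniprot_id: str) -> list[str]:
        return list(self._smiles_by_target.get(target_uniprot_id, []))


def merge_bioactives(
    connectors: list[BioactivesConnector], target_uniprot_id: str
) -> list[str]:
    """Query every connector concurrently; return unique SMILES in first-seen order."""
    with ThreadPoolExecutor(max_workers=max(1, len(connectors))) as pool:
        # pool.map preserves the order of `connectors`, so output is deterministic.
        batches = list(
            pool.map(lambda c: c.get_bioactive_compounds(target_uniprot_id), connectors)
        )
    # dict.fromkeys drops exact duplicates while keeping insertion order.
    return list(dict.fromkeys(s for batch in batches for s in batch))


if __name__ == "__main__":
    source_a = StaticConnector({"P00533": ["CCO", "c1ccccc1"]})
    source_b = StaticConnector({"P00533": ["c1ccccc1", "CCN"]})
    print(merge_bioactives([source_a, source_b], "P00533"))
    # prints ['CCO', 'c1ccccc1', 'CCN']
```

Deduplicating by exact string match only works when every source emits SMILES canonicalized the same way; otherwise the strings would need to be standardized first.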
## Package Structure
```
biochemical-data-connectors/
├── pyproject.toml
├── requirements-dev.txt
├── src/
│   └── biochemical_data_connectors/
│       ├── __init__.py
│       ├── constants.py
│       ├── models.py
│       ├── connectors/
│       │   ├── __init__.py
│       │   ├── ord_connectors.py
│       │   └── bioactive_compounds/
│       │       ├── __init__.py
│       │       ├── base_bioactives_connector.py
│       │       ├── bindingdb_bioactives_connector.py
│       │       ├── chembl_bioactives_connector.py
│       │       ├── iuphar_bioactives_connector.py
│       │       └── pubchem_bioactives_connector.py
│       └── utils/
│           ├── __init__.py
│           ├── files_utils.py
│           ├── iter_utils.py
│           ├── standardization_utils.py
│           └── api/
│               ├── __init__.py
│               ├── base_api.py
│               ├── bindingbd_api.py
│               ├── chembl_api.py
│               ├── iuphar_api.py
│               ├── mappings.py
│               └── pubchem_api.py
├── tests/
│   └── ...
└── README.md
```
## License
This project is licensed under the terms of the [MIT License](https://opensource.org/license/mit).
Raw data
{
"_id": null,
"home_page": null,
"name": "biochemical-data-connectors",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.10",
"maintainer_email": null,
"keywords": "bioinformatics, cheminformatics, drug discovery, chembl, pubchem, bioactivity, api",
"author": null,
"author_email": "Christopher van den Berg <cvandenberg1105@googlemail.com>",
"download_url": "https://files.pythonhosted.org/packages/5a/e4/59e91a717041deb544cdeb8b3cfcecee31a938ea3978300c45013af28ea8/biochemical_data_connectors-3.2.2.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "MIT License",
"summary": "A Python package to extract chemical, biochemical, and bioactivity data from public databases like ORD, ChEMBL and PubChem.",
"version": "3.2.2",
"project_urls": {
"Bug Tracker": "https://github.com/your-username/biochemical-data-connectors/issues",
"Repository": "https://github.com/your-username/biochemical-data-connectors"
},
"split_keywords": [
"bioinformatics",
" cheminformatics",
" drug discovery",
" chembl",
" pubchem",
" bioactivity",
" api"
],
"urls": [
{
"comment_text": null,
"digests": {
"blake2b_256": "9a2aa79d736199db9eb4afed367e3132f0ca8e87b238084a9262a8b19ec73935",
"md5": "cfb89ee0faacedd708d765db60c1a121",
"sha256": "0b31f8beda60056c6ad9063bc3b1be62f13caa69517ee1bfcedf2b89c373db42"
},
"downloads": -1,
"filename": "biochemical_data_connectors-3.2.2-py3-none-any.whl",
"has_sig": false,
"md5_digest": "cfb89ee0faacedd708d765db60c1a121",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.10",
"size": 38869,
"upload_time": "2025-08-05T00:28:53",
"upload_time_iso_8601": "2025-08-05T00:28:53.793737Z",
"url": "https://files.pythonhosted.org/packages/9a/2a/a79d736199db9eb4afed367e3132f0ca8e87b238084a9262a8b19ec73935/biochemical_data_connectors-3.2.2-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "5ae459e91a717041deb544cdeb8b3cfcecee31a938ea3978300c45013af28ea8",
"md5": "7049bd37b79a2b1dc3239267815d90ab",
"sha256": "dcc588bc2ad3fbec1d0c7d836d971b5fbd464b5b49e354ada73a761da0933a68"
},
"downloads": -1,
"filename": "biochemical_data_connectors-3.2.2.tar.gz",
"has_sig": false,
"md5_digest": "7049bd37b79a2b1dc3239267815d90ab",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.10",
"size": 27627,
"upload_time": "2025-08-05T00:28:55",
"upload_time_iso_8601": "2025-08-05T00:28:55.511380Z",
"url": "https://files.pythonhosted.org/packages/5a/e4/59e91a717041deb544cdeb8b3cfcecee31a938ea3978300c45013af28ea8/biochemical_data_connectors-3.2.2.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2025-08-05 00:28:55",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "your-username",
"github_project": "biochemical-data-connectors",
"github_not_found": true,
"lcname": "biochemical-data-connectors"
}