| Field | Value |
|-------|-------|
| Name | molecule-resolver |
| Version | 0.3.2 |
| Summary | A package to use several web services to find molecule structures, synonyms and CAS. |
| upload_time | 2024-12-31 23:36:53 |
| author | Simon Muller |
| requires_python | <3.12,>=3.9 |
| license | MIT |
# MoleculeResolver
**MoleculeResolver** was born out of the need to quickly annotate large datasets with accurate structural information and to cross-check whether the given metadata (name, SMILES) agree with each other. It also makes it possible to efficiently check whether structures are present in both of two large datasets.
In short, it is a Python module that allows you to retrieve molecular structures from multiple chemical databases, perform cross-checks to ensure data reliability, and standardize the best representation of molecules. It also provides functions for comparing molecules and sets of molecules based on specific configurations. This makes it a useful tool for researchers, chemists, or anyone working in computational chemistry / cheminformatics who needs to ensure they are working with the best available data for a molecule.
## Installation
The package is available on [PyPI](https://pypi.org/project/molecule-resolver/):
```sh
pip install molecule-resolver
```
The source code is available on GitHub: [https://github.com/MoleculeResolver/molecule-resolver](https://github.com/MoleculeResolver/molecule-resolver)
## Features
- **🔍 Retrieve Molecular Structures**: Fetch molecular structures from different chemical databases, including PubChem, CompTox, Chemeo, and others.
- **🆔 Support for Different Identifier Types**: Retrieve molecular structures using a variety of identifier types, including CAS numbers, SMILES, InChI, InChIKey, and common names.
- **✅ Cross-check Capabilities**: Use data from multiple sources to verify molecular structures and identify the best representation.
- **🔄 Molecule Comparison**: Compare molecules or sets of molecules based on their structure, properties, and specified ⚙️ configurations.
- **⚙️ Standardization**: Standardize molecular structures, including handling isomers, tautomers, and isotopes.
- **💾 Caching Mechanism**: Use local caching to store molecules and reduce the number of repeated requests to external services, improving performance and reducing latency.
## Services used
At the moment, the following services are used to get the best structure for a given identifier. In the future, this list might be revised to improve performance by adding new services or removing some.
If you want an additional service to be supported, open an issue or a pull request.
MoleculeResolver does not expose every option or configuration that each service offers through its dedicated wrapper (see the Repos column). It focuses on getting the structure for the given identifiers as accurately as possible while still being fast, using parallelization under the hood.
| Service | Name | CAS | Formula | SMILES | InChI | InChIKey | CID | Batch search | Repos |
|-------------------------------------------------------------------------|------|-----|---------|--------|-------|----------|-----|--------------------|------------------------------------------------------------------------------|
| [cas_registry](https://commonchemistry.cas.org/) | ✅ | ✅ | ❌ | ✅ | ✅ | ❌ | ❌ | ❌ | |
| [chebi](https://www.ebi.ac.uk/chebi/) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ | |
| [chemeo](https://www.chemeo.com/) | ✅ | ✅ | ❌ | ✅ | ✅ | ✅ | ❌ | ❌ | |
| [cir](https://cactus.nci.nih.gov/chemical/structure) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ | - [CIRpy](https://github.com/mcs07/CIRpy "wrapper for the CIR. FYI, CIR uses OPSIN under the hood, unless specified otherwise.") |
| [comptox](https://comptox.epa.gov/dashboard) | ✅ | ✅ | ❌ | ❌ | ❌ | ✅ | ❌ | ✅ | |
| [cts](https://cts.fiehnlab.ucdavis.edu/) | (✅) | ✅ | ❌ | ✅ | ❌ | ❌ | ❌ | ❌ | |
| [nist](https://webbook.nist.gov/chemistry/) | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ | ❌ | |
| [opsin](https://opsin.ch.cam.ac.uk/) | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ✅ | - [py2opsin](https://github.com/JacksonBurns/py2opsin "lightweight OPSIN wrapper only depending on having Java installed.") <br> - [pyopsin](https://github.com/Dingyun-Huang/pyopsin "lightweight OPSIN wrapper depending on having Java installed + additional dependencies.") |
| [pubchem](https://pubchem.ncbi.nlm.nih.gov/) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | - [PubChemPy](https://github.com/mcs07/PubChemPy "wrapper for the pubchem PUG API") |
| [srs](https://cdxapps.epa.gov/oms-substance-registry-services/search) | ✅ | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ | ✅ | |
ChemSpider was not used as it is already included in CIR [[1]](https://matt-swain.com/blog/2012-03-20-cirpy-python-nci-chemical-identifier-resolver) [[2]](https://cactus.nci.nih.gov/blog/?p=1456) [[3]](https://github.com/mcs07/ChemSpiPy). ChemIDplus and the Drug Information Portal were retired in 2022 [[4]](https://www.nlm.nih.gov/pubs/techbull/ja22/ja22_pubchem.html).
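
When planning a lookup, it can help to have the table above in a machine-readable form. The following is a minimal sketch that transcribes the table into plain Python sets; it is not part of the package. The mode labels other than `"name"` and `"cas"` and the helper `services_supporting` are assumptions made for illustration, and the `(✅)` entry for cts name search is recorded as supported.

```python
# Capability matrix transcribed from the services table (illustration only,
# not an API of moleculeresolver). Mode labels besides "name" and "cas"
# are hypothetical names chosen for this sketch.
SERVICE_MODES = {
    "cas_registry": {"name", "cas", "smiles", "inchi"},
    "chebi": {"name", "cas", "formula", "smiles", "inchi", "inchikey"},
    "chemeo": {"name", "cas", "smiles", "inchi", "inchikey"},
    "cir": {"name", "cas", "formula", "smiles", "inchi", "inchikey"},
    "comptox": {"name", "cas", "inchikey"},
    "cts": {"name", "cas", "smiles"},  # name search is marked (✅) in the table
    "nist": {"name", "cas", "formula", "smiles"},
    "opsin": {"name"},
    "pubchem": {"name", "cas", "formula", "smiles", "inchi", "inchikey", "cid"},
    "srs": {"name", "cas"},
}

# Services with a ✅ in the "Batch search" column.
BATCH_CAPABLE = {"comptox", "opsin", "pubchem", "srs"}


def services_supporting(mode: str, batch_only: bool = False) -> list[str]:
    """Return the sorted service names that accept the given identifier type."""
    return sorted(
        service
        for service, modes in SERVICE_MODES.items()
        if mode in modes and (not batch_only or service in BATCH_CAPABLE)
    )


if __name__ == "__main__":
    print(services_supporting("cid"))
    print(services_supporting("name", batch_only=True))
```

A lookup table like this makes it easy to see, for example, that only one service in the list accepts a CID, while name searches can be sent to every service.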
## 🚀 Usage
### Initialization
To use **Molecule Resolver**, first import and initialize the `MoleculeResolver` class. It is intended to be used as a context manager:
```python
from moleculeresolver import MoleculeResolver

# Only chemeo currently requires an API key.
with MoleculeResolver(available_service_API_keys={"chemeo": "YOUR_API_KEY"}) as mr:
    ...
```
### Retrieve and Compare Molecules by Name and CAS
Retrieve a molecule using both its common name and CAS number, then compare the two to ensure they represent the same structure:
```python
from rdkit import Chem
from moleculeresolver import MoleculeResolver

with MoleculeResolver(available_service_API_keys={"chemeo": "YOUR_API_KEY"}) as mr:
    # Resolve the same compound once by name and once by CAS number.
    molecule_name = mr.find_single_molecule(["aspirin"], ["name"])
    molecule_cas = mr.find_single_molecule(["50-78-2"], ["cas"])

    # Compare the two results as RDKit molecules.
    are_same = mr.are_equal(Chem.MolFromSmiles(molecule_name.SMILES),
                            Chem.MolFromSmiles(molecule_cas.SMILES))
    print(f"Are the molecules the same? {are_same}")
```
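
Before sending a CAS number to any web service, it may be worth checking locally that it is well formed. The sketch below validates the standard CAS check digit (the last digit equals the weighted sum of the preceding digits, counted from the right, modulo 10). The helper `is_valid_cas` is a hypothetical name and is not part of the package; whether MoleculeResolver performs a similar check internally is not stated here.

```python
import re

# A CAS Registry Number has the form NNNNNNN-NN-N (2 to 7 digits, 2 digits, 1 check digit).
_CAS_PATTERN = re.compile(r"^(\d{2,7})-(\d{2})-(\d)$")


def is_valid_cas(cas: str) -> bool:
    """Return True if the string is a well-formed CAS number with a correct check digit."""
    match = _CAS_PATTERN.match(cas.strip())
    if match is None:
        return False
    digits = match.group(1) + match.group(2)
    check_digit = int(match.group(3))
    # Weight the digits 1, 2, 3, ... starting from the rightmost one.
    total = sum(weight * int(d) for weight, d in enumerate(reversed(digits), start=1))
    return total % 10 == check_digit


if __name__ == "__main__":
    for candidate in ["50-78-2", "50-78-3", "aspirin"]:
        print(candidate, is_valid_cas(candidate))
```

For `50-78-2`, the weighted sum is 8·1 + 7·2 + 0·3 + 5·4 = 42, and 42 mod 10 = 2, which matches the final digit.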
### Parallelized Molecule Retrieval and Saving to JSON
Use the parallelized version to retrieve multiple molecules. If a large number of molecules is searched, moleculeresolver will try to use batch download capabilities whenever the database supports this.
```python
import json
from moleculeresolver import MoleculeResolver

molecule_names = ["aspirin", "propanol", "ibuprofen", "non-existent-name"]
not_found_molecules = []
molecules_dicts = {}

with MoleculeResolver(available_service_API_keys={"chemeo": "YOUR_API_KEY"}) as mr:
    # One list of search modes per identifier; here every identifier is a name.
    molecules = mr.find_multiple_molecules_parallelized(molecule_names, [["name"]] * len(molecule_names))
    for name, molecule in zip(molecule_names, molecules):
        if molecule:
            molecules_dicts[name] = molecule.to_dict(found_molecules='remove')
        else:
            not_found_molecules.append(name)

# Write the resolved molecules to disk and report the names that were not found.
with open("molecules.json", "w") as json_file:
    json.dump(molecules_dicts, json_file, indent=4)

print(f"Molecules not found: {not_found_molecules}")
```
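
The general idea behind parallelized retrieval can be illustrated with the standard library alone. The sketch below is not how MoleculeResolver is implemented; it only shows the pattern of running many independent lookups concurrently while keeping results aligned with the input order. `resolve_all` and `fake_lookup` are hypothetical names, and the lookup is a local stand-in for a web request.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Optional, Sequence


def resolve_all(
    names: Sequence[str],
    lookup: Callable[[str], Optional[str]],
    max_workers: int = 4,
) -> list[Optional[str]]:
    """Run `lookup` for every name in a thread pool; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in the order of the inputs,
        # regardless of which lookup finishes first.
        return list(pool.map(lookup, names))


def fake_lookup(name: str) -> Optional[str]:
    """Stand-in for a web service call: returns a SMILES string or None."""
    known = {"ethanol": "CCO", "methanol": "CO"}
    return known.get(name)


if __name__ == "__main__":
    names = ["ethanol", "non-existent-name", "methanol"]
    for name, smiles in zip(names, resolve_all(names, fake_lookup)):
        print(name, "->", smiles)
```

Because the results come back in input order, they can be paired with the original identifiers using `zip`, the same way the example above pairs `molecule_names` with `molecules`.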
## ⚙️ Configuration
The `MoleculeResolver` class allows users to configure various options like:
- **API Keys**: Set API keys for accessing different molecular databases. Currently only chemeo needs one.
- **Standardization Options**: Choose how to handle molecular standardization (e.g., normalizing functional groups, disconnecting metals, handling isomers).
- **Differentiation Settings**: Options for distinguishing between isomers, tautomers, and isotopes.
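
To avoid hard-coding API keys in scripts, one option is to read them from environment variables and build the dictionary that the examples above pass as `available_service_API_keys`. This is a minimal sketch under assumptions: the helper `load_service_api_keys` and the `MOLRES_API_KEY_` prefix are hypothetical and not defined by the package.

```python
import os
from typing import Mapping, Optional


def load_service_api_keys(
    environ: Optional[Mapping[str, str]] = None,
    prefix: str = "MOLRES_API_KEY_",  # hypothetical naming convention
) -> dict[str, str]:
    """Collect variables such as MOLRES_API_KEY_CHEMEO into {"chemeo": "<key>"}."""
    env = os.environ if environ is None else environ
    return {
        name[len(prefix):].lower(): value
        for name, value in env.items()
        if name.startswith(prefix) and value  # skip unrelated or empty variables
    }


if __name__ == "__main__":
    keys = load_service_api_keys({"MOLRES_API_KEY_CHEMEO": "YOUR_API_KEY", "PATH": "/usr/bin"})
    print(keys)
    # The resulting dictionary could then be passed on, for example:
    # with MoleculeResolver(available_service_API_keys=keys) as mr: ...
```

Accepting an explicit mapping makes the helper easy to test without touching the real environment.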
## 🤝 Contributing
Contributions are welcome! If you have suggestions for improving the Molecule Resolver or want to add new features, feel free to submit an issue or a pull request on GitHub.