# ORDerly
🧪 Cleaning chemical reaction data 🧪
🎯 [Condition Prediction Benchmark](https://figshare.com/articles/dataset/ORDerly_chemical_reactions_condition_benchmarks/23298467) 🎯
## Quick Install
Requires Python 3.10 or newer (tested on macOS and Linux)
```pip install orderly```
🤔 What is this?
-----------------
Machine learning has the potential to provide tremendous value to chemistry. However, large amounts of clean, high-quality data are needed to train models.
ORDerly cleans chemical reaction data from the growing [Open Reaction Database (ORD)](https://docs.open-reaction-database.org/en/latest/).
Use ORDerly to:
- Extract and clean your own datasets.
- Access the [ORDerly condition prediction benchmark dataset](https://figshare.com/articles/dataset/ORDerly_chemical_reactions_condition_benchmarks/23298467) for reaction condition prediction.
- Reproduce results from our paper including training a ML model to predict reaction conditions.
<img src="images/abstract_fig.png" alt="Abstract Figure" width="300">
<!-- Section on extracting and cleaning a dataset-->
📖 Extract and clean a dataset
------------------------------
### Download data from ORD
<!-- ```python -m orderly.download.ord```
This will create a folder called ```/data/ord/``` in your current directory, and download the data into ```ord/``` -->
Data in ORD format should be placed in a folder called ```/data/ord/```. You can either use your own data, or the open-source ORD data.
To download the ORD data, follow the instructions in the [ORD repository](https://github.com/open-reaction-database/ord-data) (i.e. download [Git LFS](https://git-lfs.com/) and clone their repository), then place the data within a folder called ```/data/ord/```.
### Extract data from the ORD files
```python -m orderly.extract```
To run ORDerly on your own data and specify the input and output paths:
```python -m orderly.extract --input_path="/data/ord/" --output_path="/data/orderly/"```
This will generate a parquet file for each ORD file.
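To sanity-check the extraction output, you can load the generated parquet files with pandas. The sketch below is not part of ORDerly itself; the directory path is an assumption matching the default ```--output_path``` above:

```python
import pathlib

import pandas as pd


def summarise_extracted(extract_dir):
    """Report (filename, row count) for each extracted parquet file."""
    summary = []
    for pq in sorted(pathlib.Path(extract_dir).glob("*.parquet")):
        df = pd.read_parquet(pq)
        summary.append((pq.name, len(df)))
    return summary


# e.g. summarise_extracted("/data/orderly/")  # path from --output_path above
```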
### Clean the data
```python -m orderly.clean```
This will produce train and test parquet files, along with a .json file recording the arguments used and a .log file recording the operations run.
<!-- Section on downloading the benchmark -->
🚀 Download the condition prediction benchmark dataset
--------------------------------------------------------
Reaction condition prediction is the problem of predicting the conditions written "above the arrow" in a chemical reaction, such as solvents and agents.
<!-- Include image of a reactions -->
There are three options for downloading the benchmark.
1) If you have ORDerly installed, download the benchmark with:
```python -m orderly.download.benchmark```
2) Download the [ORDerly condition prediction benchmark dataset](https://figshare.com/articles/dataset/ORDerly_chemical_reactions_condition_benchmarks/23298467) directly from Figshare.
3) Use the following code to download it without installing ORDerly. Make sure to install the needed dependencies first (shown below).
<details>
<summary>Toggle to see code to download benchmark</summary>
```pip install requests fastparquet pandas```
```python
import pathlib
import zipfile

import pandas as pd
import requests


def download_benchmark(
    benchmark_zip_file="orderly_benchmark.zip",
    benchmark_directory="orderly_benchmark/",
    version=2,
):
    figshare_url = (
        f"https://figshare.com/ndownloader/articles/23298467/versions/{version}"
    )
    print(f"Downloading benchmark from {figshare_url} to {benchmark_zip_file}")
    r = requests.get(figshare_url, allow_redirects=True)
    with open(benchmark_zip_file, "wb") as f:
        f.write(r.content)

    print("Unzipping benchmark")
    benchmark_directory = pathlib.Path(benchmark_directory)
    benchmark_directory.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(benchmark_zip_file, "r") as zip_ref:
        zip_ref.extractall(benchmark_directory)


download_benchmark()
train_df = pd.read_parquet("orderly_benchmark/orderly_benchmark_train.parquet")
test_df = pd.read_parquet("orderly_benchmark/orderly_benchmark_test.parquet")
```
</details>
📋 Reproducing results from paper
------------------------------
To reproduce the results from the paper, clone the repository and use Poetry to install the requirements. Towards the bottom of the Makefile, you will find a comprehensive eight-step procedure to generate all the datasets and reproduce all results presented in the paper.
### Results
We run the condition prediction model on four different datasets, and find that trusting the labelling of the ORD data leads to overly confident test accuracy. We conclude that applying chemical logic to the reaction string is necessary to get a high-quality dataset, and that the best strategy for dealing with rare molecules is to delete reactions where they appear.
Top-3 exact match combination accuracy (\%). Each cell shows: frequency-informed guess // model prediction // AIB\%.
| Dataset | A (labeling; rare->"other") | B (labeling; rare->delete rxn) | C (reaction string; rare->"other") | D (reaction string; rare->delete rxn) |
|--------------------|--------------------------------|---------------------------------|------------------------------------|--------------------------------------|
| Solvents | 47 // 58 // 21% | 50 // 61 // 22% | 23 // 42 // 26% | 24 // 45 // 28% |
| Agents | 54 // 70 // 35% | 58 // 72 // 32% | 19 // 39 // 25% | 21 // 42 // 27% |
| Solvents & Agents | 31 // 44 // 19% | 33 // 47 // 21% | 4 // 21 // 18% | 5 // 24 // 21% |
Here AIB\% is the Average Improvement of the model over the Baseline (i.e. a frequency-informed guess), where $A_m$ is the accuracy of the model and $A_b$ is the accuracy of the baseline:
$`AIB = (A_m - A_b) / (1 - A_b)`$
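As a worked example, the AIB values in the table above follow directly from this formula; e.g. for solvents on dataset A, a baseline of 47% and a model accuracy of 58% give an AIB of roughly 21%:

```python
def aib(model_accuracy, baseline_accuracy):
    """Average Improvement over the Baseline: (A_m - A_b) / (1 - A_b)."""
    return (model_accuracy - baseline_accuracy) / (1 - baseline_accuracy)


print(round(aib(0.58, 0.47) * 100))  # solvents, dataset A → 21
```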
Full API documentation
------------------------
## Extraction
There are two ways to extract data from ORD files: trusting the labelling, or using the reaction string (controlled by the ```trust_labelling``` boolean). Below are all the arguments that can be passed to the extraction script; change them as appropriate:
```python -m orderly.extract --name_contains_substring="uspto" --trust_labelling=False --output_path="data/orderly/uspto_no_trust" --consider_molecule_names=False```
## Cleaning
There are also a number of customisable options for the cleaning step:
```python -m orderly.clean --output_path="data/orderly/datasets_$(dataset_version)/orderly_no_trust_no_map.parquet" --ord_extraction_path="data/orderly/uspto_no_trust/extracted_ords" --molecules_to_remove_path="data/orderly/uspto_no_trust/all_molecule_names.csv" --min_frequency_of_occurrence=100 --map_rare_molecules_to_other=False --set_unresolved_names_to_none_if_mapped_rxn_str_exists_else_del_rxn=True --remove_rxn_with_unresolved_names=False --set_unresolved_names_to_none=False --num_product=1 --num_reactant=2 --num_solv=2 --num_agent=3 --num_cat=0 --num_reag=0 --consistent_yield=True --scramble=True --train_test_split_fraction=0.9```
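To illustrate what ```--scramble=True``` and ```--train_test_split_fraction=0.9``` imply, the sketch below shuffles a toy list of reactions and takes the first 90% as the training set. This is an illustration of the splitting idea only, not ORDerly's actual implementation:

```python
import random


def scramble_and_split(reactions, train_fraction=0.9, seed=0):
    """Shuffle reactions, then split into train/test by fraction (illustrative)."""
    rng = random.Random(seed)
    shuffled = list(reactions)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


train, test = scramble_and_split(range(1000))
print(len(train), len(test))  # → 900 100
```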
A list of solvents (names and SMILES) commonly used in pharmaceutical chemistry can be found at ```orderly/data/solvents.csv```.
## Issues?
Submit an [issue](https://github.com/sustainable-processes/ORDerly/issues) or send an email to dsw46@cam.ac.uk.
## Citing
If you find this project useful, we encourage you to
* Star this repository :star:
<!-- * Cite our [paper](https://chemistry-europe.onlinelibrary.wiley.com/doi/full/10.1002/cmtd.202000051).
```
@article{Felton2021,
author = "Kobi Felton and Jan Rittig and Alexei Lapkin",
title = "{Summit: Benchmarking Machine Learning Methods for Reaction Optimisation}",
year = "2021",
month = "2",
url = "https://chemistry-europe.onlinelibrary.wiley.com/doi/full/10.1002/cmtd.202000051",
journal = "Chemistry Methods"
}
```-->
<!-- ### 2. Run extraction
We can run extraction using: ```poetry run python -m orderly.extract```. Using ```poetry run python -m orderly.extract --help``` will explain the arguments. Certain args must be set such as data paths.
### 3. Run cleaning
We can run cleaning using: ```poetry run python -m orderly.clean```. Using ```poetry run python -m orderly.clean --help``` will explain the arguments. Certain args must be set such as data paths. -->