# RXN reaction preprocessing
This repository is devoted to preprocessing chemical reactions: standardization, filtering, etc.
It also includes code for stable train/test/validation splits and data augmentation.
Links:
* [GitHub repository](https://github.com/rxn4chemistry/rxn-reaction-preprocessing)
* [Documentation](https://rxn4chemistry.github.io/rxn-reaction-preprocessing/)
* [PyPI package](https://pypi.org/project/rxn-reaction-preprocessing/)
## System Requirements
This package is supported on all operating systems.
It has been tested on the following systems:
* macOS: Big Sur (11.1)
* Linux: Ubuntu 18.04.4
Python 3.7 or later is required.
## Installation guide
The package can be installed from PyPI:
```bash
pip install "rxn-reaction-preprocessing[rdkit]"
```
You can leave out `[rdkit]` if you prefer to install `rdkit` manually (via Conda or PyPI).
For local development, the package can be installed with:
```bash
pip install -e ".[dev]"
```
## Usage
The following command line scripts are installed with the package.
### rxn-data-pipeline
Wrapper for all other scripts. Allows constructing flexible data pipelines. Entrypoint for Hydra structured configuration.
For an overview of all available configuration parameters and default values, run: `rxn-data-pipeline --cfg job`.
Configuration using YAML (see the file `config.py` for more options and their meaning):
```yaml
defaults:
  - base_config
data:
  path: /tmp/inference/input.csv
  proc_dir: /tmp/rxn-preproc/exp
common:
  sequence:
    # Define which steps and in which order to execute:
    - IMPORT
    - STANDARDIZE
    - PREPROCESS
    - SPLIT
    - TOKENIZE
  fragment_bond: TILDE
preprocess:
  min_products: 0
split:
  split_ratio: 0.05
tokenize:
  input_output_pairs:
    - inp: ${data.proc_dir}/${data.name}.processed.train.csv
      out: ${data.proc_dir}/${data.name}.processed.train
    - inp: ${data.proc_dir}/${data.name}.processed.validation.csv
      out: ${data.proc_dir}/${data.name}.processed.validation
    - inp: ${data.proc_dir}/${data.name}.processed.test.csv
      out: ${data.proc_dir}/${data.name}.processed.test
```
```bash
rxn-data-pipeline --config-dir . --config-name example_config
```
Configuration using command line arguments (example):
```bash
rxn-data-pipeline \
  data.path=/path/to/data/rxns-small.csv \
  data.proc_dir=/path/to/proc/dir \
  common.fragment_bond=TILDE \
  rxn_import.data_format=TXT \
  tokenize.input_output_pairs.0.out=train.txt \
  tokenize.input_output_pairs.1.out=validation.txt \
  tokenize.input_output_pairs.2.out=test.txt
```
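When the pipeline is launched from a Python script rather than a shell, the same `key=value` overrides can be assembled programmatically. The following is a minimal sketch, not part of the package: the helper `build_pipeline_command` is a hypothetical name, and the override values are the placeholder paths from the example above.

```python
import shlex
from typing import Dict, List


def build_pipeline_command(overrides: Dict[str, str]) -> List[str]:
    """Build the argument list for `rxn-data-pipeline` (hypothetical helper)."""
    # Each override is passed as one `dotted.key=value` argument.
    return ["rxn-data-pipeline"] + [f"{key}={value}" for key, value in overrides.items()]


command = build_pipeline_command(
    {
        "data.path": "/path/to/data/rxns-small.csv",
        "data.proc_dir": "/path/to/proc/dir",
        "common.fragment_bond": "TILDE",
        "rxn_import.data_format": "TXT",
        "tokenize.input_output_pairs.0.out": "train.txt",
        "tokenize.input_output_pairs.1.out": "validation.txt",
        "tokenize.input_output_pairs.2.out": "test.txt",
    }
)

# Show the equivalent shell command line.
print(shlex.join(command))

# To actually run it (requires the package to be installed):
# import subprocess
# subprocess.run(command, check=True)
```

Keeping the overrides in a dictionary makes it easy to vary a single parameter across several runs.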
## Note about reading CSV files
Pandas cannot always re-read a CSV file it has written when the data contains Windows carriage returns (`\r`).
For the scripts to work despite this, every `pd.read_csv` call should pass the argument `lineterminator='\n'`.
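The effect can be reproduced with a small in-memory CSV. This is an illustrative sketch only: the column names and values are made up, and it assumes `pandas` is installed with its default C parser.

```python
import io

import pandas as pd

# Two data rows; the second reaction value contains a carriage return.
csv_text = "rxn,source\nCC>>CC,a\nCCO\r>>CCO,b\n"

# Default parsing may treat the lone "\r" as a row terminator and split the row.
default = pd.read_csv(io.StringIO(csv_text))

# With an explicit line terminator, "\r" stays part of the field value.
fixed = pd.read_csv(io.StringIO(csv_text), lineterminator="\n")

print("rows with default settings:", len(default))
print("rows with lineterminator='\\n':", len(fixed))
```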
## Examples
### A pipeline supporting augmentation
The following config, saved as `train-augmentation-config.yaml`, augments the training split:
```yaml
defaults:
  - base_config
data:
  name: pipeline-with-augmentation
  path: /tmp/file-with-reactions.txt
  proc_dir: /tmp/rxn-preprocessing/experiment
common:
  sequence:
    # Define which steps and in which order to execute:
    - IMPORT
    - STANDARDIZE
    - PREPROCESS
    - SPLIT
    - AUGMENT
    - TOKENIZE
  fragment_bond: TILDE
rxn_import:
  data_format: TXT
preprocess:
  min_products: 1
split:
  input_file_path: ${preprocess.output_file_path}
  split_ratio: 0.05
augment:
  input_file_path: ${data.proc_dir}/${data.name}.processed.train.csv
  output_file_path: ${data.proc_dir}/${data.name}.augmented.train.csv
  permutations: 10
  tokenize: false
  random_type: rotated
tokenize:
  input_output_pairs:
    - inp: ${data.proc_dir}/${data.name}.augmented.train.csv
      out: ${data.proc_dir}/${data.name}.augmented.train
      reaction_column_name: rxn_rotated
    - inp: ${data.proc_dir}/${data.name}.processed.validation.csv
      out: ${data.proc_dir}/${data.name}.processed.validation
    - inp: ${data.proc_dir}/${data.name}.processed.test.csv
      out: ${data.proc_dir}/${data.name}.processed.test
```
```bash
rxn-data-pipeline --config-dir . --config-name train-augmentation-config
```
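To see which files the `${...}` references in this config point to, the interpolation can be imitated in a few lines of Python. This is a simplified stand-in for the resolution that the configuration framework performs, not its actual implementation; the function `resolve` is a hypothetical helper, and only the `data.name` and `data.proc_dir` values from the config above are used.

```python
import re
from typing import Any, Dict

# Matches `${some.dotted.key}` and captures the dotted key.
_INTERPOLATION = re.compile(r"\$\{([^}]+)\}")


def resolve(template: str, config: Dict[str, Any]) -> str:
    """Replace each `${a.b}` with the value stored at config["a"]["b"]."""

    def lookup(match: "re.Match[str]") -> str:
        value: Any = config
        for key in match.group(1).split("."):
            value = value[key]
        return str(value)

    return _INTERPOLATION.sub(lookup, template)


config = {
    "data": {
        "name": "pipeline-with-augmentation",
        "proc_dir": "/tmp/rxn-preprocessing/experiment",
    }
}

templates = [
    "${data.proc_dir}/${data.name}.augmented.train.csv",
    "${data.proc_dir}/${data.name}.processed.validation.csv",
    "${data.proc_dir}/${data.name}.processed.test.csv",
]

resolved = [resolve(template, config) for template in templates]
for path in resolved:
    print(path)
```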