rxn-reaction-preprocessing

* Name: rxn-reaction-preprocessing
* Version: 2.5.0
* Home page: https://github.com/rxn4chemistry/rxn-reaction-preprocessing
* Summary: Reaction preprocessing tools
* Upload time: 2025-08-13 08:39:43
* Author: IBM RXN team
* Requires Python: >=3.7
* License: MIT

# RXN reaction preprocessing

[![Actions tests](https://github.com/rxn4chemistry/rxn-reaction-preprocessing/actions/workflows/tests.yaml/badge.svg)](https://github.com/rxn4chemistry/rxn-reaction-preprocessing/actions)

This repository provides tools for preprocessing chemical reactions, such as standardization and filtering.
It also includes code for stable train/test/validation splits and data augmentation.

Links:
* [GitHub repository](https://github.com/rxn4chemistry/rxn-reaction-preprocessing)
* [Documentation](https://rxn4chemistry.github.io/rxn-reaction-preprocessing/)
* [PyPI package](https://pypi.org/project/rxn-reaction-preprocessing/)

## System Requirements

This package is supported on all operating systems.
It has been tested on the following systems:
* macOS: Big Sur (11.1)
* Linux: Ubuntu 18.04.4

Python 3.7 or later is required (the package metadata declares `requires_python>=3.7`).

## Installation guide

The package can be installed from PyPI:
```bash
pip install "rxn-reaction-preprocessing[rdkit]"
```
You can leave out `[rdkit]` if you prefer to install `rdkit` manually (via Conda or PyPI).
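
If you install `rdkit` manually, it can help to confirm that it is importable from the same environment before running the pipeline. The following is a minimal sketch using only the standard library; the helper name `is_installed` is made up for illustration and is not part of this package.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be found in the current environment."""
    # find_spec returns None when no module with that name can be located.
    return importlib.util.find_spec(module_name) is not None


if is_installed("rdkit"):
    print("rdkit is available")
else:
    print("rdkit is missing; install it with pip or Conda")
```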

For local development, the package can be installed with:
```bash
pip install -e ".[dev]"
```

## Usage
The following command-line scripts are installed with the package.

### rxn-data-pipeline
`rxn-data-pipeline` wraps all the other scripts and lets you assemble flexible data pipelines. It is the entry point for the Hydra structured configuration.

For an overview of all available configuration parameters and default values, run: `rxn-data-pipeline --cfg job`.

Configuration using a YAML file, here saved as `example_config.yaml` in the current directory (see the file `config.py` for more options and their meaning):
```yaml
defaults:
  - base_config

data:
  path: /tmp/inference/input.csv
  proc_dir: /tmp/rxn-preproc/exp
common:
  sequence:
    # Define which steps and in which order to execute:
    - IMPORT
    - STANDARDIZE
    - PREPROCESS
    - SPLIT
    - TOKENIZE
  fragment_bond: TILDE
preprocess:
  min_products: 0
split:
  split_ratio: 0.05
tokenize:
  input_output_pairs:
    - inp: ${data.proc_dir}/${data.name}.processed.train.csv
      out: ${data.proc_dir}/${data.name}.processed.train
    - inp: ${data.proc_dir}/${data.name}.processed.validation.csv
      out: ${data.proc_dir}/${data.name}.processed.validation
    - inp: ${data.proc_dir}/${data.name}.processed.test.csv
      out: ${data.proc_dir}/${data.name}.processed.test
```
```bash
rxn-data-pipeline --config-dir . --config-name example_config
```

Configuration using command line arguments (example):
```bash
rxn-data-pipeline \
  data.path=/path/to/data/rxns-small.csv \
  data.proc_dir=/path/to/proc/dir \
  common.fragment_bond=TILDE \
  rxn_import.data_format=TXT \
  tokenize.input_output_pairs.0.out=train.txt \
  tokenize.input_output_pairs.1.out=validation.txt \
  tokenize.input_output_pairs.2.out=test.txt
```
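
When the pipeline is launched from a Python script rather than a shell, the same `key=value` overrides can be assembled programmatically. The sketch below only builds and prints the argument list; the helper `build_command` is a hypothetical name introduced here, and the override keys are the ones from the example above.

```python
import shlex
from typing import Dict, List


def build_command(overrides: Dict[str, str]) -> List[str]:
    """Turn a mapping of overrides into an argument list for rxn-data-pipeline."""
    return ["rxn-data-pipeline"] + [
        f"{key}={value}" for key, value in overrides.items()
    ]


# Override keys and values taken from the command-line example above.
overrides = {
    "data.path": "/path/to/data/rxns-small.csv",
    "data.proc_dir": "/path/to/proc/dir",
    "common.fragment_bond": "TILDE",
    "rxn_import.data_format": "TXT",
}

command = build_command(overrides)

# Print a shell-safe version of the command for inspection.
print(" ".join(shlex.quote(arg) for arg in command))

# To actually run it (requires the package to be installed):
# import subprocess
# subprocess.run(command, check=True)
```

Passing the arguments as a list avoids shell quoting problems when paths contain spaces.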

## Note about reading CSV files
Pandas cannot always re-read a CSV file that it wrote itself if the file contains Windows carriage returns (`\r`).
To make the scripts robust to this, every `pd.read_csv` call should pass the argument `lineterminator='\n'`.
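
As a small illustration of the argument in use, the sketch below reads CSV text whose first field contains a bare carriage return. The CSV content and column names are made-up placeholders, not data from this package.

```python
import io

import pandas as pd

# Hypothetical CSV content: the first "rxn" field contains a bare carriage return.
csv_text = "rxn,label\nA\rB,1\nC,2\n"

# With lineterminator="\n", only "\n" ends a row, so "\r" stays inside the field.
df = pd.read_csv(io.StringIO(csv_text), lineterminator="\n")

print(df.shape)
print(df["rxn"].tolist())
```

Note that with this setting a file written with `\r\n` line endings would keep a trailing `\r` in the last column of each row, so the argument fits best when the files are written with `\n` line endings.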

## Examples

### A pipeline supporting augmentation

A config file called `train-augmentation-config.yaml` that supports augmentation of the training split:
```yaml
defaults:
  - base_config

data:
  name: pipeline-with-augmentation
  path: /tmp/file-with-reactions.txt
  proc_dir: /tmp/rxn-preprocessing/experiment
common:
  sequence:
    # Define which steps and in which order to execute:
    - IMPORT
    - STANDARDIZE
    - PREPROCESS
    - SPLIT
    - AUGMENT
    - TOKENIZE
  fragment_bond: TILDE
rxn_import:
  data_format: TXT
preprocess:
  min_products: 1
split:
  input_file_path: ${preprocess.output_file_path}
  split_ratio: 0.05
augment:
  input_file_path: ${data.proc_dir}/${data.name}.processed.train.csv
  output_file_path: ${data.proc_dir}/${data.name}.augmented.train.csv
  permutations: 10
  tokenize: false
  random_type: rotated
tokenize:
  input_output_pairs:
    - inp: ${data.proc_dir}/${data.name}.augmented.train.csv
      out: ${data.proc_dir}/${data.name}.augmented.train
      reaction_column_name: rxn_rotated
    - inp: ${data.proc_dir}/${data.name}.processed.validation.csv
      out: ${data.proc_dir}/${data.name}.processed.validation
    - inp: ${data.proc_dir}/${data.name}.processed.test.csv
      out: ${data.proc_dir}/${data.name}.processed.test
```
```bash
rxn-data-pipeline --config-dir . --config-name train-augmentation-config
```
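
The `${...}` placeholders in the config are interpolations that refer to other config values. To make the resulting file paths concrete, here is a minimal standard-library sketch that mimics the substitution for the `data` values above. It is a simplified stand-in written for illustration, not the resolver that Hydra itself uses, and the helper names `lookup` and `resolve` are hypothetical.

```python
import re
from typing import Any, Dict

# Values copied from the `data` section of the config above.
config: Dict[str, Any] = {
    "data": {
        "name": "pipeline-with-augmentation",
        "proc_dir": "/tmp/rxn-preprocessing/experiment",
    }
}

# Matches placeholders of the form ${some.dotted.key}.
_PLACEHOLDER = re.compile(r"\$\{([^}]+)\}")


def lookup(cfg: Dict[str, Any], dotted_key: str) -> Any:
    """Follow a dotted key such as 'data.name' through nested dictionaries."""
    node: Any = cfg
    for part in dotted_key.split("."):
        node = node[part]
    return node


def resolve(template: str, cfg: Dict[str, Any]) -> str:
    """Replace every ${key} placeholder in the template with its config value."""
    return _PLACEHOLDER.sub(lambda m: str(lookup(cfg, m.group(1))), template)


augmented_train = resolve(
    "${data.proc_dir}/${data.name}.augmented.train.csv", config
)
print(augmented_train)
# prints /tmp/rxn-preprocessing/experiment/pipeline-with-augmentation.augmented.train.csv
```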

            
