preon

Name	preon
Version	0.1.2
home_page	https://github.com/ermshaua/preon
Summary	None
upload_time	2025-07-24 12:50:27
maintainer	None
docs_url	None
author	Arik Ermshaus
requires_python	<3.13,>=3.7
license	BSD 3-Clause License Copyright (c) 2021, Arik Ermshaus All rights reserved. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: 1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. 2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. 3. Neither the name of the copyright holder nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
keywords	precision-oncology data-integration text-mining normalization drug-names cancer-types
requirements	daproli jellyfish nltk numpy pandas pronto tqdm lxml

# preon (PREcision Oncology Normalization)
preon is a fuzzy search tool for medical entities.

## Installation

You can install preon from PyPI with pip:
`python -m pip install preon`

## Examples

Let's first import the normalizer and the helper functions for storing and loading EBI drug names with their CHEMBL ids.

```python3
>>> from preon.normalization import PrecisionOncologyNormalizer
>>> from preon.drug import store_ebi_drugs, load_ebi_drugs
```

Please download the <a href="https://www.ebi.ac.uk/chembl/explore/compounds/">EBI compound CSV file</a> and store it as a local resource. This step is only needed when the resource file is first created or later updated.

```python3
>>> store_ebi_drugs("/Users/Username/Downloads/compounds.csv")
```

Next, we can fit the normalizer with the drug names and ids as its reference data.

```python3
>>> drug_names, chembl_ids = load_ebi_drugs()
>>> normalizer = PrecisionOncologyNormalizer().fit(drug_names, chembl_ids)
```

We can now search for drug names and retrieve their CHEMBL ids. Let's search for the cancer drug "Avastin".

```python3
>>> normalizer.query("Avastin")
(['avastin'], [['CHEMBL1201583']], {'match_type': 'exact'})
```

As a result of our query, we get a list of matching normalized drug names (in this case `['avastin']`), a list of associated CHEMBL ids for every returned drug name (`[['CHEMBL1201583']]`), and some meta information about the match (`{'match_type': 'exact'}`). We can also search for multi-token drug names like "Ixabepilone Epothilone B analog" and find CHEMBL ids for the relevant tokens.

```python3
>>> normalizer.query("Ixabepilone Epothilone B analog")
(['ixabepilone'], [['CHEMBL1201752']], {'match_type': 'substring'})
```

We find the relevant drug name `['ixabepilone']`, and preon provides the meta information that the match is based on a substring. By default, preon only looks for a single matching token. It can also look for n-grams if you set the `n_grams` parameter of the query method. Let's take a harder example, say "Isavuconazonium", but misspell it as "Isavuconaconium".

```python3
>>> normalizer.query("Isavuconaconium")
(['isavuconazonium'], [['CHEMBL1183349']], {'match_type': 'partial', 'edit_distance': 0.067})
```
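
As a rough check on the `edit_distance` value in the output above, the following minimal sketch computes a length-normalized Levenshtein distance in plain Python. It is an illustration only: the helper names `levenshtein` and `normalized_distance` are hypothetical, and dividing by the longer string's length is an assumption, not necessarily how preon computes the value internally.

```python
def levenshtein(a: str, b: str) -> int:
    """Dynamic-programming Levenshtein distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion from a
                current[j - 1] + 1,      # insertion into a
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def normalized_distance(query: str, reference: str) -> float:
    """Edit distance divided by the longer string's length (assumed normalization)."""
    longest = max(len(query), len(reference))
    if longest == 0:
        return 0.0
    return levenshtein(query.lower(), reference.lower()) / longest


distance = normalized_distance("Isavuconaconium", "Isavuconazonium")
print(round(distance, 3))  # 0.067: one substitution over 15 characters
print(distance < 0.2)      # True: below the 20% default mentioned in the text
```

Under this assumption, the single `z` to `c` substitution in a 15-character name gives 1/15, which rounds to the reported 0.067.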

preon finds the correct drug "Isavuconazonium" and provides the meta information that it is a partial match with a distance of 7%. By default, it returns drug names with a distance smaller than 20%. To change this, set the `threshold` argument of the query method. If preon cannot normalize the query, it returns `None` and issues a user warning.

```python3
>>> normalizer.query("risolipase en.")
preon/normalization.py:50: UserWarning: Cannot match risolipase en. to reference data. Try changing the partial matching threshold or number of n-grams.
```
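
Because a query yields either a three-part tuple or `None`, downstream code usually needs to handle both cases. The sketch below shows one way to do that, using the result shapes from the examples above as literal values so that it runs without preon installed. The helper `flatten_result` and the `QueryResult` alias are hypothetical names introduced here for illustration.

```python
from typing import Dict, List, Optional, Tuple

# Shape of a query result as shown in the examples: (names, id lists, meta) or None.
QueryResult = Optional[Tuple[List[str], List[List[str]], Dict[str, object]]]


def flatten_result(result: QueryResult) -> List[Tuple[str, str, str]]:
    """Turn a query result into (name, id, match_type) rows; empty list for None."""
    if result is None:
        return []
    names, id_lists, meta = result
    match_type = str(meta.get("match_type", "unknown"))
    # Each name is paired with its own list of ids, so expand one row per id.
    return [
        (name, ref_id, match_type)
        for name, ids in zip(names, id_lists)
        for ref_id in ids
    ]


exact = (['avastin'], [['CHEMBL1201583']], {'match_type': 'exact'})
print(flatten_result(exact))  # [('avastin', 'CHEMBL1201583', 'exact')]
print(flatten_result(None))   # []
```

With a fitted normalizer, the literal tuples would be replaced by the return value of `normalizer.query(...)`.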

For automatic data integrations, warnings can be stored in a log file, see e.g. <a href="https://github.com/ermshaua/preon/blob/main/preon/examples/drug_name_normalization.ipynb">here</a>. In a similar fashion, you can also normalize cancer types or genes. We provide the gold standards with which we test preon. For more detail, see the example <a href="https://github.com/ermshaua/preon/tree/main/preon/examples">notebooks</a>. We also use preon in practice to normalize and integrate medical data in the PREDICT project.
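
One way to route such warnings into a log file is Python's standard `logging.captureWarnings`, sketched below. This is a generic standard-library approach and may differ from what the linked notebook does. The log file name is hypothetical, and the warning is simulated with `warnings.warn` so the snippet runs without preon; in practice the warning would be raised by the query call.

```python
import logging
import os
import tempfile
import warnings

# Hypothetical log location; any writable path works.
log_path = os.path.join(tempfile.mkdtemp(), "normalization.log")

handler = logging.FileHandler(log_path)
handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))

# Redirect warnings.warn(...) calls to the "py.warnings" logger.
logging.captureWarnings(True)
warnings_logger = logging.getLogger("py.warnings")
warnings_logger.addHandler(handler)

# Show every warning, even if the same one was already issued from this location.
warnings.simplefilter("always")

# Simulated stand-in for the warning issued by a failed query.
warnings.warn("Cannot match risolipase en. to reference data.", UserWarning)

# Clean up so later warnings go back to stderr.
warnings_logger.removeHandler(handler)
handler.close()
logging.captureWarnings(False)

with open(log_path) as log_file:
    log_contents = log_file.read()
print(log_contents)
```

Each unmatched query then leaves a timestamped line in the log, which can be reviewed after a batch run.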

## Citation

The preon package is actively maintained and intended for practical use. If you use it in a scientific publication, we would appreciate the following <a href="https://doi.org/10.1093/bioinformatics/btae085" target="_blank">citation</a>:

```
@article{preon2023,
  title={preon: Fast and accurate entity normalization for drug names and cancer types in precision oncology},
  author={Arik Ermshaus and Michael Piechotta and Gina R{\"u}ter and Ulrich Keilholz and Ulf Leser and Manuela Benary},
  journal={Bioinformatics},
  year={2023},
  volume={40}
}
```

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/ermshaua/preon",
    "name": "preon",
    "maintainer": null,
    "docs_url": null,
    "requires_python": "<3.13,>=3.7",
    "maintainer_email": null,
    "keywords": "precision-oncology, data-integration, text-mining, normalization, drug-names, cancer-types",
    "author": "Arik Ermshaus",
    "author_email": "Arik Ermshaus <ermshaua@informatik.hu-berlin.de>",
    "download_url": "https://files.pythonhosted.org/packages/d9/1c/55d56918a316ca973e2bf99b8c5f910940ce7242c3f470b8dc572c0fd062/preon-0.1.2.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "BSD 3-Clause License\n        \n        Copyright (c) 2021, Arik Ermshaus\n        All rights reserved.\n        \n        Redistribution and use in source and binary forms, with or without\n        modification, are permitted provided that the following conditions are met:\n        \n        1. Redistributions of source code must retain the above copyright notice, this\n           list of conditions and the following disclaimer.\n        \n        2. Redistributions in binary form must reproduce the above copyright notice,\n           this list of conditions and the following disclaimer in the documentation\n           and/or other materials provided with the distribution.\n        \n        3. Neither the name of the copyright holder nor the names of its\n           contributors may be used to endorse or promote products derived from\n           this software without specific prior written permission.\n        \n        THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS \"AS IS\"\n        AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE\n        IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE\n        DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE\n        FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL\n        DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR\n        SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER\n        CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,\n        OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE\n        OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.\n        ",
    "summary": null,
    "version": "0.1.2",
    "project_urls": {
        "Homepage": "https://github.com/ermshaua/preon",
        "repository": "https://github.com/ermshaua/preon"
    },
    "split_keywords": [
        "precision-oncology",
        " data-integration",
        " text-mining",
        " normalization",
        " drug-names",
        " cancer-types"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "e84482e919b1d0255c7e4ee1523ff5145ddc253f726a15e52d64c91ffbc8ebe7",
                "md5": "8843fa6a586546ea21be675498b81cf4",
                "sha256": "81412586b820e8f7a8b54750528c98d00efc101bb371997d955578e1f3a976a1"
            },
            "downloads": -1,
            "filename": "preon-0.1.2-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "8843fa6a586546ea21be675498b81cf4",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": "<3.13,>=3.7",
            "size": 12655,
            "upload_time": "2025-07-24T12:50:25",
            "upload_time_iso_8601": "2025-07-24T12:50:25.818345Z",
            "url": "https://files.pythonhosted.org/packages/e8/44/82e919b1d0255c7e4ee1523ff5145ddc253f726a15e52d64c91ffbc8ebe7/preon-0.1.2-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "d91c55d56918a316ca973e2bf99b8c5f910940ce7242c3f470b8dc572c0fd062",
                "md5": "b2edbef8a95a5048c1785d24499b7f19",
                "sha256": "9b91f92d460dcb5524493977b488ce9d08bba7ccfddfb6df4fa5ac0bf0bfc0dd"
            },
            "downloads": -1,
            "filename": "preon-0.1.2.tar.gz",
            "has_sig": false,
            "md5_digest": "b2edbef8a95a5048c1785d24499b7f19",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": "<3.13,>=3.7",
            "size": 13671,
            "upload_time": "2025-07-24T12:50:27",
            "upload_time_iso_8601": "2025-07-24T12:50:27.046877Z",
            "url": "https://files.pythonhosted.org/packages/d9/1c/55d56918a316ca973e2bf99b8c5f910940ce7242c3f470b8dc572c0fd062/preon-0.1.2.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-07-24 12:50:27",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "ermshaua",
    "github_project": "preon",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "requirements": [
        {
            "name": "daproli",
            "specs": [
                [
                    ">=",
                    "0.22"
                ]
            ]
        },
        {
            "name": "jellyfish",
            "specs": [
                [
                    ">=",
                    "0.9.0"
                ]
            ]
        },
        {
            "name": "nltk",
            "specs": [
                [
                    ">=",
                    "3.9"
                ]
            ]
        },
        {
            "name": "numpy",
            "specs": [
                [
                    "<",
                    "2.3.0"
                ],
                [
                    ">=",
                    "1.21.0"
                ]
            ]
        },
        {
            "name": "pandas",
            "specs": [
                [
                    "<",
                    "2.4.0"
                ],
                [
                    ">=",
                    "2.0.0"
                ]
            ]
        },
        {
            "name": "pronto",
            "specs": [
                [
                    ">=",
                    "2.5.0"
                ]
            ]
        },
        {
            "name": "tqdm",
            "specs": [
                [
                    ">=",
                    "4.66.3"
                ]
            ]
        },
        {
            "name": "lxml",
            "specs": [
                [
                    ">=",
                    "4.6.3"
                ]
            ]
        }
    ],
    "lcname": "preon"
}
