drugname-standardizer

Name	drugname-standardizer
Version	1.2.1
home_page	None
Summary	A Python tool for standardizing drug names using the latest FDA's UNII Names list.
upload_time	2025-01-21 13:07:12
maintainer	None
docs_url	None
author	None
requires_python	>=3.12
license	MIT
keywords	drug synonyms standardization fda unii
requirements	pandas requests tqdm

# Drugname Standardizer

The **Drugname Standardizer** is a Python tool for standardizing drug names using [the FDA's official UNII Names List archive](https://precision.fda.gov/uniisearch/archive). It supports both JSON and TSV input files, making it easy to keep drug naming consistent across datasets.

---

## Features

- **A trusted source for drug synonyms**: the package automatically downloads the latest version of the *UNII Names* file from [the official FDA repository](https://precision.fda.gov/uniisearch/archive/latest/UNIIs.zip).
The `UNII_Names.txt` file is saved to the package's `data/` folder for future use. You can also point to another local *UNII Names* file if you prefer a particular version.

- **Parsing of the FDA's UNII Names List to map drug names** (code / official / systematic / common / brand names) **to a single preferred name** (i.e. the *Display Name* of the UNII Names file).

- Input versatility:
   - a single drug name,
   - a list of drug names,
   - a JSON input file (a list of drugs to standardize)
   - a TSV input file (a dataframe containing a column of drugs to standardize)

- Provides both **a Python package interface for scripting** and **a command-line interface (CLI) for direct use**.

- Resolves naming ambiguities in the FDA's UNII Names file by selecting the shortest *Display Name*. Such ambiguities are rare: 55 of 986397 associations in `UNII_Names_20Dec2024.txt`. For example, `PRN1008` has 2 associations, and the ambiguity is resolved by keeping `RILZABRUTINIB`:
   - `PRN1008`	...	... `RILZABRUTINIB, (.ALPHA.E,3S)-`
   - `PRN1008`	...	... `RILZABRUTINIB`  
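
The shortest-*Display Name* rule described above could be sketched as follows. This is a minimal illustration, not the package's actual code: the `resolve_shortest` helper and the sample pairs are hypothetical, and the alphabetical tie-break between equally short names is an assumption.

```python
from collections import defaultdict

# Hypothetical (name, display name) pairs, modelled on the PRN1008 case.
associations = [
    ("PRN1008", "RILZABRUTINIB, (.ALPHA.E,3S)-"),
    ("PRN1008", "RILZABRUTINIB"),
    ("GDC-0199", "VENETOCLAX"),
]


def resolve_shortest(pairs):
    """Map each name to the shortest of its candidate display names."""
    candidates = defaultdict(set)
    for name, display_name in pairs:
        candidates[name].add(display_name)
    # Sorting first makes ties deterministic (alphabetical order);
    # min(..., key=len) then keeps the shortest candidate.
    return {
        name: min(sorted(names), key=len)
        for name, names in candidates.items()
    }


resolved = resolve_shortest(associations)
print(resolved["PRN1008"])  # RILZABRUTINIB
```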

### **Warning:**

Drugs have code, official, systematic, common, and brand names. **Some of these names carry different levels of detail about the compound.**
**The standardization proposed here gathers information at the "upper" level (i.e. the least detailed one).** I relied on the "Preferred Substance Name" (= the *Display Name* field) given in the correspondence table provided by the FDA.  
For instance, both `3'-((1R)-1-((6R)-5,6-DIHYDRO-4-HYDROXY-2-OXO-6-PHENETHYL-6-PROPYL-2H-PYRAN-3-YL)PROPYL)-5-(TRIFLUOROMETHYL)-2-PYRIDINESULFONANILIDE` (systematic name) and `Aptivus` (brand name) become `TIPRANAVIR`.

---

## Usage

### Python API

You can use the package programmatically in your Python scripts:

```python
from drugname_standardizer import standardize
```

#### Examples:

**- Get the preferred name for a specific drug:**
```python
drug_name = "GDC-0199"
preferred_name = standardize(drug_name)
print(preferred_name)  # Outputs: VENETOCLAX
```

**- Standardize a list of drugs:**
```python
drug_names = ["GDC-0199", "Aptivus", "diodrast"]
preferred_names = standardize(drug_names)
print(preferred_names)  # Outputs: ['VENETOCLAX', 'TIPRANAVIR', 'IODOPYRACET']
```

**- Standardize a JSON file:**
```python
standardize(
    input_file="drugs.json",
    output_file="standardized_drugs.json",
    file_type="json"
)
# Outputs: Standardized JSON file saved as standardized_drugs.json
```

**- Standardize a TSV file:**
```python
standardize(
    input_file="dataset.tsv",
    file_type="tsv",
    column_drug=0
)
# Outputs: Standardized TSV file saved as dataset_drug_standardized.tsv
```
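
To make the TSV case more concrete, here is a rough sketch of what updating one column amounts to. It is not the package's implementation: it uses the standard-library `csv` module rather than pandas so that it runs on its own, and the `standardize_column` helper, the `PREFERRED` table, the header-row handling, the case-insensitive lookup, and the choice to leave unknown names unchanged are all assumptions made for illustration.

```python
import csv
import io

# Hypothetical synonym table; the package builds its own from the UNII Names file.
PREFERRED = {"GDC-0199": "VENETOCLAX", "APTIVUS": "TIPRANAVIR"}


def standardize_column(tsv_text, column, mapping, sep="\t"):
    """Return tsv_text with the 0-based column mapped to preferred names."""
    rows = list(csv.reader(io.StringIO(tsv_text), delimiter=sep))
    for row in rows[1:]:  # assumes the first row is a header
        key = row[column].strip().upper()
        # Names missing from the mapping are kept as they are (assumption).
        row[column] = mapping.get(key, row[column])
    out = io.StringIO()
    csv.writer(out, delimiter=sep, lineterminator="\n").writerows(rows)
    return out.getvalue()


sample = "drug\tdose\nGDC-0199\t10\nAptivus\t5\nunknown-x\t1\n"
result = standardize_column(sample, column=0, mapping=PREFERRED)
print(result)
```

Only the selected column changes; every other column is written back untouched.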

### Command-Line Interface

You can also use the CLI to standardize a single drug name or a JSON/TSV file.

* Required arguments:
    - `--input`, `-i`: **A drug name or the path to a JSON/TSV file**
* Optional arguments:
  - `--file_type`, `-f`: **Type of the input file** (`json` or `tsv`)
  - `--output`, `-o`: **The output file name** (a relative path can be given). Default: the input file name with `_drug_standardized` added before the extension.
  - `--column_drug`, `-c`: **Index of the column containing the drug names to standardize** (required for TSV files). Starts at 0: 1st column = column 0.
  - `--separator`, `-s`: **Field separator for TSV files**. Default: `\t`.
  - `--unii_file`, `-u`: **Path to a UNII Names List file**. Default: automatic download of the latest version.
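
The default output name described for `--output` can be reproduced with `pathlib`. This is only a sketch of the naming rule as documented; the `default_output_path` helper is hypothetical and not part of the package.

```python
from pathlib import Path


def default_output_path(input_path):
    """Insert '_drug_standardized' between the file stem and its extension."""
    path = Path(input_path)
    return path.with_name(f"{path.stem}_drug_standardized{path.suffix}")


print(default_output_path("dataset.tsv"))  # dataset_drug_standardized.tsv
```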

#### Examples:

**- Get the preferred name for a specific drug:**
```bash
drugname_standardizer -i "DynaCirc"
```

**- Standardize a JSON file:**
```bash
drugname_standardizer -i drugs.json -f json
```

**- Standardize a TSV file:**
e.g., using a comma as separator and a custom file name for the output:
```bash
drugname_standardizer -i dataset.tsv -f tsv -c 2 -s "," -o standardized_dataset.tsv
```

---

## Installation

### Using pip

```bash
python3 -m pip install drugname_standardizer
```

### GitHub repository

```bash
git clone https://github.com/StephanieChevalier/drugname_standardizer.git
cd drugname_standardizer
pip install -r requirements.txt
```

### Requirements:

- Python 3.12+
- Dependencies:
  - `pandas >= 2.2.2`
  - `requests >= 2.32.2`
  - `tqdm >= 4.66.4`

---

## How it works

1. Parse the UNII file:
    - Reads the UNII Names List to create a mapping of drug names to the *Display Name* (i.e. the preferred name).
    - Resolves potential naming conflicts by selecting the shortest *Display Name* (55 / 986397 associations).

2. Standardize names:
    - For a single drug name: returns the preferred name.
    - For a list of drug names: maps drug names to their preferred names and returns the updated list.
    - For JSON input: maps drug names to their preferred names and saves the results to a JSON file.
    - For TSV input: updates the specified column with standardized drug names and saves the modified DataFrame to a TSV file.
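
The two steps above can be sketched end to end on a tiny in-memory sample. This is an illustrative approximation, not the code in `standardizer.py`: the column headers (`Name`, `TYPE`, `UNII`, `Display Name`) are an assumption about the file layout, the `-` values are placeholders, and `parse_names` and `standardize_names` are hypothetical helpers. Case-insensitive matching and returning unknown names unchanged are also assumptions.

```python
import csv
import io

# Tiny stand-in for the UNII Names file. Column headers are assumed;
# "-" marks placeholder values for the columns not used here.
SAMPLE = (
    "Name\tTYPE\tUNII\tDisplay Name\n"
    "GDC-0199\t-\t-\tVENETOCLAX\n"
    "APTIVUS\t-\t-\tTIPRANAVIR\n"
    "PRN1008\t-\t-\tRILZABRUTINIB, (.ALPHA.E,3S)-\n"
    "PRN1008\t-\t-\tRILZABRUTINIB\n"
)


def parse_names(text):
    """Step 1: build an upper-cased name -> display name dictionary."""
    mapping = {}
    for row in csv.DictReader(io.StringIO(text), delimiter="\t"):
        name = row["Name"].strip().upper()
        display = row["Display Name"].strip()
        # On a conflict, keep the shortest display name.
        if name not in mapping or len(display) < len(mapping[name]):
            mapping[name] = display
    return mapping


def standardize_names(names, mapping):
    """Step 2: accept one name or a list of names."""
    if isinstance(names, str):
        # Unknown names are returned unchanged (assumption).
        return mapping.get(names.strip().upper(), names)
    return [standardize_names(name, mapping) for name in names]


mapping = parse_names(SAMPLE)
print(standardize_names("gdc-0199", mapping))                  # VENETOCLAX
print(standardize_names(["Aptivus", "not-a-drug"], mapping))   # ['TIPRANAVIR', 'not-a-drug']
```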

---

## Package structure
```
drugname_standardizer/
├── drugname_standardizer/
│   ├── __init__.py               # Package initialization
│   ├── standardizer.py           # Core logic for name standardization
│   └── data/
│       ├── UNII_Names.txt  # UNII Names List file (ensured to be no older than 1 month when available)
│       └── UNII_dict.pkl   # parsed UNII Names List
├── tests/
│   ├── __init__.py               
│   └── test_standardizer.py      # Unit tests for the package
├── LICENSE                       # MIT License
├── pyproject.toml                # Package configuration
├── README.md                     # Project documentation
└── requirements.txt              # Development dependencies
```
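
The note that `UNII_Names.txt` is kept no older than 1 month suggests a freshness check before reuse. One plausible way to write such a check is shown below; the `is_stale` helper, the 30-day reading of "1 month", and the use of the file's modification time are assumptions, not a description of the package's code.

```python
import os
import tempfile
import time


def is_stale(path, max_age_days=30):
    """True if the file is missing or was modified more than max_age_days ago."""
    if not os.path.exists(path):
        return True
    age_seconds = time.time() - os.path.getmtime(path)
    return age_seconds > max_age_days * 24 * 60 * 60


with tempfile.TemporaryDirectory() as tmp:
    cached = os.path.join(tmp, "UNII_Names.txt")
    print(is_stale(cached))  # True: no file yet
    open(cached, "w").close()
    print(is_stale(cached))  # False: freshly written
    old = time.time() - 45 * 24 * 60 * 60
    os.utime(cached, (old, old))  # back-date the file by 45 days
    print(is_stale(cached))  # True: older than 30 days
```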

---

## License

This project is licensed under the MIT License - see the [LICENSE](https://github.com/StephanieChevalier/drugname_standardizer/blob/main/LICENSE) file for details.

Raw data

            {
    "_id": null,
    "home_page": null,
    "name": "drugname-standardizer",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.12",
    "maintainer_email": null,
    "keywords": "drug, synonyms, standardization, FDA, UNII",
    "author": null,
    "author_email": "St\u00e9phanie Chevalier <pro.stephaniechevalier@gmail.com>",
    "download_url": "https://files.pythonhosted.org/packages/c7/ee/615611560d0955476a9454b06581cfd85cce0de753e2e7861da3b4131379/drugname_standardizer-1.2.1.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "A Python tool for standardizing drug names using the latest FDA's UNII Names list.",
    "version": "1.2.1",
    "project_urls": {
        "Homepage": "https://github.com/StephanieChevalier/drugname_standardizer"
    },
    "split_keywords": [
        "drug",
        " synonyms",
        " standardization",
        " fda",
        " unii"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "6de54316396498e9e2a3754b1b4f3dda62853b4a0252d1f4565838cf7be7272e",
                "md5": "8380de1dc2f3d316ade237e01ec3a5e4",
                "sha256": "d6944fe3538ebcc5a796c13a395ec55843753bdc13199e6bffd5ab805bc9365e"
            },
            "downloads": -1,
            "filename": "drugname_standardizer-1.2.1-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "8380de1dc2f3d316ade237e01ec3a5e4",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.12",
            "size": 26161222,
            "upload_time": "2025-01-21T13:07:04",
            "upload_time_iso_8601": "2025-01-21T13:07:04.824317Z",
            "url": "https://files.pythonhosted.org/packages/6d/e5/4316396498e9e2a3754b1b4f3dda62853b4a0252d1f4565838cf7be7272e/drugname_standardizer-1.2.1-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "c7ee615611560d0955476a9454b06581cfd85cce0de753e2e7861da3b4131379",
                "md5": "d87fa56b57a82a5459df59b4f6ec018f",
                "sha256": "de28bc6fa56231dac20b5ed0496103f55a3715bf4ca40e12843365aff66235c9"
            },
            "downloads": -1,
            "filename": "drugname_standardizer-1.2.1.tar.gz",
            "has_sig": false,
            "md5_digest": "d87fa56b57a82a5459df59b4f6ec018f",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.12",
            "size": 26109286,
            "upload_time": "2025-01-21T13:07:12",
            "upload_time_iso_8601": "2025-01-21T13:07:12.827406Z",
            "url": "https://files.pythonhosted.org/packages/c7/ee/615611560d0955476a9454b06581cfd85cce0de753e2e7861da3b4131379/drugname_standardizer-1.2.1.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-01-21 13:07:12",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "StephanieChevalier",
    "github_project": "drugname_standardizer",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "requirements": [
        {
            "name": "pandas",
            "specs": [
                [
                    ">=",
                    "2.2.2"
                ]
            ]
        },
        {
            "name": "requests",
            "specs": [
                [
                    ">=",
                    "2.32.2"
                ]
            ]
        },
        {
            "name": "tqdm",
            "specs": [
                [
                    ">=",
                    "4.66.4"
                ]
            ]
        }
    ],
    "lcname": "drugname-standardizer"
}
