# Drugname Standardizer
The **Drugname Standardizer** is a Python tool for standardizing drug names using [the FDA's official UNII Names List archive](https://precision.fda.gov/uniisearch/archive). It supports both JSON and TSV input files, making it easy to keep drug naming consistent across datasets.
---
## Features
- **A trusted source for drug synonyms**: the package automatically downloads the latest version of the *UNII Names* file from [the official FDA repository](https://precision.fda.gov/uniisearch/archive/latest/UNIIs.zip).
  The `UNII_Names.txt` file is saved to the package's `data/` folder for future use. You can also point to another local *UNII Names* file if you prefer a particular version.
- **Parsing of the FDA's UNII Names List to map drug names** (code / official / systematic / common / brand names) **to a single preferred name** (i.e. the *Display Name* of the UNII Names file).
- Input versatility:
  - a single drug name,
  - a list of drug names,
  - a JSON input file (a list of drugs to standardize),
  - a TSV input file (a dataframe containing a column of drugs to standardize).
- Provides both **a Python package interface for scripting** and **a command-line interface (CLI) for direct use**.
- Resolves naming ambiguities in the FDA's UNII Names file by selecting the shortest *Display Name*. Such ambiguities are rare: 55 of 986397 associations in `UNII_Names_20Dec2024.txt`. For example, `PRN1008` has two associations, and the ambiguity is resolved by keeping `RILZABRUTINIB`:
  - `PRN1008` ... ... `RILZABRUTINIB, (.ALPHA.E,3S)-`
  - `PRN1008` ... ... `RILZABRUTINIB`
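
The shortest-name rule described above can be illustrated with a minimal sketch. It is not the package's actual code: the function name `resolve_shortest` and the alphabetical tie-break between equally short names are assumptions made for the example.

```python
from collections import defaultdict


def resolve_shortest(pairs):
    """Map each name to a single Display Name, keeping the shortest candidate.

    `pairs` is an iterable of (name, display_name) tuples. This helper is
    hypothetical and only illustrates the rule described in the README.
    """
    candidates = defaultdict(set)
    for name, display_name in pairs:
        candidates[name].add(display_name)
    # Sorting first makes ties deterministic (assumption: alphabetical order
    # among equally short names); min(..., key=len) then keeps the shortest.
    return {
        name: min(sorted(options), key=len)
        for name, options in candidates.items()
    }


pairs = [
    ("PRN1008", "RILZABRUTINIB, (.ALPHA.E,3S)-"),
    ("PRN1008", "RILZABRUTINIB"),
    ("GDC-0199", "VENETOCLAX"),
]
mapping = resolve_shortest(pairs)
print(mapping["PRN1008"])  # RILZABRUTINIB
```
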
### **Warning:**
Drugs have code, official, systematic, common, and brand names. **Some of these names carry different levels of detail about the compound.**
**The standardization proposed here gathers information at the "upper" level (i.e. the less detailed one).** I relied on the "Preferred Substance Name" (the *Display Name* field) indicated in the correspondence table provided by the FDA.
For instance, both `3'-((1R)-1-((6R)-5,6-DIHYDRO-4-HYDROXY-2-OXO-6-PHENETHYL-6-PROPYL-2H-PYRAN-3-YL)PROPYL)-5-(TRIFLUOROMETHYL)-2-PYRIDINESULFONANILIDE` (systematic name) and `Aptivus` (brand name) become `TIPRANAVIR`.
---
## Usage
### Python API
You can use the package programmatically in your Python scripts:
```python
from drugname_standardizer import standardize
```
#### Examples:
**- Get the preferred name for a specific drug:**
```python
drug_name = "GDC-0199"
preferred_name = standardize(drug_name)
print(preferred_name) # Outputs: VENETOCLAX
```
**- Standardize a list of drugs:**
```python
drug_names = ["GDC-0199", "Aptivus", "diodrast"]
preferred_names = standardize(drug_names)
print(preferred_names) # Outputs: ["VENETOCLAX", "TIPRANAVIR", "IODOPYRACET"]
```
**- Standardize a JSON file:**
```python
standardize(
    input_file="drugs.json",
    output_file="standardized_drugs.json",
    file_type="json",
)
# Outputs: Standardized JSON file saved as standardized_drugs.json
```
**- Standardize a TSV file:**
```python
standardize(
    input_file="dataset.tsv",
    file_type="tsv",
    column_drug=0,
)
# Outputs: Standardized TSV file saved as dataset_drug_standardized.tsv
```
### Command-Line Interface
You can also use the CLI to standardize a single drug name or a JSON/TSV file.
* Required arguments:
  - `--input`, `-i`: **A drug name or the path to a JSON/TSV file**
* Optional arguments:
  - `--file_type`, `-f`: **Type of the input file** (`json` or `tsv`)
  - `--output`, `-o`: **The output file name** (a relative path can be given). Default: the input file name with `_drug_standardized` added before the extension.
  - `--column_drug`, `-c`: **Index of the column containing the drug names to standardize** (required for TSV files). Indexing starts at 0: the 1st column is column 0.
  - `--separator`, `-s`: **Field separator for TSV files**. Default: `\t`.
  - `--unii_file`, `-u`: **Path to a UNII Names List file**. Default: automatic download of the latest version.
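
The default output name used for `--output` (the input file name with `_drug_standardized` added before the extension) can be reproduced with `pathlib`. This is a sketch of the naming rule only; the helper `default_output_path` is a hypothetical name, not a function exposed by the package.

```python
from pathlib import Path


def default_output_path(input_path):
    """Insert `_drug_standardized` before the extension of the input file name."""
    path = Path(input_path)
    # `stem` is the file name without its last suffix, `suffix` is that extension.
    return path.with_name(f"{path.stem}_drug_standardized{path.suffix}")


print(default_output_path("dataset.tsv"))  # dataset_drug_standardized.tsv
```

The directory part of the input path is preserved, so the output lands next to the input file.
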
#### Examples:
**- Get the preferred name for a specific drug:**
```bash
drugname_standardizer -i "DynaCirc"
```
**- Standardize a JSON file:**
```bash
drugname_standardizer -i drugs.json -f json
```
**- Standardize a TSV file:**
For example, using a comma as the separator and a custom output file name:
```bash
drugname_standardizer -i dataset.tsv -f tsv -c 2 -s "," -o standardized_dataset.tsv
```
---
## Installation
### Using pip
```bash
python3 -m pip install drugname_standardizer
```
### GitHub repository
```bash
git clone https://github.com/StephanieChevalier/drugname_standardizer.git
cd drugname_standardizer
pip install -r requirements.txt
```
### Requirements:
- Python 3.12+
- Dependencies:
  - `pandas >= 2.2.2`
  - `requests >= 2.32.2`
  - `tqdm >= 4.66.4`
---
## How it works
1. Parse the UNII file:
   - Reads the UNII Names List to create a mapping of drug names to the *Display Name* (i.e. the preferred name).
   - Resolves potential naming conflicts by selecting the shortest *Display Name* (55 of 986397 associations).
2. Standardize names:
   - For a single drug name: returns the preferred name.
   - For a list of drug names: maps drug names to their preferred names and returns the updated list.
   - For JSON input: maps drug names to their preferred names and saves the results to a JSON file.
   - For TSV input: updates the specified column with standardized drug names and saves the modified DataFrame to a TSV file.
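
The lookup step for a single name or a list of names can be sketched as below. This is an illustration under stated assumptions, not the package's implementation: the function `standardize_names` is hypothetical, and both the case-insensitive matching and the choice to return unknown names unchanged are assumptions.

```python
def standardize_names(names, mapping):
    """Return the preferred name for one name, or for each name in a list.

    `mapping` is a dict from upper-cased drug names to Display Names.
    """
    def lookup(name):
        # Assumptions: matching ignores case and surrounding whitespace,
        # and a name missing from the mapping is returned unchanged.
        return mapping.get(name.strip().upper(), name)

    if isinstance(names, str):
        return lookup(names)
    return [lookup(name) for name in names]


mapping = {"GDC-0199": "VENETOCLAX", "APTIVUS": "TIPRANAVIR"}
print(standardize_names("GDC-0199", mapping))  # VENETOCLAX
print(standardize_names(["Aptivus", "unknown drug"], mapping))  # ['TIPRANAVIR', 'unknown drug']
```
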
---
## Package structure
```
drugname_standardizer/
├── drugname_standardizer/
│   ├── __init__.py             # Package initialization
│   ├── standardizer.py         # Core logic for name standardization
│   └── data/
│       ├── UNII_Names.txt      # UNII Names List file (ensured to be no older than 1 month when available)
│       └── UNII_dict.pkl       # Parsed UNII Names List
├── tests/
│   ├── __init__.py
│   └── test_standardizer.py    # Unit tests for the package
├── LICENSE                     # MIT License
├── pyproject.toml              # Package configuration
├── README.md                   # Project documentation
└── requirements.txt            # Development dependencies
```
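
The tree above notes that `UNII_Names.txt` is kept no older than 1 month when possible. One way such a freshness check could look is sketched here. The helper `needs_refresh`, the 30-day threshold, and the use of the file's modification time are all assumptions; the package may decide differently.

```python
import time
from pathlib import Path

# Assumption: "1 month" is approximated as 30 days.
MAX_AGE_SECONDS = 30 * 24 * 60 * 60


def needs_refresh(path, now=None, max_age=MAX_AGE_SECONDS):
    """Return True if the file is missing or was last modified more than `max_age` seconds ago."""
    path = Path(path)
    if not path.exists():
        return True
    now = time.time() if now is None else now
    return now - path.stat().st_mtime > max_age


if needs_refresh("UNII_Names.txt"):
    print("A fresh UNII Names file would be downloaded here.")
```
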
---
## License
This project is licensed under the MIT License - see the [LICENSE](https://github.com/StephanieChevalier/drugname_standardizer/blob/main/LICENSE) file for details.