datastew


| Field | Value |
| --- | --- |
| Name | datastew |
| Version | 0.4.2 |
| Home page | https://github.com/SCAI-BIO/datastew |
| Summary | Datastew is a python library for intelligent data harmonization using Large Language Model (LLM) vector embeddings. |
| Upload time | 2025-01-06 08:58:26 |
| Maintainer | None |
| Docs URL | None |
| Author | Tim Adams |
| Requires Python | <4,>=3.10 |
| License | Apache-2.0 |
| Keywords | data-harmonization, llm, embeddings, data-steward |
| Requirements | No requirements were recorded. |

# datastew

![tests](https://github.com/SCAI-BIO/datastew/actions/workflows/tests.yml/badge.svg) ![GitHub Release](https://img.shields.io/github/v/release/SCAI-BIO/datastew)

Datastew is a Python library for intelligent data harmonization using Large Language Model (LLM) vector embeddings.

## Installation

```bash
pip install datastew
```

## Usage

### Harmonizing Excel/CSV resources

You can import common data models, terminology sources or data dictionaries for harmonization directly from a
CSV, TSV or Excel file. An example of how to match two separate sets of variable descriptions is shown in
[datastew/scripts/mapping_excel_example.py](datastew/scripts/mapping_excel_example.py):

```python
from datastew.process.parsing import DataDictionarySource
from datastew.process.mapping import map_dictionary_to_dictionary

# variable_field and description_field refer to the corresponding column names in your Excel sheet
source = DataDictionarySource("source.xlsx", variable_field="var", description_field="desc")
target = DataDictionarySource("target.xlsx", variable_field="var", description_field="desc")

df = map_dictionary_to_dictionary(source, target)
df.to_excel("result.xlsx")
```

The resulting file contains the pairwise variable mapping based on the closest similarity for all possible matches
as well as a similarity measure per row.
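
To make the idea of a closest-similarity mapping concrete, here is a minimal sketch in plain Python. It is not datastew's implementation: the helper names `cosine_similarity` and `closest_matches`, the variable names and the three-dimensional toy vectors are all assumptions made for illustration, and cosine similarity is assumed as the similarity measure.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def closest_matches(source, target):
    """For each source variable, return (source name, best target name, similarity)."""
    rows = []
    for s_name, s_vec in source.items():
        best_name, best_sim = max(
            ((t_name, cosine_similarity(s_vec, t_vec)) for t_name, t_vec in target.items()),
            key=lambda pair: pair[1],
        )
        rows.append((s_name, best_name, best_sim))
    return rows


# Hypothetical toy "embeddings"; real models produce hundreds of dimensions.
source_vectors = {"age_years": [0.9, 0.1, 0.0], "sys_bp": [0.0, 0.2, 0.9]}
target_vectors = {"AGE": [1.0, 0.0, 0.0], "BLOOD_PRESSURE": [0.0, 0.0, 1.0]}

for row in closest_matches(source_vectors, target_vectors):
    print(row)
```

Each printed row pairs one source variable with its most similar target variable and the corresponding score, which is roughly the shape of one row in the result file described above.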

By default, map_dictionary_to_dictionary uses the local MPNet model, which may not yield optimal performance. If you
have an OpenAI API key, you can use their embedding API instead. To use your key, create a GPT4Adapter model and pass
it to the function:

```python
from datastew.embedding import GPT4Adapter

embedding_model = GPT4Adapter(key="your_api_key")
df = map_dictionary_to_dictionary(source, target, embedding_model=embedding_model)
```

You can also retrieve embeddings from data dictionaries and visualize them in the form of an interactive scatter plot
to explore semantic neighborhoods:

```python
from datastew.visualisation import plot_embeddings

# Get embedding vectors for your dictionaries
source_embeddings = source.get_embeddings()

# Plot embedding neighborhoods for several dictionaries
plot_embeddings(data_dictionaries=[source, target])
```

### Creating and using stored mappings

A simple example of how to initialize an in-memory database and compute a similarity mapping is shown in
[datastew/scripts/mapping_db_example.py](datastew/scripts/mapping_db_example.py):

```python
from datastew.repository.sqllite import SQLLiteRepository
from datastew.repository.model import Terminology, Concept, Mapping
from datastew.embedding import MPNetAdapter

# omit mode to create a permanent db file instead
repository = SQLLiteRepository(mode="memory")
embedding_model = MPNetAdapter()

terminology = Terminology("snomed CT", "SNOMED")

text1 = "Diabetes mellitus (disorder)"
concept1 = Concept(terminology, text1, "Concept ID: 11893007")
mapping1 = Mapping(concept1, text1, embedding_model.get_embedding(text1))

text2 = "Hypertension (disorder)"
concept2 = Concept(terminology, text2, "Concept ID: 73211009")
mapping2 = Mapping(concept2, text2, embedding_model.get_embedding(text2))

repository.store_all([terminology, concept1, mapping1, concept2, mapping2])

text_to_map = "Sugar sickness"
embedding = embedding_model.get_embedding(text_to_map)
mappings, similarities = repository.get_closest_mappings(embedding, limit=2)
for mapping, similarity in zip(mappings, similarities):
    print(f"Similarity: {similarity} -> {mapping}")
```

Output:

```plaintext
Similarity: 0.47353370635583486 -> Concept ID: 11893007 : Diabetes mellitus (disorder) | Diabetes mellitus (disorder)
Similarity: 0.20031612264852067 -> Concept ID: 73211009 : Hypertension (disorder) | Hypertension (disorder)
```
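
The ranking behind such a result can be pictured as a top-k search over stored vectors. The sketch below is an illustration under stated assumptions, not the repository's actual query logic: `normalize` and `closest` are hypothetical helpers, the vectors are invented three-dimensional stand-ins for real embeddings, the "Asthma (disorder)" entry is added only to show the effect of `limit`, and cosine similarity is assumed as the score.

```python
import heapq
import math


def normalize(vec):
    """Scale a vector to unit length so that a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def closest(query, stored, limit=2):
    """Return the `limit` (label, similarity) pairs most similar to `query`, best first."""
    q = normalize(query)
    scored = (
        (label, sum(a * b for a, b in zip(q, normalize(vec))))
        for label, vec in stored.items()
    )
    return heapq.nlargest(limit, scored, key=lambda pair: pair[1])


# Invented toy vectors; they do not come from any real embedding model.
stored_vectors = {
    "Diabetes mellitus (disorder)": [0.8, 0.1, 0.1],
    "Hypertension (disorder)": [0.1, 0.9, 0.0],
    "Asthma (disorder)": [0.0, 0.1, 0.9],
}
query_vector = [0.7, 0.3, 0.0]  # stands in for the embedding of "Sugar sickness"

for label, similarity in closest(query_vector, stored_vectors, limit=2):
    print(f"Similarity: {similarity:.3f} -> {label}")
```

With `limit=2` only the two best-scoring entries are returned, sorted from most to least similar, which mirrors the ordering of the output shown above.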

You can also import data from file sources (CSV, TSV, XLSX) or from a public API such as OLS. An example script that
downloads SNOMED from EBI OLS and computes embeddings for it can be found in
[datastew/scripts/ols_snomed_retrieval.py](datastew/scripts/ols_snomed_retrieval.py).
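
For file sources, the essential step is reading a variable column and a description column. The following is a rough standard-library sketch of that step, not datastew's own parser: `read_data_dictionary` is a hypothetical helper, and the choice to skip rows without a description is an assumption of this example.

```python
import csv
import io


def read_data_dictionary(handle, variable_field, description_field, delimiter=","):
    """Map each variable name to its description, skipping rows that lack either value."""
    reader = csv.DictReader(handle, delimiter=delimiter)
    entries = {}
    for row in reader:
        variable = (row.get(variable_field) or "").strip()
        description = (row.get(description_field) or "").strip()
        if variable and description:
            entries[variable] = description
    return entries


# An in-memory stand-in for a small CSV file with "var" and "desc" columns.
sample = io.StringIO(
    "var,desc\n"
    "age,Age of the participant in years\n"
    "bmi,Body mass index\n"
    "sex,\n"
)
print(read_data_dictionary(sample, variable_field="var", description_field="desc"))
```

Passing `delimiter="\t"` handles TSV input in the same way. Excel files need a third-party reader and are left out of this sketch.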

### Embedding visualization

You can visualize the embedding space of multiple data dictionary sources with t-SNE plots, using different
language models. An example of how to generate a t-SNE plot is shown in
[datastew/scripts/tsne_visualization.py](datastew/scripts/tsne_visualization.py):

```python
from datastew.embedding import MPNetAdapter
from datastew.process.parsing import DataDictionarySource
from datastew.visualisation import plot_embeddings

# variable_field and description_field refer to the corresponding column names in your Excel sheet
data_dictionary_source_1 = DataDictionarySource(
    "source1.xlsx", variable_field="var", description_field="desc"
)
data_dictionary_source_2 = DataDictionarySource(
    "source2.xlsx", variable_field="var", description_field="desc"
)

mpnet_adapter = MPNetAdapter()
plot_embeddings(
    [data_dictionary_source_1, data_dictionary_source_2], embedding_model=mpnet_adapter
)
```
![t-SNE plot](./docs/tsne_plot.png)

            

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/SCAI-BIO/datastew",
    "name": "datastew",
    "maintainer": null,
    "docs_url": null,
    "requires_python": "<4,>=3.10",
    "maintainer_email": null,
    "keywords": "data-harmonization, LLM, embeddings, data-steward",
    "author": "Tim Adams",
    "author_email": "tim.adams@scai.fraunhofer.de",
    "download_url": "https://files.pythonhosted.org/packages/51/f6/4eb7fd962f10bc2643bce6229d25a5e0fd252a5d80b8eb7844a64c640ddc/datastew-0.4.2.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "Apache-2.0",
    "summary": "Datastew is a python library for intelligent data harmonization using Large Language Model (LLM) vector embeddings.",
    "version": "0.4.2",
    "project_urls": {
        "Documentation": "https://github.com/SCAI-BIO/datastew#readme",
        "Homepage": "https://github.com/SCAI-BIO/datastew",
        "Repository": "https://github.com/SCAI-BIO/datastew",
        "tracker": "https://github.com/SCAI-BIO/datastew/issues"
    },
    "split_keywords": [
        "data-harmonization",
        " llm",
        " embeddings",
        " data-steward"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "7a7ad1cb96be4b16a64e5463ae09c122b513bc4b31117da0d5a9f9a3b30f31fa",
                "md5": "ad3ad884e12504703f7edb5287710b17",
                "sha256": "9ae527f0924c1bda9918ecd1af1dd79e4a3301b03044f5beacaf8e5e0dd9a364"
            },
            "downloads": -1,
            "filename": "datastew-0.4.2-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "ad3ad884e12504703f7edb5287710b17",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": "<4,>=3.10",
            "size": 36045,
            "upload_time": "2025-01-06T08:58:23",
            "upload_time_iso_8601": "2025-01-06T08:58:23.947720Z",
            "url": "https://files.pythonhosted.org/packages/7a/7a/d1cb96be4b16a64e5463ae09c122b513bc4b31117da0d5a9f9a3b30f31fa/datastew-0.4.2-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "51f64eb7fd962f10bc2643bce6229d25a5e0fd252a5d80b8eb7844a64c640ddc",
                "md5": "a78a1297bba5191974ed1e781ca768fd",
                "sha256": "2cdd9382aac18ca65eac771e8fa6ef889561757d665c09ba590526c0e6128fa2"
            },
            "downloads": -1,
            "filename": "datastew-0.4.2.tar.gz",
            "has_sig": false,
            "md5_digest": "a78a1297bba5191974ed1e781ca768fd",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": "<4,>=3.10",
            "size": 28939,
            "upload_time": "2025-01-06T08:58:26",
            "upload_time_iso_8601": "2025-01-06T08:58:26.340172Z",
            "url": "https://files.pythonhosted.org/packages/51/f6/4eb7fd962f10bc2643bce6229d25a5e0fd252a5d80b8eb7844a64c640ddc/datastew-0.4.2.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-01-06 08:58:26",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "SCAI-BIO",
    "github_project": "datastew",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "lcname": "datastew"
}
        