splinkclickhouse


- Name: splinkclickhouse
- Version: 0.3.4
- Summary: Clickhouse backend support for Splink
- Upload time: 2024-12-16 10:07:24
- Author: Andrew Bond
- Requires Python: >=3.9
- License: MIT License
- Keywords: data linking, deduplication, entity resolution, fuzzy matching, record linkage

[![pypi](https://img.shields.io/github/v/release/adbond/splinkclickhouse?include_prereleases)](https://pypi.org/project/splinkclickhouse/#history)
[![Downloads](https://static.pepy.tech/badge/splinkclickhouse)](https://pepy.tech/project/splinkclickhouse)

# `splinkclickhouse`

Basic [Clickhouse](https://clickhouse.com/docs/en/intro) support for use as a backend with the data-linkage and deduplication package [Splink](https://moj-analytical-services.github.io/splink/).

Supports a Clickhouse server connected via [clickhouse connect](https://clickhouse.com/docs/en/integrations/python).

Also supports the in-process [chDB](https://clickhouse.com/docs/en/chdb) version if installed with the `chdb` extras.

## Installation

Install from `PyPI` using `pip`:

```sh
# just installs the Clickhouse server dependencies
pip install splinkclickhouse
# or to install with support for chdb:
pip install splinkclickhouse[chdb]
```

Or you can install the package directly from GitHub:

```sh
# Replace with any version you want, or specify a branch after '@'
pip install git+https://github.com/ADBond/splinkclickhouse.git@v0.3.4
```

If instead you are using `conda`, `splinkclickhouse` is available on [conda-forge](https://conda-forge.org/):

```sh
conda install conda-forge::splinkclickhouse
```

Note that the `conda` version will only be able to use [the Clickhouse server functionality](#clickhouse-server) as `chdb` is not currently available within `conda`.

While the package is in early development there may be breaking changes in new versions without warning, although these _should_ only occur in new minor versions.
If you depend on this package, it is recommended to pin a version to avoid any disruption this may cause.

## Use

### Clickhouse server

Import `ClickhouseAPI`, which accepts a `clickhouse_connect` client, configured with attributes relevant for your connection:
```python
import clickhouse_connect
import splink.comparison_library as cl
from splink import Linker, SettingsCreator, block_on, splink_datasets

from splinkclickhouse import ClickhouseAPI

df = splink_datasets.fake_1000

conn_atts = {
    "host": "localhost",
    "port": 8123,
    "username": "splinkognito",
    "password": "splink123!",
}

db_name = "__temp_splink_db"

default_client = clickhouse_connect.get_client(**conn_atts)
default_client.command(f"CREATE DATABASE IF NOT EXISTS {db_name}")
client = clickhouse_connect.get_client(
    **conn_atts,
    database=db_name,
)

db_api = ClickhouseAPI(client)

# can have at most one tf-adjusted comparison, see caveats below
settings = SettingsCreator(
    link_type="dedupe_only",
    comparisons=[
        cl.JaroWinklerAtThresholds("first_name"),
        cl.JaroAtThresholds("surname"),
        cl.DateOfBirthComparison(
            "dob",
            input_is_string=True,
        ),
        cl.DamerauLevenshteinAtThresholds("city").configure(
            term_frequency_adjustments=True
        ),
        cl.JaccardAtThresholds("email"),
    ],
    blocking_rules_to_generate_predictions=[
        block_on("first_name", "dob"),
        block_on("surname"),
    ],
)

linker = Linker(df, settings, db_api=db_api)
```

See [Splink documentation](https://moj-analytical-services.github.io/splink/) for use of the `Linker`.

### `chDB`

To use `chdb` as a Splink backend you must install the `chdb` package.
This is automatically installed if you install with the `chdb` extras (`pip install splinkclickhouse[chdb]`).

Import `ChDBAPI`, which accepts a connection from `chdb.dbapi`:
```python
import splink.comparison_library as cl
from chdb import dbapi
from splink import Linker, SettingsCreator, block_on, splink_datasets

from splinkclickhouse import ChDBAPI

con = dbapi.connect()
db_api = ChDBAPI(con)

df = splink_datasets.fake_1000

settings = SettingsCreator(
    link_type="dedupe_only",
    comparisons=[
        cl.NameComparison("first_name"),
        cl.JaroAtThresholds("surname"),
        cl.DateOfBirthComparison(
            "dob",
            input_is_string=True,
        ),
        cl.DamerauLevenshteinAtThresholds("city").configure(
            term_frequency_adjustments=True
        ),
        cl.EmailComparison("email"),
    ],
    blocking_rules_to_generate_predictions=[
        block_on("first_name", "dob"),
        block_on("surname"),
    ],
)

linker = Linker(df, settings, db_api=db_api)
```

See [Splink documentation](https://moj-analytical-services.github.io/splink/) for use of the `Linker`.

### Comparisons

`splinkclickhouse` is compatible with all of the built-in Splink comparisons and comparison levels in `splink.comparison_library` and `splink.comparison_level_library`.
In addition, `splinkclickhouse` provides a few pre-made extras to leverage Clickhouse-specific functionality.
These can be used in exactly the same way as the native Splink libraries, for example:

```python
import splink.comparison_library as cl
from splink import SettingsCreator

import splinkclickhouse.comparison_library as cl_ch

...
settings = SettingsCreator(
    link_type="dedupe_only",
    comparisons=[
        cl.ExactMatch("name"),
        cl_ch.DistanceInKMAtThresholds(
            "latitude",
            "longitude",
            [10, 50, 100, 200, 500],
        ),
    ],
)
```

or with individual comparison-levels:

```python
import splink.comparison_level_library as cll
import splink.comparison_library as cl
from splink import SettingsCreator

import splinkclickhouse.comparison_level_library as cll_ch

...
settings = SettingsCreator(
    link_type="dedupe_only",
    comparisons=[
        cl.ExactMatch("name"),
        cl.CustomComparison(
            comparison_levels=[
                cll.And(
                    cll.NullLevel("city"),
                    cll.NullLevel("postcode"),
                    cll.Or(cll.NullLevel("latitude"), cll.NullLevel("longitude"))
                ),
                cll.ExactMatch("postcode"),
                cll_ch.DistanceInKMLevel("latitude", "longitude", 5),
                cll_ch.DistanceInKMLevel("latitude", "longitude", 10),
                cll.ExactMatch("city"),
                cll_ch.DistanceInKMLevel("latitude", "longitude", 50),
                cll.ElseLevel(),
            ],
            output_column_name="location",
        ),
    ],
)
```
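
To give a feel for what distance thresholds mean, here is a minimal pure-Python sketch. It assumes the levels bucket record pairs by great-circle distance computed from latitude and longitude; the exact SQL the package generates is not shown in this README. `great_circle_km` and `distance_level` are hypothetical helper names used only for illustration, and the inclusive `<=` boundary is also an assumption.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumption for this sketch


def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine great-circle distance between two points, in kilometres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (
        math.sin(d_phi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def distance_level(distance_km, thresholds_km):
    """Return the smallest threshold the distance falls within, or None.

    None plays the role of the 'else' level: further apart than every threshold.
    """
    for threshold in sorted(thresholds_km):
        if distance_km <= threshold:
            return threshold
    return None


# Approximate coordinates for London and Paris (illustrative data only)
london = (51.5074, -0.1278)
paris = (48.8566, 2.3522)

distance = great_circle_km(*london, *paris)
print(round(distance), distance_level(distance, [10, 50, 100, 200, 500]))
```

Because levels are evaluated in order, a pair is assigned to the first threshold it satisfies, which is why the second example above can interleave exact-match levels between distance levels.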

## Support

If you have difficulties with the package you can [open an issue](https://github.com/ADBond/splinkclickhouse/issues).
You may also [suggest changes by opening a PR](https://github.com/ADBond/splinkclickhouse/pulls), although it may be best to discuss in an issue beforehand.

This package is 'unofficial', in that it is not directly supported by the Splink team. Maintenance / improvements will be done on a 'best effort' basis where resources allow.

## Known issues / caveats

### Datetime parsing

Clickhouse offers several different date formats.
The basic `Date` format cannot handle dates before the Unix epoch (1970-01-01), which makes it unsuitable for many use-cases, such as holding dates of birth.

The parsing function `parseDateTime` (and its variants), which supports custom formats, returns a `DateTime`, which has the same limited range.
In `splinkclickhouse` we use the function `parseDateTime64BestEffortOrNull` so that we can use the extended-range `DateTime64` data type, which supports dates back to 1900-01-01 but does not allow custom date formats. Currently no `DateTime64` equivalent of `parseDateTime` exists.

If you require different behaviour (for instance if you have an unusual date format and know that you do not need dates outside of the `DateTime` range) you will either need to derive a new column in your source data, or construct the relevant SQL expression manually.
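
One way to decide which approach your data needs is to look at the earliest date it contains. The sketch below uses only the lower bounds stated above (1970-01-01 for `Date`/`DateTime`, 1900-01-01 for `DateTime64`). `parsing_strategy` and the labels it returns are hypothetical and not part of the package.

```python
from datetime import date

# Lower bounds as described in this section
DATETIME_MIN = date(1970, 1, 1)    # `Date` and `DateTime`
DATETIME64_MIN = date(1900, 1, 1)  # `DateTime64`


def parsing_strategy(iso_dates):
    """Suggest a date representation based on the earliest 'YYYY-MM-DD' value."""
    earliest = min(date.fromisoformat(d) for d in iso_dates)
    if earliest >= DATETIME_MIN:
        # Custom formats via parseDateTime are an option here
        return "DateTime"
    if earliest >= DATETIME64_MIN:
        # The package default: parseDateTime64BestEffortOrNull
        return "DateTime64"
    # Earlier than 1900: see the 'Extended Dates' section
    return "days_since_epoch"


print(parsing_strategy(["1985-06-01", "1972-11-30"]))
print(parsing_strategy(["1985-06-01", "1931-02-14"]))
print(parsing_strategy(["1885-09-23", "1931-02-14"]))
```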

#### Extended Dates

There is currently no way in Clickhouse to deal directly with date values before 1900. However, `splinkclickhouse` offers some tools to help with this.
It creates a SQL UDF, `days_since_epoch` (which you can opt out of), that converts a date string in `YYYY-MM-DD` format into an integer: the number of days since `1970-01-01`, based on the proleptic Gregorian calendar. This handles dates well outside the range of `DateTime64`.

This can be used with the column expression extension `splinkclickhouse.column_expression.ColumnExpression` via the transform `.parse_date_to_int()`, or with custom versions of the Splink library functions `cll.AbsoluteDateDifferenceLevel`, `cl.AbsoluteDateDifferenceAtThresholds`, and `cl.DateOfBirthComparison`.
These functions can be used with string columns (which will be wrapped in the above parsing function), or integer columns if the conversion via `days_since_epoch` is already done in the data-preparation stage.
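
If you prefer to do the conversion in the data-preparation stage, the same quantity can be computed in Python, whose `datetime.date` also uses the proleptic Gregorian calendar. This is a stand-in for illustration, not the SQL UDF itself; the Python function merely shares its name.

```python
from datetime import date

EPOCH = date(1970, 1, 1)


def days_since_epoch(date_string):
    """Days from 1970-01-01 to a 'YYYY-MM-DD' date (negative if earlier)."""
    return (date.fromisoformat(date_string) - EPOCH).days


print(days_since_epoch("1970-01-02"))  # 1
print(days_since_epoch("1969-12-31"))  # -1
print(days_since_epoch("1900-01-01"))  # -25567

# Dates before 1900 become ordinary integers, so differences are simple subtraction
gap_in_days = abs(days_since_epoch("1850-03-20") - days_since_epoch("1850-03-15"))
print(gap_in_days)  # 5
```

Once dates are integers, an absolute date difference is just an absolute difference of two numbers, which is what allows date comparisons to work outside the `DateTime64` range.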

### `NULL` values in `chdb`

When passing data into `chdb` from pandas or pyarrow tables, `NULL` values in `String` columns are converted into empty strings, instead of remaining `NULL`.

For now this is not handled within the package. You can work around the issue by wrapping column names in `NULLIF`:

```python
import splink.comparison_library as cl

first_name_comparison = cl.DamerauLevenshteinAtThresholds("NULLIF(first_name, '')")
```
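
If several string columns are affected, a small helper keeps the wrapping consistent. This is only a sketch: `nullif_empty` is a hypothetical name, and the column names are taken from the examples earlier in this README.

```python
def nullif_empty(column_name):
    """Wrap a column name so that empty strings are treated as NULL."""
    return f"NULLIF({column_name}, '')"


string_columns = ["first_name", "surname", "city"]
wrapped = {col: nullif_empty(col) for col in string_columns}

print(wrapped["first_name"])  # NULLIF(first_name, '')

# Each wrapped expression can then be passed wherever a column name is expected,
# e.g. cl.DamerauLevenshteinAtThresholds(wrapped["first_name"])
```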

### `ClickhouseAPI` pandas registration

`ClickhouseAPI` allows registration of pandas DataFrames by inferring the types of columns. It currently only does this for string, integer, and float columns, and will always make them `Nullable`.

If you require other data types, or more fine-grained control, it is recommended to import the data into Clickhouse yourself, and then pass the table name (as a string) to the `Linker` instead.
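
As a rough sketch of the manual route, you could build the table definition yourself with explicit column types and run it through the `clickhouse_connect` client. The helper `create_table_sql`, the table name `people`, and the chosen column types are assumptions for illustration; adapt the engine and ordering key to your own setup.

```python
def create_table_sql(table_name, columns, order_by):
    """Build a CREATE TABLE statement from a {column name: Clickhouse type} mapping.

    Names are interpolated as-is, so only use trusted identifiers.
    """
    column_defs = ",\n    ".join(
        f"{name} {ch_type}" for name, ch_type in columns.items()
    )
    return (
        f"CREATE TABLE IF NOT EXISTS {table_name} (\n"
        f"    {column_defs}\n"
        f") ENGINE = MergeTree ORDER BY {order_by}"
    )


ddl = create_table_sql(
    "people",
    {
        "unique_id": "UInt32",                 # non-nullable key
        "first_name": "Nullable(String)",
        "dob": "Nullable(Date32)",             # a type the pandas route does not infer
        "city": "LowCardinality(String)",
    },
    order_by="unique_id",
)
print(ddl)

# With a connected client (see 'Clickhouse server' above), you would then run:
#   client.command(ddl)
# load your data into the table, and pass the table name to the Linker:
#   linker = Linker("people", settings, db_api=db_api)
```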

            
