[![pypi](https://img.shields.io/github/v/release/adbond/splinkclickhouse?include_prereleases)](https://pypi.org/project/splinkclickhouse/#history)
[![Downloads](https://static.pepy.tech/badge/splinkclickhouse)](https://pepy.tech/project/splinkclickhouse)
# `splinkclickhouse`
Basic [Clickhouse](https://clickhouse.com/docs/en/intro) support for use as a backend with the data-linkage and deduplication package [Splink](https://moj-analytical-services.github.io/splink/).
Supports a Clickhouse server, connected via [clickhouse connect](https://clickhouse.com/docs/en/integrations/python).
Also supports the in-process [chDB](https://clickhouse.com/docs/en/chdb) version, if installed with the `chdb` extras.
## Installation
Install from `PyPI` using `pip`:
```sh
# just installs the Clickhouse server dependencies
pip install splinkclickhouse
# or to install with support for chdb:
pip install splinkclickhouse[chdb]
```
or you can install the package directly from GitHub:
```sh
# Replace with any version you want, or specify a branch after '@'
pip install git+https://github.com/ADBond/splinkclickhouse.git@v0.3.4
```
If instead you are using `conda`, `splinkclickhouse` is available on [conda-forge](https://conda-forge.org/):
```sh
conda install conda-forge::splinkclickhouse
```
Note that the `conda` version will only be able to use [the Clickhouse server functionality](#clickhouse-server) as `chdb` is not currently available within `conda`.
While the package is in early development there may be breaking changes in new versions without warning, although these _should_ only occur in new minor versions.
Nevertheless, if you depend on this package it is recommended to pin a version to avoid any disruption that this may cause.
## Use
### Clickhouse server
Import `ClickhouseAPI`, which accepts a `clickhouse_connect` client, configured with attributes relevant for your connection:
```python
import clickhouse_connect
import splink.comparison_library as cl
from splink import Linker, SettingsCreator, block_on, splink_datasets

from splinkclickhouse import ClickhouseAPI

df = splink_datasets.fake_1000

conn_atts = {
    "host": "localhost",
    "port": 8123,
    "username": "splinkognito",
    "password": "splink123!",
}

db_name = "__temp_splink_db"

default_client = clickhouse_connect.get_client(**conn_atts)
default_client.command(f"CREATE DATABASE IF NOT EXISTS {db_name}")
client = clickhouse_connect.get_client(
    **conn_atts,
    database=db_name,
)

db_api = ClickhouseAPI(client)

# can have at most one tf-adjusted comparison, see caveats below
settings = SettingsCreator(
    link_type="dedupe_only",
    comparisons=[
        cl.JaroWinklerAtThresholds("first_name"),
        cl.JaroAtThresholds("surname"),
        cl.DateOfBirthComparison(
            "dob",
            input_is_string=True,
        ),
        cl.DamerauLevenshteinAtThresholds("city").configure(
            term_frequency_adjustments=True
        ),
        cl.JaccardAtThresholds("email"),
    ],
    blocking_rules_to_generate_predictions=[
        block_on("first_name", "dob"),
        block_on("surname"),
    ],
)

linker = Linker(df, settings, db_api=db_api)
```
See [Splink documentation](https://moj-analytical-services.github.io/splink/) for use of the `Linker`.
### `chDB`
To use `chdb` as a Splink backend you must install the `chdb` package.
This is automatically installed if you install with the `chdb` extras (`pip install splinkclickhouse[chdb]`).
Import `ChDBAPI`, which accepts a connection from `chdb.dbapi`:
```python
import splink.comparison_library as cl
from chdb import dbapi
from splink import Linker, SettingsCreator, block_on, splink_datasets

from splinkclickhouse import ChDBAPI

con = dbapi.connect()
db_api = ChDBAPI(con)

df = splink_datasets.fake_1000

settings = SettingsCreator(
    link_type="dedupe_only",
    comparisons=[
        cl.NameComparison("first_name"),
        cl.JaroAtThresholds("surname"),
        cl.DateOfBirthComparison(
            "dob",
            input_is_string=True,
        ),
        cl.DamerauLevenshteinAtThresholds("city").configure(
            term_frequency_adjustments=True
        ),
        cl.EmailComparison("email"),
    ],
    blocking_rules_to_generate_predictions=[
        block_on("first_name", "dob"),
        block_on("surname"),
    ],
)

linker = Linker(df, settings, db_api=db_api)
```
See [Splink documentation](https://moj-analytical-services.github.io/splink/) for use of the `Linker`.
### Comparisons
`splinkclickhouse` is compatible with all of Splink's in-built comparisons and comparison levels in `splink.comparison_library` and `splink.comparison_level_library`.
In addition, `splinkclickhouse` provides a few pre-made extras to leverage Clickhouse-specific functionality.
These can be used in exactly the same way as the native Splink libraries, for example:
```python
import splink.comparison_library as cl
from splink import SettingsCreator

import splinkclickhouse.comparison_library as cl_ch

...
settings = SettingsCreator(
    link_type="dedupe_only",
    comparisons=[
        cl.ExactMatch("name"),
        cl_ch.DistanceInKMAtThresholds(
            "latitude",
            "longitude",
            [10, 50, 100, 200, 500],
        ),
    ],
)
```
or with individual comparison-levels:
```python
import splink.comparison_level_library as cll
import splink.comparison_library as cl
from splink import SettingsCreator

import splinkclickhouse.comparison_level_library as cll_ch

...
settings = SettingsCreator(
    link_type="dedupe_only",
    comparisons=[
        cl.ExactMatch("name"),
        cl.CustomComparison(
            comparison_levels=[
                cll.And(
                    cll.NullLevel("city"),
                    cll.NullLevel("postcode"),
                    cll.Or(cll.NullLevel("latitude"), cll.NullLevel("longitude")),
                ),
                cll.ExactMatch("postcode"),
                cll_ch.DistanceInKMLevel("latitude", "longitude", 5),
                cll_ch.DistanceInKMLevel("latitude", "longitude", 10),
                cll.ExactMatch("city"),
                cll_ch.DistanceInKMLevel("latitude", "longitude", 50),
                cll.ElseLevel(),
            ],
            output_column_name="location",
        ),
    ],
)
```
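To build intuition for what the kilometre thresholds mean, the following is a minimal pure-Python sketch of a great-circle (haversine) distance, together with a helper that picks the first threshold a pair of points falls within. It is an illustration only: the function names, the Earth radius constant, and the `<=` boundary are assumptions, and this is not necessarily how `splinkclickhouse` or Clickhouse computes the distance internally.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumption for this sketch


def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine distance in km between two (latitude, longitude) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (
        math.sin(dphi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    )
    # clamp to guard against floating-point values marginally above 1
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))


def first_matching_threshold(distance_km, thresholds=(5, 10, 50)):
    """Return the smallest threshold the distance falls within, or None.

    Mirrors the idea that comparison levels are checked in order and the
    first one that matches is used. Only the distance levels are modelled here.
    """
    for threshold in sorted(thresholds):
        if distance_km <= threshold:
            return threshold
    return None


# London to Paris, using approximate city-centre coordinates
london_paris_km = great_circle_km(51.5074, -0.1278, 48.8566, 2.3522)
band = first_matching_threshold(london_paris_km, (10, 50, 100, 200, 500))
```

Because levels are evaluated in order, tighter thresholds should be listed before looser ones, as in the custom comparison above.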
## Support
If you have difficulties with the package you can [open an issue](https://github.com/ADBond/splinkclickhouse/issues).
You may also [suggest changes by opening a PR](https://github.com/ADBond/splinkclickhouse/pulls), although it may be best to discuss in an issue beforehand.
This package is 'unofficial', in that it is not directly supported by the Splink team. Maintenance / improvements will be done on a 'best effort' basis where resources allow.
## Known issues / caveats
### Datetime parsing
Clickhouse offers several different date formats.
The basic `Date` format cannot handle dates before the Unix epoch (1970-01-01), which makes it unsuitable for many use-cases involving dates of birth.
The parsing function `parseDateTime` (and its variants), which supports custom formats, returns a `DateTime`, which has the same limited range.
In `splinkclickhouse` we use the function `parseDateTime64BestEffortOrNull` so that we can use the extended-range `DateTime64` data type, which supports dates back to 1900-01-01, but does not allow custom date formats. Currently no `DateTime64` equivalent of `parseDateTime` exists.
If you require different behaviour (for instance if you have an unusual date format and know that you do not need dates outside of the `DateTime` range) you will either need to derive a new column in your source data, or construct the relevant SQL expression manually.
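As one way of deriving a new column in your source data, the sketch below normalises an unusual date format to ISO `YYYY-MM-DD` strings in Python before the data is loaded. The helper name `to_iso_date` and the `%d/%m/%Y` input format are assumptions for illustration; adapt the format string to your own data.

```python
from datetime import datetime


def to_iso_date(raw, fmt="%d/%m/%Y"):
    """Convert a date string in a known custom format to 'YYYY-MM-DD'.

    Returns None for missing, blank, or unparseable values, so that they
    can be loaded as NULL rather than as a misleading date.
    """
    if raw is None or not raw.strip():
        return None
    try:
        return datetime.strptime(raw.strip(), fmt).date().isoformat()
    except ValueError:
        return None


raw_dobs = ["15/06/1985", " 01/02/1932 ", "", "31/02/2000", None]
iso_dobs = [to_iso_date(value) for value in raw_dobs]
```

The resulting column can then be treated like any other ISO-formatted date string column.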
#### Extended Dates
There is not currently a way in Clickhouse to deal directly with date values before 1900. However, `splinkclickhouse` offers some tools to help with this.
It creates a SQL UDF `days_since_epoch` (which can be opted out of) that converts a date string (in `YYYY-MM-DD` format) into an integer representing the number of days since `1970-01-01`. The conversion is based on the proleptic Gregorian calendar, so it can handle dates well outside the range of `DateTime64`.
This can be used with the column expression extension `splinkclickhouse.column_expression.ColumnExpression` via the transform `.parse_date_to_int()`, or by using the custom versions of the Splink library functions `cll.AbsoluteDateDifferenceLevel`, `cl.AbsoluteDateDifferenceAtThresholds`, and `cl.DateOfBirthComparison`.
These functions can be used with string columns (which will be wrapped in the above parsing function), or integer columns if the conversion via `days_since_epoch` is already done in the data-preparation stage.
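If you would rather do the conversion in the data-preparation stage, the following is a rough Python equivalent of the idea behind `days_since_epoch`. It is a sketch rather than the package's implementation: the Python function is hypothetical, and it relies on the fact that Python's `datetime.date` also uses the proleptic Gregorian calendar (limited to years 1 to 9999).

```python
from datetime import date

UNIX_EPOCH = date(1970, 1, 1)


def days_since_epoch(date_string):
    """Days between 1970-01-01 and a 'YYYY-MM-DD' date string.

    Dates before the epoch give negative values. Returns None for missing
    or malformed input.
    """
    if not date_string:
        return None
    try:
        parsed = date.fromisoformat(date_string)
    except ValueError:
        return None
    return (parsed - UNIX_EPOCH).days


dobs = ["1970-01-01", "1969-12-31", "1900-01-01", "1850-06-15", "not a date"]
dob_days = [days_since_epoch(dob) for dob in dobs]
```

An integer column produced this way can be differenced directly, which is what makes date-difference comparisons possible for dates before 1900.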
### `NULL` values in `chdb`
When passing data into `chdb` from pandas or pyarrow tables, `NULL` values in `String` columns are converted into empty strings, instead of remaining `NULL`.
For now this is not handled within the package. You can work around the issue by wrapping column names in `NULLIF`:
```python
import splink.comparison_library as cl

first_name_comparison = cl.DamerauLevenshteinAtThresholds("NULLIF(first_name, '')")
```
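To see why this works, here is a small demonstration of the `NULLIF` semantics. It uses Python's built-in `sqlite3` module purely as a stand-in SQL engine (an assumption made so that the example runs without `chdb` installed); the table and data are made up for illustration.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE people (first_name TEXT)")
# the empty string stands in for a NULL that was converted on the way in
con.executemany("INSERT INTO people VALUES (?)", [("alice",), ("",)])

# NULLIF(x, '') returns NULL when x is the empty string, and x otherwise
rows = con.execute(
    "SELECT NULLIF(first_name, '') FROM people ORDER BY rowid"
).fetchall()
con.close()
```

Here the second row comes back as `None`, so a null level in a comparison would treat it as missing rather than as a genuine empty value. Note that this also turns any legitimately empty strings into `NULL`.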
### `ClickhouseAPI` pandas registration
`ClickhouseAPI` will allow registration of pandas dataframes, by inferring the types of columns. It currently only does this for string, integer, and float columns, and will always make them `Nullable`.
If you require other data types, or more fine-grained control, it is recommended to import the data into Clickhouse yourself, and then pass the table name (as a string) to the `Linker` instead.
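As a rough sketch of that approach, the hypothetical helper below builds a `CREATE TABLE` statement with explicitly chosen column types. The helper, table name, column names, and choice of table engine are all assumptions for illustration, and the later steps are shown only as comments because they need a running Clickhouse server.

```python
def create_table_sql(table_name, column_types):
    """Build a Clickhouse CREATE TABLE statement from a name -> type mapping."""
    column_defs = ",\n    ".join(
        f"`{name}` {ch_type}" for name, ch_type in column_types.items()
    )
    return (
        f"CREATE TABLE IF NOT EXISTS {table_name} (\n"
        f"    {column_defs}\n"
        ") ENGINE = MergeTree ORDER BY tuple()"
    )


ddl = create_table_sql(
    "people",  # hypothetical table name
    {
        "unique_id": "UInt64",
        "first_name": "Nullable(String)",
        "dob": "Nullable(Date32)",  # a type that is not inferred from pandas
    },
)

# With a clickhouse_connect client, as in the usage section, you could then:
#   client.command(ddl)
#   ...load your data into the table by whatever means suits you...
#   linker = Linker("people", settings, db_api=db_api)
```

Declaring the types yourself means columns such as dates keep their intended type, rather than being limited to the string, integer, and float types that are inferred from pandas.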