This package contains the model classes that the PPRL service uses for validation.
They are designed for an HTTP-based service that performs record linkage based on Bloom filters.
The package covers models for the service's data transformation, masking and bit vector matching routines.
[Pydantic](https://docs.pydantic.dev/latest/) is used for validation, serialization and deserialization.
This package is rarely meant to be used directly.
Instead, other packages depend on it to power their functionality.
# Data models
Models for entity pre-processing, masking and bit vector matching are exposed through this package.
The following examples are taken from the test suites of the
[PPRL service package](https://github.com/ul-mds/pprl/tree/main/packages/pprl_service) and show
validation steps that go beyond the ones native to Pydantic.
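
To make the idea of a validation step beyond per-field type checks more concrete, here is a minimal, dependency-free sketch of a cross-field rule. It mirrors the salting rule shown in the masking examples (exactly one of a fixed value or an attribute reference must be set). `SaltSketch` and `describe` are hypothetical names for illustration only; they are not part of `pprl_model`, and the real models presumably express such rules through Pydantic's own validator mechanisms rather than a dataclass.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SaltSketch:
    """Hypothetical stand-in for a salt model; NOT the pprl_model class."""
    value: Optional[str] = None
    attribute: Optional[str] = None

    def __post_init__(self) -> None:
        # Cross-field rule: exactly one of the two salt sources must be given.
        if self.value is not None and self.attribute is not None:
            raise ValueError("value and attribute cannot be set at the same time")
        if self.value is None and self.attribute is None:
            raise ValueError("neither value nor attribute is set")


def describe(**kwargs: Optional[str]) -> str:
    """Return "ok" if the salt is valid, otherwise the error message."""
    try:
        SaltSketch(**kwargs)
        return "ok"
    except ValueError as e:
        return f"error: {e}"
```

Each field is valid on its own here; the error only arises from the combination, which is why a plain type check cannot catch it.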
## Entity transformation
```python
from uuid import uuid4

from pprl_model import EntityTransformRequest, TransformConfig, EmptyValueHandling, AttributeValueEntity, \
    AttributeTransformerConfig, NumberTransformer, GlobalTransformerConfig, NormalizationTransformer, \
    CharacterFilterTransformer

# This is a valid config.
_ = EntityTransformRequest(
    config=TransformConfig(empty_value=EmptyValueHandling.ignore),
    entities=[
        AttributeValueEntity(
            id="001",
            attributes={
                "bar1": " 12.345 ",
                "bar2": " 12.345 "
            }
        )
    ],
    attribute_transformers=[
        AttributeTransformerConfig(
            attribute_name="bar1",
            transformers=[
                NumberTransformer(decimal_places=2)
            ]
        )
    ],
    global_transformers=GlobalTransformerConfig(
        before=[
            NormalizationTransformer()
        ],
        after=[
            CharacterFilterTransformer(characters=".")
        ]
    )
)

# Validation will fail since no transformers have been defined.
_ = EntityTransformRequest(
    config=TransformConfig(empty_value=EmptyValueHandling.ignore),
    entities=[
        AttributeValueEntity(
            id=str(uuid4()),
            attributes={
                "foo": "bar"
            }
        )
    ],
    attribute_transformers=[]
)
# => ValidationError: attribute and global transformers are empty: must contain at least one
```
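
The request above only describes a transformation; the models themselves do not perform it. As a rough illustration of what such a configuration might express, the following sketch applies "before" transformers, then attribute-specific ones, then "after" transformers to each attribute value. The ordering is an assumption inferred from the field names `before` and `after`, and the three helper functions are simplified, hypothetical stand-ins (whitespace stripping, decimal rounding, character removal) that may differ from what the actual service does.

```python
from decimal import Decimal, ROUND_HALF_UP
from typing import Callable, Dict, List

Transformer = Callable[[str], str]


def round_number(decimal_places: int) -> Transformer:
    """Hypothetical stand-in: round a numeric string to a fixed number of places."""
    quantum = Decimal(10) ** -decimal_places  # e.g. Decimal("0.01") for 2 places

    def apply(value: str) -> str:
        return str(Decimal(value).quantize(quantum, rounding=ROUND_HALF_UP))

    return apply


def drop_characters(characters: str) -> Transformer:
    """Hypothetical stand-in: remove every listed character from the value."""

    def apply(value: str) -> str:
        return "".join(c for c in value if c not in characters)

    return apply


def transform(
    attributes: Dict[str, str],
    before: List[Transformer],
    per_attribute: Dict[str, List[Transformer]],
    after: List[Transformer],
) -> Dict[str, str]:
    """Apply global 'before', attribute-specific and global 'after' steps in order."""
    result = {}
    for name, value in attributes.items():
        for step in [*before, *per_attribute.get(name, []), *after]:
            value = step(value)
        result[name] = value
    return result


result = transform(
    {"bar1": " 12.345 ", "bar2": " 12.345 "},
    before=[str.strip],  # assumed, simplified form of normalization
    per_attribute={"bar1": [round_number(2)]},
    after=[drop_characters(".")],
)
```

In this sketch only `bar1` is rounded, because the number step is configured for that attribute alone, while the global steps touch both attributes.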
## Entity masking
```python
from pprl_model import EntityMaskRequest, MaskConfig, HashConfig, HashFunction, HashAlgorithm, \
    DoubleHash, CLKFilter, AttributeValueEntity, StaticAttributeConfig, AttributeSalt, CLKRBFFilter

# This is a valid config.
_ = EntityMaskRequest(
    config=MaskConfig(
        token_size=2,
        hash=HashConfig(
            function=HashFunction(algorithms=[HashAlgorithm.sha1]),
            strategy=DoubleHash()
        ),
        filter=CLKFilter(filter_size=1024, hash_values=5),
        padding="_"
    ),
    entities=[
        AttributeValueEntity(
            id="001",
            attributes={
                "first_name": "John",
                "last_name": "Doe",
                "date_of_birth": "1987-06-05",
                "gender": "m"
            }
        )
    ]
)

# This is an invalid config since salting an attribute can only be done through a fixed value
# or another attribute on an entity, not both at the same time.
_ = EntityMaskRequest(
    config=MaskConfig(
        token_size=2,
        hash=HashConfig(
            function=HashFunction(algorithms=[HashAlgorithm.sha1]),
            strategy=DoubleHash()
        ),
        filter=CLKFilter(filter_size=1024, hash_values=5),
        padding="_"
    ),
    entities=[
        AttributeValueEntity(
            id="001",
            attributes={
                "first_name": "foobar",
                "salt": "0123456789"
            }
        )
    ],
    attributes=[
        StaticAttributeConfig(
            attribute_name="first_name",
            salt=AttributeSalt(
                value="my_salt",
                attribute="salt"
            )
        )
    ]
)
# => ValidationError: value and attribute cannot be set at the same time

# Validation also fails if neither a static value nor an attribute is set for salting.
_ = EntityMaskRequest(
    config=MaskConfig(
        token_size=2,
        hash=HashConfig(
            function=HashFunction(algorithms=[HashAlgorithm.sha1]),
            strategy=DoubleHash()
        ),
        filter=CLKFilter(filter_size=1024, hash_values=5),
        padding="_"
    ),
    entities=[
        AttributeValueEntity(
            id="001",
            attributes={
                "first_name": "foobar",
                "salt": "0123456789"
            }
        )
    ],
    attributes=[
        StaticAttributeConfig(
            attribute_name="first_name",
            salt=AttributeSalt()
        )
    ]
)
# => ValidationError: neither value nor attribute is set

# When using a weighted filter (RBF, CLKRBF), validation fails if any provided attribute configuration
# is static instead of weighted. The reverse also applies: validation fails if CLK is specified as the
# filter and weighted attribute configurations are provided.
_ = EntityMaskRequest(
    config=MaskConfig(
        token_size=2,
        hash=HashConfig(
            function=HashFunction(algorithms=[HashAlgorithm.sha1]),
            strategy=DoubleHash()
        ),
        filter=CLKRBFFilter(hash_values=5),
        padding="_"
    ),
    entities=[
        AttributeValueEntity(
            id="001",
            attributes={
                "first_name": "foobar",
                "salt": "0123456789"
            }
        )
    ],
    attributes=[
        StaticAttributeConfig(
            attribute_name="first_name",
            salt=AttributeSalt(value="my_salt")
        )
    ]
)
# => ValidationError: `clkrbf` filters require weighted attribute configurations, but static ones were found

# Weighted filters (RBF, CLKRBF) always require weighted attribute configurations. If none
# are provided, validation fails.
_ = EntityMaskRequest(
    config=MaskConfig(
        token_size=2,
        hash=HashConfig(
            function=HashFunction(algorithms=[HashAlgorithm.sha1]),
            strategy=DoubleHash()
        ),
        filter=CLKRBFFilter(hash_values=5),
        padding="_"
    ),
    entities=[
        AttributeValueEntity(
            id="001",
            attributes={
                "first_name": "foobar",
                "salt": "0123456789"
            }
        )
    ]
)
# => ValidationError: `clkrbf` filters require weighted attribute configurations, but none were found

# If a configuration is provided for an attribute that doesn't exist on some entities, validation fails.
_ = EntityMaskRequest(
    config=MaskConfig(
        token_size=2,
        hash=HashConfig(
            function=HashFunction(algorithms=[HashAlgorithm.sha1]),
            strategy=DoubleHash()
        ),
        filter=CLKFilter(filter_size=1024, hash_values=5),
        padding="_"
    ),
    entities=[
        AttributeValueEntity(
            id="001",
            attributes={
                "first_name": "foobar"
            }
        )
    ],
    attributes=[
        StaticAttributeConfig(
            attribute_name="last_name",
            salt=AttributeSalt(value="my_salt")
        )
    ]
)
# => ValidationError: some configured attributes are not present on entities: `last_name` on entities with ID `001`
```
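
For readers unfamiliar with the terms in the mask configuration, the following standard-library sketch shows roughly how a CLK-style Bloom filter can be built with the parameters from the valid example: tokens of size 2, `_` as padding, a filter of 1024 bits and 5 hash values per token. It uses the textbook double-hashing scheme `(h1 + i * h2) mod filter_size`. This is not the PPRL service's implementation. In particular, the README example configures SHA-1 only, so deriving the second hash from SHA-256 is an assumption made purely for this illustration, and all function names are hypothetical.

```python
import hashlib
from typing import Iterable, List


def tokenize(value: str, token_size: int = 2, padding: str = "_") -> List[str]:
    """Split a value into overlapping tokens (q-grams), padded on both sides."""
    pad = padding * (token_size - 1)
    padded = pad + value + pad
    return [padded[i:i + token_size] for i in range(len(padded) - token_size + 1)]


def double_hash_positions(token: str, filter_size: int, hash_values: int) -> List[int]:
    """Derive bit positions from two base hashes: (h1 + i * h2) mod filter_size."""
    data = token.encode("utf-8")
    h1 = int.from_bytes(hashlib.sha1(data).digest(), "big")
    # Assumption: a second, different digest serves as h2 in this sketch.
    h2 = int.from_bytes(hashlib.sha256(data).digest(), "big")
    return [(h1 + i * h2) % filter_size for i in range(hash_values)]


def clk(values: Iterable[str], filter_size: int = 1024, hash_values: int = 5) -> List[int]:
    """Hash the tokens of all attribute values into one shared bit vector."""
    bits = [0] * filter_size
    for value in values:
        for token in tokenize(value):
            for position in double_hash_positions(token, filter_size, hash_values):
                bits[position] = 1
    return bits


vector = clk(["John", "Doe", "1987-06-05", "m"])
```

Because every attribute is hashed into the same vector, similar records tend to share many set bits, which is what the bit vector matching step relies on.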
## Bit vector matching
```python
from pprl_model import VectorMatchRequest, MatchConfig, SimilarityMeasure, BitVectorEntity

_ = VectorMatchRequest(
    config=MatchConfig(
        measure=SimilarityMeasure.jaccard,
        threshold=0.8
    ),
    domain=[
        BitVectorEntity(
            id="D001",
            value="kY7yXn+rmp8L0nyGw5NlMw=="
        )
    ],
    range=[
        BitVectorEntity(
            id="R001",
            value="qig0C1i8YttqhPwo4VqLlg=="
        )
    ]
)
```
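
To illustrate what a Jaccard measure with a threshold means for bit vectors, the sketch below computes the Jaccard similarity (shared set bits divided by bits set in either vector) of two encoded vectors and compares it against a threshold. It assumes the `value` strings are Base64-encoded bytes, which matches their appearance in the example but is not stated in this README. The function names are hypothetical, and returning `0.0` for two all-zero vectors is an arbitrary choice for this sketch.

```python
import base64


def popcount(data: bytes) -> int:
    """Count the number of set bits in a byte string."""
    return bin(int.from_bytes(data, "big")).count("1")


def jaccard(a_b64: str, b_b64: str) -> float:
    """Jaccard similarity of two Base64-encoded bit vectors of equal length."""
    a = base64.b64decode(a_b64)
    b = base64.b64decode(b_b64)
    if len(a) != len(b):
        raise ValueError("bit vectors must have the same length")
    intersection = popcount(bytes(x & y for x, y in zip(a, b)))
    union = popcount(bytes(x | y for x, y in zip(a, b)))
    # Arbitrary convention for this sketch: two empty vectors are not similar.
    return intersection / union if union else 0.0


def is_match(a_b64: str, b_b64: str, threshold: float = 0.8) -> bool:
    """Treat a pair as a match if its similarity reaches the threshold."""
    return jaccard(a_b64, b_b64) >= threshold


# Vectors taken from the request example above.
similarity = jaccard("kY7yXn+rmp8L0nyGw5NlMw==", "qig0C1i8YttqhPwo4VqLlg==")
```

A higher threshold yields fewer but more reliable matches; with `0.8`, at least 80 percent of the bits set in either vector must be set in both.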
# License
MIT.
Raw data
{
"_id": null,
"home_page": "https://github.com/ul-mds/pprl",
"name": "pprl-model",
"maintainer": null,
"docs_url": null,
"requires_python": "<4.0,>=3.10",
"maintainer_email": null,
"keywords": "record linkage, privacy, bloom filter",
"author": "Maximilian Jugl",
"author_email": "Maximilian.Jugl@medizin.uni-leipzig.de",
"download_url": "https://files.pythonhosted.org/packages/c4/55/dc05b98dc34c948e86a02b026da6faaededc7b2df4cd7b921b320eec1565/pprl_model-0.1.5.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "MIT",
"summary": "Data models for use with a HTTP-based service for privacy-preserving record linkage using Bloom filters.",
"version": "0.1.5",
"project_urls": {
"Homepage": "https://github.com/ul-mds/pprl",
"Repository": "https://github.com/ul-mds/pprl"
},
"split_keywords": [
"record linkage",
" privacy",
" bloom filter"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "c5e9eb38328c9988a2dcd24d507047375a63607765bd1cd4347e3084542cf89a",
"md5": "cb3e7b684f038b216caefc2211c58eb9",
"sha256": "2292c4587904a28b0786074ea5ea4fb5c88a7e82c676a57df034dc0902e3d383"
},
"downloads": -1,
"filename": "pprl_model-0.1.5-py3-none-any.whl",
"has_sig": false,
"md5_digest": "cb3e7b684f038b216caefc2211c58eb9",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": "<4.0,>=3.10",
"size": 8872,
"upload_time": "2024-09-17T13:19:07",
"upload_time_iso_8601": "2024-09-17T13:19:07.326488Z",
"url": "https://files.pythonhosted.org/packages/c5/e9/eb38328c9988a2dcd24d507047375a63607765bd1cd4347e3084542cf89a/pprl_model-0.1.5-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "c455dc05b98dc34c948e86a02b026da6faaededc7b2df4cd7b921b320eec1565",
"md5": "ab560de8d94d2730864c27ed78a47090",
"sha256": "84132d2a8b387f48b7122ce450781c9510e5644e0ff5940ef02ff87c67d641bf"
},
"downloads": -1,
"filename": "pprl_model-0.1.5.tar.gz",
"has_sig": false,
"md5_digest": "ab560de8d94d2730864c27ed78a47090",
"packagetype": "sdist",
"python_version": "source",
"requires_python": "<4.0,>=3.10",
"size": 8416,
"upload_time": "2024-09-17T13:19:09",
"upload_time_iso_8601": "2024-09-17T13:19:09.066584Z",
"url": "https://files.pythonhosted.org/packages/c4/55/dc05b98dc34c948e86a02b026da6faaededc7b2df4cd7b921b320eec1565/pprl_model-0.1.5.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-09-17 13:19:09",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "ul-mds",
"github_project": "pprl",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"lcname": "pprl-model"
}