pprl-client


| Field | Value |
| --- | --- |
| Name | pprl-client |
| Version | 0.3.1 |
| Home page | https://github.com/ul-mds/pprl |
| Summary | HTTP-based client for interacting with a service for privacy-preserving record linkage with Bloom filters. |
| Upload time | 2024-09-16 13:03:44 |
| Author | Maximilian Jugl |
| Requires Python | <4.0,>=3.10 |
| License | MIT |
| Keywords | record linkage, privacy, bloom filter, bitarray, cryptography, service, client, cli |

This package contains a small HTTP-based library for working with the server provided by
the [PPRL service package](https://github.com/ul-mds/pprl/tree/main/packages/pprl_service).
It also contains a command-line application which uses the library to process CSV files.

Weight estimation requires additional packages which are not shipped by default.
To add them, install this package using any of the following commands as desired.

```
$ pip install pprl_client[faker]
$ pip install pprl_client[gecko]
$ pip install pprl_client[all]
```

# Library methods

The library exposes functions for entity pre-processing, masking, bit vector matching and attribute weight estimation.
They follow the data model that is also used by the PPRL service, which is exposed through
the [PPRL model package](https://github.com/ul-mds/pprl/tree/main/packages/pprl_model).

In addition to the request objects, each function accepts a base URL, a full URL and a connection timeout in seconds as
optional parameters.
By default, the base URL is set to http://localhost:8000.
The full URL, if set, takes precedence over the base URL.
The connection timeout is set to 10 seconds by default, but should be increased for large-scale requests.
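
The precedence rule for URLs can be pictured with a small helper.
This is a sketch only: `resolve_url`, its parameter names and the `transform` path segment are hypothetical and are not
taken from the library.

```python
from typing import Optional

DEFAULT_BASE_URL = "http://localhost:8000"


def resolve_url(path: str, base_url: str = DEFAULT_BASE_URL, full_url: Optional[str] = None) -> str:
    """Return the URL a request would be sent to (hypothetical helper)."""
    if full_url is not None:
        # A full URL, if set, takes precedence over the base URL.
        return full_url
    # Otherwise the endpoint path is appended to the base URL.
    return f"{base_url.rstrip('/')}/{path.lstrip('/')}"


print(resolve_url("transform"))
print(resolve_url("transform", base_url="https://pprl.example.org/"))
print(resolve_url("transform", full_url="https://pprl.example.org/api/v2/transform"))
```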

## Entity transformation

```python
import pprl_client
from pprl_model import EntityTransformRequest, TransformConfig, EmptyValueHandling, AttributeValueEntity, \
    GlobalTransformerConfig, NormalizationTransformer

response = pprl_client.transform(EntityTransformRequest(
    config=TransformConfig(empty_value=EmptyValueHandling.error),
    entities=[
        AttributeValueEntity(
            id="001",
            attributes={
                "first_name": "Müller",
                "last_name": "Ludenscheidt"
            }
        )
    ],
    global_transformers=GlobalTransformerConfig(
        before=[NormalizationTransformer()]
    )
))

print(response.entities)
# => [AttributeValueEntity(id='001', attributes={'first_name': 'muller', 'last_name': 'ludenscheidt'})]
```

## Entity masking

```python
import pprl_client
from pprl_model import EntityMaskRequest, MaskConfig, HashConfig, HashFunction, HashAlgorithm, RandomHash, CLKFilter, \
    AttributeValueEntity

response = pprl_client.mask(EntityMaskRequest(
    config=MaskConfig(
        token_size=2,
        hash=HashConfig(
            function=HashFunction(
                algorithms=[HashAlgorithm.sha1],
                key="s3cr3t_k3y"
            ),
            strategy=RandomHash()
        ),
        filter=CLKFilter(hash_values=5, filter_size=256)
    ),
    entities=[
        AttributeValueEntity(
            id="001",
            attributes={
                "first_name": "muller",
                "last_name": "ludenscheidt"
            }
        )
    ]
))

print(response.entities)
# => [BitVectorEntity(id='001', value='SKkgqBHBCJJCANICEKSpWMAUBYCQEMLuZgEQGBKRC8A=')]
```
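
For readers unfamiliar with Bloom filter masking, the idea behind the request above can be sketched locally.
The code below is a conceptual illustration, not the service's algorithm: the padding character, the use of HMAC-SHA1
and the way a seeded random number generator picks bit positions are all assumptions, and the function names are made
up.
It will not reproduce the bit vectors returned by the service.

```python
import base64
import hashlib
import hmac
import random


def tokenize(value: str, token_size: int = 2, padding: str = "_") -> list:
    """Split a value into overlapping tokens, padded on both ends."""
    padded = padding * (token_size - 1) + value + padding * (token_size - 1)
    return [padded[i:i + token_size] for i in range(len(padded) - token_size + 1)]


def clk_mask(attributes: dict, key: str, hash_values: int = 5, filter_size: int = 256) -> str:
    """Insert the tokens of all attributes into one bit vector and return it as Base64.

    filter_size is assumed to be a multiple of 8.
    """
    bits = bytearray(filter_size // 8)
    for value in attributes.values():
        for token in tokenize(value):
            # Keyed hash of the token; the digest seeds a PRNG that draws the bit positions.
            digest = hmac.new(key.encode(), token.encode(), hashlib.sha1).digest()
            rng = random.Random(int.from_bytes(digest, "big"))
            for _ in range(hash_values):
                pos = rng.randrange(filter_size)
                bits[pos // 8] |= 1 << (pos % 8)
    return base64.b64encode(bytes(bits)).decode()


masked = clk_mask({"first_name": "muller", "last_name": "ludenscheidt"}, key="s3cr3t_k3y")
print(len(base64.b64decode(masked)) * 8)  # number of bits in the vector
```

All attributes of an entity end up in the same bit vector, so the original values cannot be read back from it, while
similar records still share many set bits.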

## Bit vector matching

```python
import pprl_client
from pprl_model import VectorMatchRequest, MatchConfig, SimilarityMeasure, BitVectorEntity

response = pprl_client.match(VectorMatchRequest(
    config=MatchConfig(
        measure=SimilarityMeasure.jaccard,
        threshold=0.8
    ),
    domain=[
        BitVectorEntity(
            id="001",
            value="SKkgqBHBCJJCANICEKSpWMAUBYCQEMLuZgEQGBKRC8A="
        )
    ],
    range=[
        BitVectorEntity(
            id="100",
            value="UKkgqBHBDJJCANICELSpWMAUBYCMEMLrZgEQGBKRC7A="
        ),
        BitVectorEntity(
            id="101",
            value="H5DN45iUeEjrjbHZrzHb3AyQk9O4IgxcpENKKzEKRLE="
        )
    ]
))

print(response.matches)
# => [Match(domain=BitVectorEntity(id='001', value='SKkgqBHBCJJCANICEKSpWMAUBYCQEMLuZgEQGBKRC8A='), range=BitVectorEntity(id='100', value='UKkgqBHBDJJCANICELSpWMAUBYCMEMLrZgEQGBKRC7A='), similarity=0.8536585365853658)]
```
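
The similarity in this output can be checked by hand.
The sketch below assumes that the `value` strings are Base64-encoded bit vectors and computes the Jaccard index as the
number of bits set in both vectors divided by the number of bits set in either one.
The `jaccard` helper is illustrative and not part of the library.

```python
import base64


def jaccard(a_b64: str, b_b64: str) -> float:
    """Jaccard index of two Base64-encoded bit vectors of equal length."""
    a = int.from_bytes(base64.b64decode(a_b64), "big")
    b = int.from_bytes(base64.b64decode(b_b64), "big")
    union = bin(a | b).count("1")
    if union == 0:
        return 0.0
    return bin(a & b).count("1") / union


domain = "SKkgqBHBCJJCANICEKSpWMAUBYCQEMLuZgEQGBKRC8A="
candidate = "UKkgqBHBDJJCANICELSpWMAUBYCMEMLrZgEQGBKRC7A="
print(jaccard(domain, candidate))  # 70 shared bits out of 82 set bits, roughly 0.8537
```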

## Attribute weight estimation

```python
import pprl_client
from pprl_model import AttributeValueEntity, BaseTransformRequest, TransformConfig, EmptyValueHandling, \
    GlobalTransformerConfig, NormalizationTransformer

stats = pprl_client.compute_attribute_stats(
    [
        AttributeValueEntity(
            id="001",
            attributes={
                "given_name": "Max",
                "last_name": "Mustermann",
                "gender": "m"
            }
        ),
        AttributeValueEntity(
            id="002",
            attributes={
                "given_name": "Maria",
                "last_name": "Musterfrau",
                "gender": "f"
            }
        )
    ],
    BaseTransformRequest(
        config=TransformConfig(empty_value=EmptyValueHandling.skip),
        global_transformers=GlobalTransformerConfig(
            before=[NormalizationTransformer()]
        )
    ),
)

print(stats)
# => {'given_name': AttributeStats(average_tokens=5.0, ngram_entropy=2.9219280948873623), 'last_name': AttributeStats(average_tokens=11.0, ngram_entropy=3.913977073182751), 'gender': AttributeStats(average_tokens=2.0, ngram_entropy=2.0)}
```
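
The figures in this output are consistent with splitting each normalized value into 2-grams padded with `_` and
computing the Shannon entropy of the resulting token distribution.
This is inferred from the printed numbers rather than from the library's source, so treat the sketch below as an
approximation with made-up function names.

```python
import math
from collections import Counter


def tokenize(value, token_size=2, padding="_"):
    """Split a value into overlapping tokens, padded on both ends."""
    padded = padding * (token_size - 1) + value + padding * (token_size - 1)
    return [padded[i:i + token_size] for i in range(len(padded) - token_size + 1)]


def attribute_stats(values):
    """Return (average token count, entropy in bits of the token distribution)."""
    counts = Counter(token for value in values for token in tokenize(value))
    total = sum(counts.values())
    entropy = -sum(c / total * math.log2(c / total) for c in counts.values())
    return total / len(values), entropy


# Values as they look after normalization.
print(attribute_stats(["max", "maria"]))
print(attribute_stats(["mustermann", "musterfrau"]))
print(attribute_stats(["m", "f"]))
```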

# Command line interface

The `pprl` command exposes all the library's functions and adapts them to work with CSV files. 
Running `pprl --help` provides an overview of the command options.

```
$ pprl --help
Usage: pprl [OPTIONS] COMMAND [ARGS]...

  HTTP client for performing PPRL based on Bloom filters.

Options:
  --base-url TEXT                 base URL to HTTP-based PPRL service
  -b, --batch-size INTEGER RANGE  amount of bit vectors to match at a time  [x>=1]
  --timeout-secs INTEGER RANGE    seconds until a request times out  [x>=1]
  --delimiter TEXT                column delimiter for CSV files
  --encoding TEXT                 character encoding for files
  --help                          Show this message and exit.

Commands:
  estimate   Estimate attribute weights based on randomly generated data.
  mask       Mask a CSV file with entities.
  match      Match bit vectors from CSV files against each other.
  transform  Perform pre-processing on a CSV file with entities
```
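
The `--batch-size` option limits how many bit vectors are matched at a time.
How the command splits its input internally is not documented here; the sketch below only illustrates the general idea
of processing a list in fixed-size batches, and `batched` is a hypothetical helper.

```python
from itertools import islice


def batched(items, batch_size):
    """Yield consecutive lists of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    while batch := list(islice(iterator, batch_size)):
        yield batch


for batch in batched(["001", "002", "003", "004", "005"], batch_size=2):
    print(batch)
```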

The `pprl` command works on two basic types of CSV files that follow a simple structure.
Entity files are CSV files that contain a column with a unique identifier and arbitrary additional columns which
contain values for certain attributes that identify an entity.
Each row represents a single entity.

```csv
id,first_name,last_name,date_of_birth,gender
001,Natalie,Sampson,1956-12-16,female
002,Eric,Lynch,1910-01-11,female
003,Pam,Vaughn,1983-10-05,male
004,David,Jackson,2006-01-27,male
005,Rachel,Dyer,1904-02-02,female
```
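
An entity file of this shape can be read with the standard library.
The following sketch treats the `id` column as the identifier and every other column as an attribute; `read_entities`
is an illustrative helper, not a function of this package.

```python
import csv
import io

ENTITY_CSV = """id,first_name,last_name,date_of_birth,gender
001,Natalie,Sampson,1956-12-16,female
002,Eric,Lynch,1910-01-11,female
"""


def read_entities(file, id_column="id", delimiter=","):
    """Yield (id, attributes) pairs; every non-ID column becomes an attribute."""
    for row in csv.DictReader(file, delimiter=delimiter):
        entity_id = row.pop(id_column)
        yield entity_id, row


entities = list(read_entities(io.StringIO(ENTITY_CSV)))
print(entities[0])
```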

Bit vector files contain an ID column and a value column which holds the bit vector for that ID.
These bit vectors are typically produced by masking the records of an entity file.

```csv
id,value
001,0Dr8t+kE5ltI+xdM85fwx0QLrTIgvFN35/0YvODNdOE0AaUHPphikXYy4LlArE4UqfjPs+wKtT233R7lBzSp5mwkCjTzA1tl0N7s+sFeKyIrOiGk0gNIYvA=
002,QMEIkE9TN1Quv0K0QAIk1RZD3qF7nQh0IyOYqVDf8IQkyaLGcFjiLHsEgBpU8CRSCuATbWpjEwGi3dilizySQy4miGiJolilYmwKysjseq+IFsAU3T1IRjA=
003,BqFoNZhrAVBq9SV1wBK0dUZLHDM9hCBoO4XdKCzvasSUELQeAB8+DV5tAhDl5KCSJfDCB6JG4WSoCFbozXqBYSUMqEQJE0JwhpRK6oLOcRRoGwGESDBMZwA=
004,8C9KItMTwtz4oXQvo8G0t1bTnwspnghmJwyqqcL2RIHASb4XJHAqybMCXQBm5mq6h/kdxGbblxBjhy79jRUcI60haqZhNsst0n7OUAxM/UoZVumIilRIbCA=
005,CFk4I0sKwnRoiTEOQASy1QZfHCGB1GBgYQDcZwDDtIkGGLOmLRhrQyOSlQDUDoYTbvaBRVqbkRnqmYQbDTEGlG+2y60FMmBEKtxsr0I4I00oMpuoXAsDWmA=
```
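
The values in such a file look like Base64-encoded byte strings.
Under that assumption, a bit vector file can be loaded and inspected as sketched below.
The two rows are toy values chosen for brevity and `read_bit_vectors` is a hypothetical helper.

```python
import base64
import csv
import io

# Toy values for illustration: "/w==" decodes to 0xFF, "Dw==" to 0x0F.
BIT_VECTOR_CSV = """id,value
001,/w==
002,Dw==
"""


def read_bit_vectors(file):
    """Map each ID to its decoded bit vector."""
    return {row["id"]: base64.b64decode(row["value"]) for row in csv.DictReader(file)}


vectors = read_bit_vectors(io.StringIO(BIT_VECTOR_CSV))
for entity_id, raw in vectors.items():
    set_bits = sum(bin(byte).count("1") for byte in raw)
    print(entity_id, len(raw) * 8, set_bits)  # ID, vector length in bits, number of set bits
```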

Pre-processing is done with the `pprl transform` command.
It requires a base transform request file, an entity file and an output file to write the pre-processed entities to.
Attribute-level and global transformer configurations are both optional, but at least one of them must be specified.

This example defines a global normalization transformer which is executed before all attribute-specific transformers.
Date-time reformatting is applied to the `date_of_birth` column of the input file.

_request.json_

```json
{
  "config": {
    "empty_value": "skip"
  },
  "attribute_transformers": [
    {
      "attribute_name": "date_of_birth",
      "transformers": [
        {
          "name": "date_time",
          "input_format": "%Y-%m-%d",
          "output_format": "%Y%m%d"
        }
      ]
    }
  ],
  "global_transformers": {
    "before": [
      {
        "name": "normalization"
      }
    ]
  }
}
```

```
$ pprl transform ./request.json ./input.csv ./output.csv  
Transforming entities  [####################################]  100%
```

_output.csv_

```csv
id,first_name,last_name,date_of_birth,gender
001,natalie,sampson,19561216,female
002,eric,lynch,19100111,female
003,pam,vaughn,19831005,male
004,david,jackson,20060127,male
005,rachel,dyer,19040202,female
```
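
The effect of this request on a single row can be approximated locally.
The sketch below is not the service's implementation: `normalize` is an assumed stand-in for the normalization
transformer (lowercasing and removing diacritics), and `reformat_date` mirrors the `input_format` and `output_format`
fields with `datetime`.

```python
import unicodedata
from datetime import datetime


def normalize(value: str) -> str:
    """Lowercase, drop diacritics and collapse whitespace (approximation)."""
    decomposed = unicodedata.normalize("NFKD", value)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(stripped.lower().split())


def reformat_date(value: str, input_format: str = "%Y-%m-%d", output_format: str = "%Y%m%d") -> str:
    """Parse a date with one format and print it with another."""
    return datetime.strptime(value, input_format).strftime(output_format)


row = {"first_name": "Natalie", "last_name": "Sampson", "date_of_birth": "1956-12-16", "gender": "female"}
# Global transformer first, then the attribute-specific one.
row = {name: normalize(value) for name, value in row.items()}
row["date_of_birth"] = reformat_date(row["date_of_birth"])
print(row)
```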

Masking is done with `pprl mask` and its subcommands.
It requires a base mask request file, an entity file and an output file to write the masked entities to.

_request.json_

```json
{
  "config": {
    "token_size": 2,
    "hash": {
      "function": {
        "algorithms": ["sha256"],
        "key": "s3cr3t_k3y"
      },
      "strategy": {
        "name": "random_hash"
      }
    },
    "prepend_attribute_name": true,
    "filter": {
      "type": "clk",
      "filter_size": 512,
      "hash_values": 5,
      "padding": "_",
      "hardeners": [
        {
          "name": "permute",
          "seed": 727
        },
        {
          "name": "rehash",
          "window_size": 16,
          "window_step": 8,
          "samples": 2
        }
      ]
    }
  }
}
```

_input.csv_

```csv
id,first_name,last_name,date_of_birth,gender
001,natalie,sampson,19561216,female
002,eric,lynch,19100111,female
003,pam,vaughn,19831005,male
004,david,jackson,20060127,male
005,rachel,dyer,19040202,female
```

```
$ pprl mask ./request.json ./input.csv ./output.csv
Masking entities  [####################################]  100%
```

_output.csv_

```csv
id,value
001,wAWgITvQ1/VACpRYC2EKrfCkWziyEhmyKwi5sMsFrAQVoIBygTQScPRoIIAto0AwS0ihlcAIFAcQRwccY5IOmQ==
002,cFCwQIABQ+TgSSdlGM/z54BEUgmYhA1GKtCxQAKAXFIWiPAFIQYaFArgM61pUAAeATwBlBEOEw4Oowe0rbcMGw==
003,IgK16AAISCRoCuVAb1UBZYBBhGgxSEkKeMkTUCKAx4IAsNGJBS4ShgBAGIapBIQWJLiBFEEKAIWAGYS8ZZGMKw==
004,ZlBkyoYIEWmeaxbPDNng5JjHACkCAJwjlBCJQBJ4ZBSyOAukACUahOAFQ20oNwTQEDRA005+VUUfsUQcKCGNxg==
005,cUekQFQkI7TpTcRwmcNDoodRRBshlSEiAUjBQiMlxBLTmODMJICmDmxgUqYKonQEMFD58QsogRQFIgYUwJDOHA==
```

Matching is done with the `pprl match` command.
It can match multiple bit vector input files at once.
If more than two files are provided, the command picks out pairs of files and matches their contents against one
another.

In this example, the bit vectors of two files are matched against each other.
The Jaccard index is used as a similarity measure and a match threshold of 70% is applied.

_request.json_

```json
{
  "config": {
    "measure": "jaccard",
    "threshold": 0.7
  }
}
```

_domain.csv_

```csv
id,value
001,wAWgITvQ1/VACpRYC2EKrfCkWziyEhmyKwi5sMsFrAQVoIBygTQScPRoIIAto0AwS0ihlcAIFAcQRwccY5IOmQ==
002,cFCwQIABQ+TgSSdlGM/z54BEUgmYhA1GKtCxQAKAXFIWiPAFIQYaFArgM61pUAAeATwBlBEOEw4Oowe0rbcMGw==
003,IgK16AAISCRoCuVAb1UBZYBBhGgxSEkKeMkTUCKAx4IAsNGJBS4ShgBAGIapBIQWJLiBFEEKAIWAGYS8ZZGMKw==
004,ZlBkyoYIEWmeaxbPDNng5JjHACkCAJwjlBCJQBJ4ZBSyOAukACUahOAFQ20oNwTQEDRA005+VUUfsUQcKCGNxg==
005,cUekQFQkI7TpTcRwmcNDoodRRBshlSEiAUjBQiMlxBLTmODMJICmDmxgUqYKonQEMFD58QsogRQFIgYUwJDOHA==
```

_range.csv_

```csv
id,value
101,kUSyxIgtIDSAB7ZYDkFQRZpFoMkCjCCCbDTWAUJTRAAEBpspBX4PNUZKi1AIVCABAjg6EAoKuwVleeUYgRBYoQ==
102,IAA0YE4MGexIiYdEjwNzoOKmIA4CEHEiKQASYFPhxQTQlPAAgYW3AWBYmQJ8YMoaAj0ZkoOrFyUmFo52TDcIKw==
103,BFAwREkkQbTdzddgDHFWgMRJMyxAMW+jq2ASICMBtIEr+YDCBRUgxEDIsQpciO4mAK3h2cIbXFQCMlaVpJPZIQ==
104,wBWgITvQ2/VACpRYC2EKrfCkWxiyEhmyKwi5sMsFrBQVoIBygTQScPRoIIAto0AwS0ihldAIFAcQRwccY5IOmQ==
105,QCCwIKQAED5AjaZYmodDcZAEBKkIxgAiDfEUoDKEdgEAEJAMAwcfQEbQkaQ4ANAABqiUscAKPQZEMJxRhTGIGQ==
```

```
$ pprl match request.json domain.csv range.csv output.csv
Matching bit vectors from domain.csv and range.csv  [####################################]  100%
```

_output.csv_

```csv
domain_id,domain_file,range_id,range_file,similarity
001,domain.csv,104,range.csv,0.9690721649484536
```
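
When more than two input files are given, the command matches them in pairs.
One way to picture this is as all unordered pairs of files; whether the command enumerates pairs in exactly this way is
an assumption, and `file_pairs` is a made-up helper.

```python
from itertools import combinations


def file_pairs(paths):
    """Return every unordered pair of input files, each pair once."""
    return list(combinations(paths, 2))


for domain_file, range_file in file_pairs(["a.csv", "b.csv", "c.csv"]):
    print(domain_file, range_file)
```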

Weight estimation is done with the `pprl estimate` command.
It generates random data based on a user-provided specification and computes estimates for attribute weights.
Data can be generated using [Faker](https://faker.readthedocs.io/) and [Gecko](https://ul-mds.github.io/gecko/).
These are exposed through the `faker` and `gecko` subcommands respectively.
Both subcommands require a file that tells Faker or Gecko how to generate data, as well as a path to a file to write
results to.
Refer to the example files in the [test asset directory](tests/assets).

```
$ pprl estimate faker tests/assets/faker-config.json faker-output.json
```

_faker-output.json_

```json
[
  {
    "attribute_name": "given_name",
    "weight": 7.657958943890718,
    "average_token_count": 7.5686
  },
  {
    "attribute_name": "last_name",
    "weight": 7.444573503220938,
    "average_token_count": 7.5204
  },
  {
    "attribute_name": "gender",
    "weight": 1.9999971146079947,
    "average_token_count": 2.0
  },
  {
    "attribute_name": "street_name",
    "weight": 7.605565770282046,
    "average_token_count": 16.2188
  },
  {
    "attribute_name": "municipality",
    "weight": 7.659422921807241,
    "average_token_count": 9.952
  },
  {
    "attribute_name": "postcode",
    "weight": 6.7812429085107,
    "average_token_count": 5.9464
  }
]
```

# License

MIT.

            
