# 🔍 Detect PII
Detect PII is a library, inspired by [piicatcher](https://github.com/tokern/piicatcher) and [CommonRegex](https://github.com/madisonmay/CommonRegex), that detects table columns likely to contain PII. It runs regex matches against column names and column values and flags the columns that match.
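To make the idea concrete, here is a minimal, standalone sketch of matching regexes against column names and column values. It does not use detectpii and does not reproduce its patterns; the pattern tables, function names, and sample table are all hypothetical and exist only for illustration.

```python
import re

# Illustrative patterns only; these are NOT the patterns detectpii ships with.
NAME_PATTERNS = {
    "email": re.compile(r"e[-_]?mail", re.IGNORECASE),
    "phone": re.compile(r"phone|mobile", re.IGNORECASE),
}
VALUE_PATTERNS = {
    "email": re.compile(r"^[\w.+-]+@[\w-]+(\.[\w-]+)+$"),
    "phone": re.compile(r"^\+?\d[\d\s().-]{6,}\d$"),
}


def match_column_name(column_name: str) -> set[str]:
    """Return the PII labels whose name pattern occurs in the column name."""
    return {
        label
        for label, pattern in NAME_PATTERNS.items()
        if pattern.search(column_name)
    }


def match_column_values(values: list[str]) -> set[str]:
    """Return the PII labels whose value pattern matches at least one value."""
    return {
        label
        for label, pattern in VALUE_PATTERNS.items()
        if any(pattern.match(value) for value in values)
    }


def flag_columns(table: dict[str, list[str]]) -> dict[str, set[str]]:
    """Map each suspicious column to the labels that flagged it."""
    flagged = {}
    for column, values in table.items():
        labels = match_column_name(column) | match_column_values(values)
        if labels:
            flagged[column] = labels
    return flagged


# A hypothetical table: column name -> sampled values
sample_table = {
    "id": ["1", "2"],
    "contact": ["alice@example.com", "bob@example.org"],
    "phone_number": ["n/a", "n/a"],
}
flagged = flag_columns(sample_table)
```

In this sketch `contact` is flagged by its values alone and `phone_number` by its name alone, which is why checking both names and values catches more columns than either check would.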
## Usage
### Installation
```shell
pip install detectpii
```
### Scan tables for PII
```python
from detectpii.catalog import PostgresCatalog
from detectpii.pipeline import PiiDetectionPipeline
from detectpii.scanner import DataScanner, MetadataScanner
from detectpii.util import print_columns

# -- Create a catalog to connect to a database / warehouse
pg_catalog = PostgresCatalog(
    host="localhost",
    user="postgres",
    password="my-secret-pw",
    database="postgres",
    port=5432,
    schema="public",
)

# -- Create a pipeline to detect PII in the tables
pipeline = PiiDetectionPipeline(
    catalog=pg_catalog,
    scanners=[
        MetadataScanner(),
        DataScanner(percentage=20, times=2),
    ],
)

# -- Scan for PII columns
pii_columns = pipeline.scan()

# -- Print them to the console
print_columns(pii_columns)
```
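The example above hard-codes the connection settings, including the password. One possible alternative, sketched below, is to read them from environment variables. The helper name `postgres_settings_from_env` is hypothetical, as is the `PGSCHEMA` variable; the other variable names follow the usual libpq conventions. Only the keyword names (`host`, `user`, `password`, `database`, `port`, `schema`) come from the example above.

```python
import os
from collections.abc import Mapping


def postgres_settings_from_env(env: Mapping[str, str] = os.environ) -> dict:
    """Build keyword arguments for PostgresCatalog from environment variables.

    Hypothetical helper, not part of detectpii. PGSCHEMA is an assumed
    variable name; the others follow libpq naming.
    """
    return {
        "host": env.get("PGHOST", "localhost"),
        "user": env.get("PGUSER", "postgres"),
        # No default for the password: a missing variable raises KeyError
        "password": env["PGPASSWORD"],
        "database": env.get("PGDATABASE", "postgres"),
        "port": int(env.get("PGPORT", "5432")),
        "schema": env.get("PGSCHEMA", "public"),
    }


# Usage with the catalog from the example above (requires detectpii):
# pg_catalog = PostgresCatalog(**postgres_settings_from_env())
```

Passing a plain dictionary as `env` makes the helper easy to exercise without touching the real environment.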
### Persist the pipeline
```python
import json
from detectpii.pipeline import pipeline_to_dict

# -- Create a pipeline
pipeline = ...

# -- Convert it into a dictionary
dictionary = pipeline_to_dict(pipeline)

# -- Print it
print(json.dumps(dictionary, indent=4))

# {
#     "catalog": {
#         "tables": [],
#         "resolver": {
#             "name": "PlaintextResolver",
#             "_type": "PlaintextResolver"
#         },
#         "user": "postgres",
#         "password": "my-secret-pw",
#         "host": "localhost",
#         "port": 5432,
#         "database": "postgres",
#         "schema": "public",
#         "_type": "PostgresCatalog"
#     },
#     "scanners": [
#         {
#             "_type": "MetadataScanner"
#         },
#         {
#             "times": 2,
#             "percentage": 20,
#             "_type": "DataScanner"
#         }
#     ]
# }
```
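Note that the dictionary printed above contains the database password in plaintext, so it deserves some care when written to disk. The following is a minimal sketch, using only the standard library, of saving such a dictionary to a JSON file that is created with owner-only permissions. `save_pipeline_dict` is a hypothetical helper, not part of detectpii.

```python
import json
import os


def save_pipeline_dict(dictionary: dict, path: str) -> None:
    """Write a serialized pipeline to `path` as JSON.

    The 0o600 mode applies only when the file is newly created; an existing
    file keeps its current permissions. On Windows the mode is largely ignored.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        json.dump(dictionary, handle, indent=4)


# Usage with the dictionary from the example above:
# save_pipeline_dict(pipeline_to_dict(pipeline), "pipeline.json")
```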
### Load the pipeline
```python
from detectpii.pipeline import dict_to_pipeline
# -- Load the persisted pipeline as a dictionary
dictionary: dict = ...
# -- Convert it back to a pipeline object
pipeline = dict_to_pipeline(dictionary=dictionary)
```
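If the dictionary was persisted as a JSON file, it has to be read back before it can be handed to `dict_to_pipeline`. The sketch below shows one way to do that with the standard library, plus a light sanity check for the two top-level keys (`catalog` and `scanners`) seen in the serialized output earlier. `load_pipeline_dict` is a hypothetical helper, and the check is an assumption about what a well-formed file looks like, not validation performed by detectpii.

```python
import json


def load_pipeline_dict(path: str) -> dict:
    """Read a serialized pipeline from a JSON file and sanity-check its shape."""
    with open(path, encoding="utf-8") as handle:
        dictionary = json.load(handle)

    if not isinstance(dictionary, dict):
        raise ValueError("expected a JSON object at the top level")

    # Keys taken from the serialized output shown earlier
    missing = {"catalog", "scanners"} - dictionary.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")

    return dictionary


# Usage (requires detectpii):
# pipeline = dict_to_pipeline(dictionary=load_pipeline_dict("pipeline.json"))
```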
For more detailed documentation, please see the `docs` folder.
## Supported databases / warehouses
* Postgres
* Snowflake
* Trino
* Yugabyte
Raw data
{
"_id": null,
"home_page": "https://github.com/thescalaguy/detectpii",
"name": "detectpii",
"maintainer": null,
"docs_url": null,
"requires_python": "<4.0,>=3.11",
"maintainer_email": null,
"keywords": null,
"author": "Fasih Khatib",
"author_email": "hellofasih.confound928@passinbox.com",
"download_url": "https://files.pythonhosted.org/packages/01/2c/31dc78521b213c8a4d45102c405a1b3a80296f7007b0afb4b67c68244291/detectpii-0.1.3.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": null,
"summary": "Detect PII columns in your database and warehouse",
"version": "0.1.3",
"project_urls": {
"Homepage": "https://github.com/thescalaguy/detectpii",
"Repository": "https://github.com/thescalaguy/detectpii"
},
"split_keywords": [],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "a45340e96584248ecc117954fa4ff7cfbc4a987475b224b9e09150aafa049cf5",
"md5": "982153ee2a7216ca70e11971ce03d060",
"sha256": "00b4dc3f5ff29f21da2205cb2a2e5f818a97fefbed6d4aa6b9844f93a218ccfa"
},
"downloads": -1,
"filename": "detectpii-0.1.3-py3-none-any.whl",
"has_sig": false,
"md5_digest": "982153ee2a7216ca70e11971ce03d060",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": "<4.0,>=3.11",
"size": 20504,
"upload_time": "2024-08-09T06:52:12",
"upload_time_iso_8601": "2024-08-09T06:52:12.743257Z",
"url": "https://files.pythonhosted.org/packages/a4/53/40e96584248ecc117954fa4ff7cfbc4a987475b224b9e09150aafa049cf5/detectpii-0.1.3-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "012c31dc78521b213c8a4d45102c405a1b3a80296f7007b0afb4b67c68244291",
"md5": "f7a67f3503d470edd01ad163ee284e71",
"sha256": "0b113b3cf87d139427527405ea123ef12e8a992d1b21157d844b57b78b5122c8"
},
"downloads": -1,
"filename": "detectpii-0.1.3.tar.gz",
"has_sig": false,
"md5_digest": "f7a67f3503d470edd01ad163ee284e71",
"packagetype": "sdist",
"python_version": "source",
"requires_python": "<4.0,>=3.11",
"size": 14751,
"upload_time": "2024-08-09T06:52:14",
"upload_time_iso_8601": "2024-08-09T06:52:14.995907Z",
"url": "https://files.pythonhosted.org/packages/01/2c/31dc78521b213c8a4d45102c405a1b3a80296f7007b0afb4b67c68244291/detectpii-0.1.3.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-08-09 06:52:14",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "thescalaguy",
"github_project": "detectpii",
"travis_ci": false,
"coveralls": false,
"github_actions": false,
"lcname": "detectpii"
}