[![piicatcher](https://github.com/tokern/piicatcher/actions/workflows/ci.yml/badge.svg)](https://github.com/tokern/piicatcher/actions/workflows/ci.yml)
[![PyPI](https://img.shields.io/pypi/v/piicatcher.svg)](https://pypi.python.org/pypi/piicatcher)
[![image](https://img.shields.io/pypi/l/piicatcher.svg)](https://pypi.org/project/piicatcher/)
[![image](https://img.shields.io/pypi/pyversions/piicatcher.svg)](https://pypi.org/project/piicatcher/)
[![image](https://img.shields.io/docker/v/tokern/piicatcher)](https://hub.docker.com/r/tokern/piicatcher)
# PII Catcher for Databases and Data Warehouses
## Overview
PIICatcher is a scanner for PII and PHI. It finds PII in your databases and file systems
and tracks critical data. PIICatcher uses two techniques to detect PII:
* Match regular expressions against column names.
* Match regular expressions, and use NLP libraries, against sample data in columns.
Read more in the [blog post](https://tokern.io/blog/scan-pii-data-warehouse/) on both these strategies.
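
The two techniques can be sketched in a few lines of plain Python. This is a minimal illustration, not PIICatcher's actual detector code: the patterns, the `NAME_PATTERNS` and `DATA_PATTERNS` tables, and both function names are assumptions made up for this example, and the NLP step is left out.

```python
import re
from typing import Iterable, Optional

# Hypothetical patterns for illustration only; PIICatcher ships its own set.
NAME_PATTERNS = {
    "EMAIL": re.compile(r"e[-_]?mail", re.IGNORECASE),
    "PERSON": re.compile(r"(f|l|first|last|maiden)_?name", re.IGNORECASE),
    "ADDRESS": re.compile(r"address|city|state|zip", re.IGNORECASE),
}
DATA_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(\.[\w-]+)+"),
}


def detect_from_name(column_name: str) -> Optional[str]:
    """Technique 1: match regular expressions against the column name."""
    for pii_type, pattern in NAME_PATTERNS.items():
        if pattern.search(column_name):
            return pii_type
    return None


def detect_from_data(samples: Iterable[str]) -> Optional[str]:
    """Technique 2: match regular expressions against sampled column values."""
    for value in samples:
        for pii_type, pattern in DATA_PATTERNS.items():
            if pattern.fullmatch(value):
                return pii_type
    return None


print(detect_from_name("maiden_name"))
print(detect_from_data(["n/a", "jane@example.com"]))
```

Name matching is cheap because it never reads table data, while data matching catches PII stored in columns with uninformative names such as `a` or `b`.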
PIICatcher is *batteries-included* with a growing set of plugins to scan column metadata as well as column data.
For example, [piicatcher_spacy](https://github.com/tokern/piicatcher_spacy) uses [spaCy](https://spacy.io) to detect
PII in column data.
PIICatcher supports incremental scans and will only scan new or not-yet-scanned columns. Incremental scans make it easy
to schedule scans. It also provides options to include or exclude schemas and tables to manage compute resources.
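
The idea behind incremental scans and include/exclude filters can be sketched as below. This is only an illustration under stated assumptions: `columns_to_scan`, its parameters, and the choice of regular expressions for the filters are hypothetical and do not describe PIICatcher's internal implementation or its option names.

```python
import re
from typing import Iterable, List, Sequence, Set, Tuple

Column = Tuple[str, str, str]  # (schema, table, column)


def matches_any(name: str, patterns: Sequence[str]) -> bool:
    """Return True if the whole name matches at least one pattern."""
    return any(re.fullmatch(pattern, name) for pattern in patterns)


def columns_to_scan(
    all_columns: Iterable[Column],
    already_scanned: Set[Column],
    include_schemas: Sequence[str] = (),
    exclude_tables: Sequence[str] = (),
) -> List[Column]:
    """Select columns that pass the filters and have not been scanned yet."""
    selected = []
    for schema, table, column in all_columns:
        # An empty include list means "all schemas".
        if include_schemas and not matches_any(schema, include_schemas):
            continue
        if matches_any(table, exclude_tables):
            continue
        # Incremental step: skip anything recorded by a previous scan.
        if (schema, table, column) in already_scanned:
            continue
        selected.append((schema, table, column))
    return selected


columns = [
    ("public", "users", "email"),
    ("public", "users", "fname"),
    ("public", "audit_log", "payload"),
    ("staging", "users", "email"),
]
scanned = {("public", "users", "email")}
print(columns_to_scan(columns, scanned, include_schemas=["public"], exclude_tables=["audit_.*"]))
```

Because previously scanned columns are skipped, a scheduled job only pays for columns added since the last run.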
There are ingestion functions for both [Datahub](https://datahubproject.io) and [Amundsen](https://amundsen.io) that tag columns
and tables with PII tags and the type of PII.
![PIIcatcher Screencast](https://tokern.io/static/piicatcher-2023-96c7c0d73e20427528633b4f0a0e25f4.gif)
## Resources
* [AWS Glue & Lake Formation Privilege Analyzer](https://tokern.io/blog/lake-glue-access-analyzer/) for an example of how piicatcher is used in production.
* [Two strategies to scan data warehouses](https://tokern.io/blog/scan-pii-data-warehouse/)
## Quick Start
PIICatcher is available as a Docker image or as a command-line application.
### Installation
Docker:

    alias piicatcher='docker run -v ${HOME}/.config/tokern:/config -u $(id -u ${USER}):$(id -g ${USER}) -it --add-host=host.docker.internal:host-gateway tokern/piicatcher:latest'

PyPI:

    # Install development libraries for compiling dependencies.
    # On Amazon Linux
    sudo yum install mysql-devel gcc gcc-devel python-devel

    python3 -m venv .env
    source .env/bin/activate
    pip install piicatcher

    # Install Spacy plugin
    pip install piicatcher_spacy

### Command Line Usage
    # add a sqlite source
    piicatcher catalog add-sqlite --name sqldb --path '/db/sqldb/test.db'

    # run piicatcher on a sqlite db and print report to console
    piicatcher detect --source-name sqldb
    ╭─────────────┬─────────────┬─────────────┬─────────────╮
    │ schema      │ table       │ column      │ has_pii     │
    ├─────────────┼─────────────┼─────────────┼─────────────┤
    │ main        │ full_pii    │ a           │ 1           │
    │ main        │ full_pii    │ b           │ 1           │
    │ main        │ no_pii      │ a           │ 0           │
    │ main        │ no_pii      │ b           │ 0           │
    │ main        │ partial_pii │ a           │ 1           │
    │ main        │ partial_pii │ b           │ 0           │
    ╰─────────────┴─────────────┴─────────────┴─────────────╯

### API Usage
Code Snippet:
```python3
from dbcat.api import open_catalog, add_postgresql_source
from piicatcher.api import scan_database
# PIICatcher uses a catalog to store its state.
# The easiest option is to use a sqlite memory database.
# For production usage check, https://tokern.io/docs/data-catalog
catalog = open_catalog(app_dir='/tmp/.config/piicatcher', path=':memory:', secret='my_secret')
with catalog.managed_session:
    # Add a postgresql source
    source = add_postgresql_source(catalog=catalog, name="pg_db", uri="127.0.0.1", username="piiuser",
                                   password="p11secret", database="piidb")
    output = scan_database(catalog=catalog, source=source)
print(output)
# Example Output
[
    ['public', 'sample', 'gender', 'PiiTypes.GENDER'],
    ['public', 'sample', 'maiden_name', 'PiiTypes.PERSON'],
    ['public', 'sample', 'lname', 'PiiTypes.PERSON'],
    ['public', 'sample', 'fname', 'PiiTypes.PERSON'],
    ['public', 'sample', 'address', 'PiiTypes.ADDRESS'],
    ['public', 'sample', 'city', 'PiiTypes.ADDRESS'],
    ['public', 'sample', 'state', 'PiiTypes.ADDRESS'],
    ['public', 'sample', 'email', 'PiiTypes.EMAIL']
]
```
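
Assuming the scan result is a list of rows shaped like the example output above (schema, table, column, PII type), it can be post-processed with the standard library. The helper `group_by_table` below is hypothetical and not part of piicatcher; it is a small sketch of how the rows might be grouped for reporting.

```python
from collections import defaultdict
from typing import Dict, Iterable, Sequence


def group_by_table(rows: Iterable[Sequence[str]]) -> Dict[str, Dict[str, str]]:
    """Group [schema, table, column, pii_type] rows into {"schema.table": {column: pii_type}}."""
    grouped = defaultdict(dict)
    for schema, table, column, pii_type in rows:
        grouped[f"{schema}.{table}"][column] = pii_type
    return dict(grouped)


# Rows copied from the example output above.
sample = [
    ['public', 'sample', 'gender', 'PiiTypes.GENDER'],
    ['public', 'sample', 'email', 'PiiTypes.EMAIL'],
]

for table, columns in group_by_table(sample).items():
    print(table, sorted(columns))
```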
## Plugins
PIICatcher can be extended by creating new detectors. PIICatcher supports two scanning techniques:
* Metadata
* Data
Plugins can be created for either of these two techniques. Plugins are then registered using an API or using
[Python Entry Points](https://packaging.python.org/en/latest/specifications/entry-points/).
To create a new detector, simply create a new class that inherits from [`MetadataDetector`](https://github.com/tokern/piicatcher/blob/master/piicatcher/detectors.py)
or [`DatumDetector`](https://github.com/tokern/piicatcher/blob/master/piicatcher/detectors.py).
In the new class, define a function `detect` that returns a [`PIIType`](https://github.com/tokern/dbcat/blob/main/dbcat/catalog/pii_types.py).
If you are detecting a new PII type, define a new class that inherits from `PIIType`.
For detailed documentation, check [piicatcher plugin docs](https://tokern.io/docs/piicatcher/detectors/plugins).
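
The shape of a metadata detector can be sketched as follows. To keep the example runnable without piicatcher installed, it defines local stand-ins for the base classes; the stand-in `PiiType` and `MetadataDetector`, the string-based `detect` signature, and the `LicensePlate` type are all assumptions for illustration. The real base classes, the argument passed to `detect`, and the registration step may differ, so check the plugin docs before adapting it.

```python
import re
from typing import Optional


# Stand-ins (assumptions): a real plugin would import the base classes from
# piicatcher and dbcat instead of defining them here.
class PiiType:
    name = "pii"


class MetadataDetector:
    def detect(self, column_name: str) -> Optional[PiiType]:
        raise NotImplementedError


# A new, hypothetical PII type defined by inheriting from the base type.
class LicensePlate(PiiType):
    name = "license_plate"


class LicensePlateDetector(MetadataDetector):
    """Flags columns whose name suggests a vehicle license plate."""

    pattern = re.compile(r"licen[sc]e_?plate|plate_?(no|number)", re.IGNORECASE)

    def detect(self, column_name: str) -> Optional[PiiType]:
        if self.pattern.search(column_name):
            return LicensePlate()
        return None


detector = LicensePlateDetector()
result = detector.detect("license_plate")
print(result.name if result else None)
```

A data detector would follow the same pattern, except that `detect` would inspect a sampled value rather than the column name.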
## Supported Databases
PIICatcher supports the following databases:
1. **Sqlite3** v3.24.0 or greater
2. **MySQL** 5.6 or greater
3. **PostgreSQL** 9.4 or greater
4. **AWS Redshift**
5. **AWS Athena**
6. **Snowflake**
7. **BigQuery**
## Documentation
For advanced usage, refer to the [PIICatcher Documentation](https://tokern.io/docs/piicatcher).
## Survey
Please take this [survey](https://forms.gle/Ns6QSNvfj3Pr2s9s6) if you are a user or considering using PIICatcher.
The responses will help to prioritize improvements to the project.
## Stats Collection
We use cookies to analyse our traffic and feature usage.
We may share information about your use of our product for our social media and marketing purposes.
These cookies don't collect your sensitive and/or confidential information.
If you would like to opt out of these cookies, run:
```bash
piicatcher --disable-stats
```
To enable them again:
```bash
piicatcher --enable-stats
```
## Contributing
For contribution guidelines, see the [PIICatcher Developer documentation](https://tokern.io/docs/piicatcher/development).
Raw data
{
"_id": null,
"home_page": "https://tokern.io/",
"name": "piicatcher",
"maintainer": "",
"docs_url": null,
"requires_python": ">=3.8,<=3.10.8",
"maintainer_email": "",
"keywords": "pii,postgres,snowflake,redshift,athena,bigquery",
"author": "Tokern",
"author_email": "info@tokern.io",
"download_url": "https://files.pythonhosted.org/packages/a3/a9/d6901c0027fd88229fbee4e83f3edd202a07f71bf886450985c586b7409f/piicatcher-0.21.2.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "Apache 2.0",
"summary": "Find PII data in databases",
"version": "0.21.2",
"project_urls": {
"Homepage": "https://tokern.io/",
"Repository": "https://github.com/tokern/piicatcher/"
},
"split_keywords": [
"pii",
"postgres",
"snowflake",
"redshift",
"athena",
"bigquery"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "f22e81ac36cd26ec5651f2a04e551fd5f87fe19bf3341304f573a6e318665a84",
"md5": "acf45016dc23590b0d32477e370d2a83",
"sha256": "4f431baf43afc09340148affdbf2f6f194a4f52a451a51fbd265677c21d89ef2"
},
"downloads": -1,
"filename": "piicatcher-0.21.2-py3-none-any.whl",
"has_sig": false,
"md5_digest": "acf45016dc23590b0d32477e370d2a83",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.8,<=3.10.8",
"size": 19673,
"upload_time": "2023-07-06T02:28:51",
"upload_time_iso_8601": "2023-07-06T02:28:51.394898Z",
"url": "https://files.pythonhosted.org/packages/f2/2e/81ac36cd26ec5651f2a04e551fd5f87fe19bf3341304f573a6e318665a84/piicatcher-0.21.2-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "a3a9d6901c0027fd88229fbee4e83f3edd202a07f71bf886450985c586b7409f",
"md5": "63049dbde6133d512e6dc1ed44e66863",
"sha256": "b12ab887d53e9d411f29657d877c4451367903dd10077af21f45f062518005e6"
},
"downloads": -1,
"filename": "piicatcher-0.21.2.tar.gz",
"has_sig": false,
"md5_digest": "63049dbde6133d512e6dc1ed44e66863",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.8,<=3.10.8",
"size": 19058,
"upload_time": "2023-07-06T02:28:52",
"upload_time_iso_8601": "2023-07-06T02:28:52.820879Z",
"url": "https://files.pythonhosted.org/packages/a3/a9/d6901c0027fd88229fbee4e83f3edd202a07f71bf886450985c586b7409f/piicatcher-0.21.2.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2023-07-06 02:28:52",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "tokern",
"github_project": "piicatcher",
"travis_ci": false,
"coveralls": true,
"github_actions": true,
"lcname": "piicatcher"
}