piicatcher


* Name: piicatcher
* Version: 0.20.1
* Home page: https://tokern.io/
* Summary: Find PII data in databases
* Upload time: 2022-11-30 10:17:38
* Author: Tokern
* Requires Python: >=3.8,<=3.10
* License: Apache 2.0
* Keywords: pii, postgres, snowflake, redshift, athena

[![piicatcher](https://github.com/tokern/piicatcher/actions/workflows/ci.yml/badge.svg)](https://github.com/tokern/piicatcher/actions/workflows/ci.yml)
[![PyPI](https://img.shields.io/pypi/v/piicatcher.svg)](https://pypi.python.org/pypi/piicatcher)
[![image](https://img.shields.io/pypi/l/piicatcher.svg)](https://pypi.org/project/piicatcher/)
[![image](https://img.shields.io/pypi/pyversions/piicatcher.svg)](https://pypi.org/project/piicatcher/)
[![image](https://img.shields.io/docker/v/tokern/piicatcher)](https://hub.docker.com/r/tokern/piicatcher)

# PII Catcher for Databases and Data Warehouses

## Overview

PIICatcher is a scanner for PII and PHI. It finds PII in your databases and file systems
and tracks critical data. PIICatcher uses two techniques to detect PII:

* Match regular expressions against column names.
* Match regular expressions and apply NLP libraries to sample data in columns.

Read more in the [blog post](https://tokern.io/blog/scan-pii-data-warehouse/) on both these strategies.
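
The two techniques above can be illustrated with a minimal, self-contained sketch. The patterns and the function names `detect_from_name` and `detect_from_samples` are hypothetical and chosen only for illustration; they are not PIICatcher's own rules or API, and the NLP step is omitted.

```python
import re
from typing import Iterable, Optional

# Illustrative patterns only (assumption): PIICatcher ships its own rules.
COLUMN_NAME_PATTERNS = {
    "EMAIL": re.compile(r"e[-_]?mail", re.IGNORECASE),
    "PERSON": re.compile(r"^(f|l|first|last|maiden)_?name$", re.IGNORECASE),
}
DATA_PATTERNS = {
    "EMAIL": re.compile(r"^[\w.+-]+@[\w-]+(\.[\w-]+)+$"),
}


def detect_from_name(column_name: str) -> Optional[str]:
    """Technique 1: match regular expressions against the column name."""
    for pii_type, pattern in COLUMN_NAME_PATTERNS.items():
        if pattern.search(column_name):
            return pii_type
    return None


def detect_from_samples(samples: Iterable[str]) -> Optional[str]:
    """Technique 2: match regular expressions against sampled column values."""
    for value in samples:
        for pii_type, pattern in DATA_PATTERNS.items():
            if pattern.match(value):
                return pii_type
    return None
```

Scanning names is cheap because it never reads table data, while scanning samples can catch PII stored in columns whose names give no hint.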

PIICatcher is *batteries-included* with a growing set of plugins to scan column metadata as well as column data.
For example, [piicatcher_spacy](https://github.com/tokern/piicatcher_spacy) uses [Spacy](https://spacy.io) to detect
PII in column data.

PIICatcher supports incremental scans and will only scan new or not-yet-scanned columns. Incremental scans make it easy
to schedule scans. It also provides options to include or exclude schemas and tables to manage compute resources.
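
As a rough sketch of the idea, the selection step of an incremental scan with include and exclude filters might look like the following. The function `select_columns`, its parameters, and the use of regular expressions on table names are assumptions made for illustration; they do not describe PIICatcher's internals.

```python
import re
from typing import Iterable, List, Optional, Set, Tuple

Column = Tuple[str, str, str]  # (schema, table, column)


def select_columns(
    columns: Iterable[Column],
    already_scanned: Set[Column],
    include_table: Optional[str] = None,
    exclude_table: Optional[str] = None,
) -> List[Column]:
    """Return the columns that still need a scan (hypothetical helper)."""
    include = re.compile(include_table) if include_table else None
    exclude = re.compile(exclude_table) if exclude_table else None
    selected = []
    for col in columns:
        _, table, _ = col
        if col in already_scanned:
            continue  # incremental: skip columns scanned in an earlier run
        if include and not include.search(table):
            continue  # an include filter was given and the table is not in it
        if exclude and exclude.search(table):
            continue  # the table is explicitly excluded
        selected.append(col)
    return selected
```

Because columns scanned earlier are skipped, a scheduled job only pays for what changed since the previous run.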

There are ingestion functions for both [Datahub](https://datahubproject.io) and [Amundsen](https://amundsen.io), which tag columns
and tables with PII tags and the type of PII.

![PIIcatcher Screencast](https://user-images.githubusercontent.com/1638298/143765818-87c7059a-f971-447b-83ca-e21182e28051.gif)


## Resources

* [AWS Glue & Lake Formation Privilege Analyzer](https://tokern.io/blog/lake-glue-access-analyzer/) for an example of how piicatcher is used in production.
* [Two strategies to scan data warehouses](https://tokern.io/blog/scan-pii-data-warehouse/)

## Quick Start

PIICatcher is available as a Docker image or a command-line application.

### Installation

Docker:

    alias piicatcher='docker run -v ${HOME}/.config/tokern:/config -u $(id -u ${USER}):$(id -g ${USER}) -it --add-host=host.docker.internal:host-gateway tokern/piicatcher:latest'


PyPI:

    # Install development libraries for compiling dependencies.
    # On Amazon Linux
    sudo yum install mysql-devel gcc gcc-devel python-devel

    python3 -m venv .env
    source .env/bin/activate
    pip install piicatcher

    # Install Spacy plugin
    pip install piicatcher_spacy


### Command Line Usage
    # add a sqlite source
    piicatcher catalog add_sqlite --name sqldb --path '/db/sqldb'

    # run piicatcher on a sqlite db and print report to console
    piicatcher detect --source-name sqldb
    ╭─────────────┬─────────────┬─────────────┬─────────────╮
    │   schema    │    table    │   column    │   has_pii   │
    ├─────────────┼─────────────┼─────────────┼─────────────┤
    │        main │    full_pii │           a │           1 │
    │        main │    full_pii │           b │           1 │
    │        main │      no_pii │           a │           0 │
    │        main │      no_pii │           b │           0 │
    │        main │ partial_pii │           a │           1 │
    │        main │ partial_pii │           b │           0 │
    ╰─────────────┴─────────────┴─────────────┴─────────────╯


### API Usage
```python3
from dbcat.api import open_catalog, add_postgresql_source
from piicatcher.api import scan_database

# PIICatcher uses a catalog to store its state.
# The easiest option is to use an in-memory sqlite database.
# For production usage, see https://tokern.io/docs/data-catalog
catalog = open_catalog(app_dir='/tmp/.config/piicatcher', path=':memory:', secret='my_secret')

with catalog.managed_session:
    # Add a postgresql source
    source = add_postgresql_source(catalog=catalog, name="pg_db", uri="127.0.0.1", username="piiuser",
                                    password="p11secret", database="piidb")
    output = scan_database(catalog=catalog, source=source)

print(output)

# Example Output
[['public', 'sample', 'gender', 'PiiTypes.GENDER'], 
 ['public', 'sample', 'maiden_name', 'PiiTypes.PERSON'], 
 ['public', 'sample', 'lname', 'PiiTypes.PERSON'], 
 ['public', 'sample', 'fname', 'PiiTypes.PERSON'], 
 ['public', 'sample', 'address', 'PiiTypes.ADDRESS'], 
 ['public', 'sample', 'city', 'PiiTypes.ADDRESS'], 
 ['public', 'sample', 'state', 'PiiTypes.ADDRESS'], 
 ['public', 'sample', 'email', 'PiiTypes.EMAIL']]
```
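
The example output is a list of `[schema, table, column, pii_type]` rows, so it can be post-processed with plain Python. The helper below is a small sketch, not part of the piicatcher API; `group_by_table` is a hypothetical name, and it assumes each row has exactly the four string fields shown in the example output.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def group_by_table(rows: Sequence[Sequence[str]]) -> Dict[str, List[str]]:
    """Group scan rows by 'schema.table' (hypothetical helper)."""
    grouped = defaultdict(list)
    for schema, table, column, pii_type in rows:
        grouped[f"{schema}.{table}"].append(f"{column} ({pii_type})")
    return dict(grouped)


rows = [
    ["public", "sample", "gender", "PiiTypes.GENDER"],
    ["public", "sample", "email", "PiiTypes.EMAIL"],
]
for table, columns in group_by_table(rows).items():
    print(table, "->", ", ".join(columns))
```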

## Plugins

PIICatcher can be extended by creating new detectors. PIICatcher supports two scanning techniques:
* Metadata
* Data

Plugins can be created for either of these two techniques. Plugins are then registered using an API or using
[Python Entry Points](https://packaging.python.org/en/latest/specifications/entry-points/).

To create a new detector, simply create a new class that inherits from [`MetadataDetector`](https://github.com/tokern/piicatcher/blob/master/piicatcher/detectors.py)
or [`DatumDetector`](https://github.com/tokern/piicatcher/blob/master/piicatcher/detectors.py).

In the new class, define a function `detect` that returns a [`PIIType`](https://github.com/tokern/dbcat/blob/main/dbcat/catalog/pii_types.py).
If you are detecting a new PII type, define a new class that inherits from `PIIType`.

For detailed documentation, check [piicatcher plugin docs](https://tokern.io/docs/piicatcher/detectors/plugins).
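
The shape of a data detector described above can be sketched without installing anything. Everything in this block is a stand-in: `PiiType` and `BaseDatumDetector` only imitate the roles of the real base classes, and `Iban`, `IbanDetector`, and the `detect` signature are hypothetical. Check the plugin docs for the actual class names, signatures, and registration calls.

```python
import re
from typing import Optional


class PiiType:
    """Stand-in for the PII type base class (assumption)."""


class Iban(PiiType):
    """Hypothetical new PII type for bank account numbers."""


class BaseDatumDetector:
    """Stand-in for a data detector base class (assumption)."""

    def detect(self, column_name: str, datum: str) -> Optional[PiiType]:
        raise NotImplementedError


class IbanDetector(BaseDatumDetector):
    # Loose structural check only: country code, two check digits, account part.
    _pattern = re.compile(r"^[A-Z]{2}\d{2}[A-Z0-9]{11,30}$")

    def detect(self, column_name: str, datum: str) -> Optional[PiiType]:
        compact = datum.replace(" ", "").upper()
        return Iban() if self._pattern.match(compact) else None
```

Returning `None` for values that do not match lets the scanner fall through to other detectors.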


## Supported Databases

PIICatcher supports the following databases:
1. **Sqlite3** v3.24.0 or greater
2. **MySQL** 5.6 or greater
3. **PostgreSQL** 9.4 or greater
4. **AWS Redshift**
5. **AWS Athena**
6. **Snowflake**

## Documentation

For advanced usage, refer to the [PIICatcher documentation](https://tokern.io/docs/piicatcher).

## Survey

Please take this [survey](https://forms.gle/Ns6QSNvfj3Pr2s9s6) if you are a user or considering using PIICatcher. 
The responses will help to prioritize improvements to the project.

## Contributing

For contribution guidelines, see the [PIICatcher developer documentation](https://tokern.io/docs/piicatcher/development).


            
