privy-presidio-utils

Name: privy-presidio-utils
Version: 0.0.51
Home page: https://www.github.com/microsoft/presidio-research
Summary: Read the latest Real Python tutorials
Upload time: 2022-09-28 04:58:35
Keywords: nlp, pii
License: MIT License, Copyright (c) Microsoft Corporation
Requirements: spacy>=3.2.0, numpy>=1.20.2, jupyter>=1, pandas>=1.2.4, tqdm>=4.60.0, haikunator>=2.1.0, schwifty, faker>=9.6.0, scikit_learn, pytest>=6.2.3, presidio_analyzer, presidio_anonymizer, requests>=2.25.1, xmltodict>=0.12.0, python-dotenv
# Fork of presidio-research, modifying some utility functions

This package features data-science-related tasks for developing new recognizers for
[Presidio](https://github.com/microsoft/presidio).
It is used to evaluate the entire system,
as well as specific PII recognizers or PII detection models.
In addition, it contains a fake data generator that creates fake sentences based on templates and fake PII.

## Who should use it?

- Anyone interested in **developing or evaluating PII detection models**, an existing Presidio instance, or a Presidio PII recognizer.
- Anyone interested in **generating new data based on previous datasets or sentence templates** (e.g. to increase the coverage of entity values) for Named Entity Recognition models.

## Getting started

To install the package:
1. Clone the repo
2. Install all dependencies, preferably in a virtual environment:

``` sh
# Create conda env (optional)
conda create --name presidio python=3.9
conda activate presidio

# Install package+dependencies
pip install -r requirements.txt
python setup.py install

# Download a spaCy model used by presidio-analyzer
python -m spacy download en_core_web_lg

# Verify installation
pytest
```

Note that some dependencies (such as Flair and Stanza) are not automatically installed to reduce installation complexity.

## What's in this package?

1. **Fake data generator** for PII recognizers and NER models
2. **Data representation layer** for data generation, modeling and analysis
3. Multiple **Model/Recognizer evaluation** files (e.g. for spaCy, Flair, CRF, the Presidio API, the Presidio Analyzer Python package, and specific Presidio recognizers)
4. **Training and modeling code** for multiple models
5. Helper functions for **results analysis**

## 1. Data generation

See [Data Generator README](presidio_evaluator/data_generator/README.md) for more details.

The data generation process receives a file with templates, e.g. `My name is {{name}}`.
It then creates new synthetic sentences by sampling templates and PII values.
Furthermore, it tokenizes the data and creates tags (in IO/BIO/BILUO format) and spans for the newly created samples.
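To illustrate the tagging schemes, the following minimal sketch (a standalone example, not the package's internal implementation) tags a single entity span in BIO format:

```python
# Standalone illustration of BIO tagging; the package's own tokenizer
# and tagger handle multiple spans and the IO/BILUO schemes as well.

def to_bio(tokens, span_start, span_end, entity):
    """Tag tokens[span_start:span_end] as one entity in BIO format."""
    tags = ["O"] * len(tokens)
    for i in range(span_start, span_end):
        tags[i] = ("B-" if i == span_start else "I-") + entity
    return tags

tokens = ["My", "name", "is", "John", "Smith"]
print(to_bio(tokens, 3, 5, "PERSON"))
# ['O', 'O', 'O', 'B-PERSON', 'I-PERSON']
```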

- For information on data generation/augmentation, see the data generator [README](presidio_evaluator/data_generator/README.md).
- For an example of running the generation process, see [this notebook](notebooks/1_Generate_data.ipynb).
- For an understanding of the underlying fake PII data used, see this [exploratory data analysis notebook](notebooks/2_PII_EDA.ipynb).
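The core idea of template sampling can be sketched in a few lines. This is a simplified stand-in (the real generator uses Faker providers and also emits tokens, tags, and spans); the value pools and `generate` helper here are illustrative only:

```python
import random
import re

# Tiny illustrative pools of fake PII values; the package's generator
# draws values from Faker providers instead.
fake_values = {
    "name": ["John Smith", "Maria Garcia"],
    "city": ["Seattle", "Lisbon"],
}
templates = ["My name is {{name}}", "I live in {{city}}"]

def generate(n, seed=0):
    """Sample templates and fill each {{placeholder}} with a fake value."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        template = rng.choice(templates)
        samples.append(
            re.sub(r"\{\{(\w+)\}\}",
                   lambda m: rng.choice(fake_values[m.group(1)]),
                   template)
        )
    return samples

print(generate(3))
```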

Once data is generated, it can be split into train/test/validation sets
while ensuring that each template only exists in one set.
See [this notebook for more details](notebooks/3_Split_by_pattern_%23.ipynb).
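The split-by-template idea can be sketched as follows. This is an illustrative helper (the linked notebook shows the package's own approach), assuming each sample carries a `template_id`:

```python
import random
from collections import defaultdict

def split_by_template(samples, ratios=(0.7, 0.15, 0.15), seed=42):
    """Split samples into train/val/test so that all samples generated
    from the same template land in the same split, avoiding leakage
    of sentence patterns between sets."""
    by_template = defaultdict(list)
    for sample in samples:
        by_template[sample["template_id"]].append(sample)

    template_ids = sorted(by_template)
    random.Random(seed).shuffle(template_ids)

    n = len(template_ids)
    cut1 = int(n * ratios[0])
    cut2 = cut1 + int(n * ratios[1])

    train = [s for t in template_ids[:cut1] for s in by_template[t]]
    val = [s for t in template_ids[cut1:cut2] for s in by_template[t]]
    test = [s for t in template_ids[cut2:] for s in by_template[t]]
    return train, val, test
```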

## 2. Data representation

In order to standardize the process, 
we use specific data objects that hold all the information needed for generating, 
analyzing, modeling and evaluating data and models. Specifically, 
see [data_objects.py](presidio_evaluator/data_objects.py).

The standardized structure, a `List[InputSample]`, can be translated into different formats:
- CoNLL
```python
from presidio_evaluator import InputSample
dataset = InputSample.read_dataset_json("data/synth_dataset_v2.json")
conll = InputSample.create_conll_dataset(dataset)
conll.to_csv("dataset.csv", sep="\t")
```

- spaCy v3
```python
from presidio_evaluator import InputSample
dataset = InputSample.read_dataset_json("data/synth_dataset_v2.json")
InputSample.create_spacy_dataset(dataset, output_path="dataset.spacy")
```

- Flair
```python
from presidio_evaluator import InputSample
dataset = InputSample.read_dataset_json("data/synth_dataset_v2.json")
flair = InputSample.create_flair_dataset(dataset)
```

- JSON
```python
from presidio_evaluator import InputSample
dataset = InputSample.read_dataset_json("data/synth_dataset_v2.json")
InputSample.to_json(dataset, output_file="dataset_json")
```

## 3. PII models evaluation

The presidio-evaluator framework allows you to evaluate Presidio as a system, an NER model,
or a specific PII recognizer in terms of precision, recall, and error analysis.
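At its core, such an evaluation compares predicted entity spans against gold spans. The following is a simplified exact-match sketch (the package's own evaluator additionally supports per-entity breakdowns, partial matches, and error analysis):

```python
def span_precision_recall(gold, predicted):
    """Exact-match precision/recall over (start, end, entity) spans.
    A simplified stand-in for the package's evaluation logic."""
    gold_set, pred_set = set(gold), set(predicted)
    true_positives = len(gold_set & pred_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    return precision, recall

gold = [(11, 21, "PERSON"), (30, 38, "LOCATION")]
pred = [(11, 21, "PERSON"), (40, 45, "DATE_TIME")]
print(span_precision_recall(gold, pred))  # (0.5, 0.5)
```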


### Examples:
- [Evaluate Presidio](notebooks/4_Evaluate_Presidio_Analyzer.ipynb)
- [Evaluate spaCy models](notebooks/models/Evaluate%20spacy%20models.ipynb)
- [Evaluate Stanza models](notebooks/models/Evaluate%20stanza%20models.ipynb)
- [Evaluate CRF models](notebooks/models/Evaluate%20CRF%20models.ipynb)
- [Evaluate Flair models](notebooks/models/Evaluate%20flair%20models.ipynb)


## 4. Training PII detection models

### CRF

To train a vanilla CRF on a new dataset, see [this notebook](notebooks/models/Train%20CRF.ipynb). To evaluate, see [this notebook](notebooks/models/Evaluate%20CRF%20models.ipynb).

### spaCy

To train a new spaCy model, first save the dataset in a spaCy format:
```python
# dataset is a List[InputSample]
InputSample.create_spacy_dataset(dataset, output_path="dataset.spacy")
```

To evaluate, see [this notebook](notebooks/models/Evaluate%20spacy%20models.ipynb).

### Flair

- To train Flair models, see this [helper class](presidio_evaluator/models/flair_train.py) or this snippet:
```python
from presidio_evaluator.models import FlairTrainer
train_samples = "data/generated_train.json"
test_samples = "data/generated_test.json"
val_samples = "data/generated_validation.json"

trainer = FlairTrainer()
trainer.create_flair_corpus(train_samples, test_samples, val_samples)

corpus = trainer.read_corpus("")
trainer.train(corpus)
```

> Note that the three JSON files are created using `InputSample.to_json`.

## For more information


- [Blog post on NLP approaches to data anonymization](https://towardsdatascience.com/nlp-approaches-to-data-anonymization-1fb5bde6b929)
- [Conference talk about leveraging Presidio and utilizing NLP approaches for data anonymization](https://youtu.be/Tl773LANRwY)

# Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a
Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us
the rights to use your contribution. For details, visit <https://cla.opensource.microsoft.com>.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide
a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions
provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/).
For more information see the [Code of Conduct FAQ](https://opensource.microsoft.com/codeofconduct/faq/) or
contact [opencode@microsoft.com](mailto:opencode@microsoft.com) with any additional questions or comments.

Copyright notice:

Fake Name Generator identities by the [Fake Name Generator](https://www.fakenamegenerator.com/)
are licensed under a [Creative Commons Attribution-Share Alike 3.0 United States License](http://creativecommons.org/licenses/by-sa/3.0/us/).
Fake Name Generator and the Fake Name Generator logo are trademarks of Corban Works, LLC.

            
