Name | nerpii |
Version | 0.2.3 |
home_page | https://github.com/Clearbox-AI/nerpii |
Summary | A python library to perform NER on structured data and generate PII with Faker |
upload_time | 2024-05-03 10:23:30 |
maintainer | None |
docs_url | None |
author | Clearbox AI |
requires_python | <4.0,>=3.9 |
license | GPL |
keywords | |
VCS | |
bugtrack_url | |
requirements | No requirements were recorded. |
Travis-CI | No Travis. |
coveralls test coverage | No coveralls. |
# Nerpii
Nerpii is a Python library that performs Named Entity Recognition (NER) on structured datasets and synthesizes Personally Identifiable Information (PII).
NER is performed with [Presidio](https://github.com/microsoft/presidio) and with an [NLP model](https://huggingface.co/dslim/bert-base-NER) available on HuggingFace, while PII generation is based on [Faker](https://faker.readthedocs.io/en/master/).
## Installation
You can install Nerpii with pip:
```shell
pip install nerpii
```
## Quickstart
### Named Entity Recognition
You can import the NamedEntityRecognizer using
```python
from nerpii.named_entity_recognizer import NamedEntityRecognizer
```
You can create a recognizer by passing either the path to a CSV file or a pandas DataFrame:
```python
# lang is the dataset language: "en" (default) or "it"
recognizer = NamedEntityRecognizer('./csv_path.csv', lang='en')
```
The <strong>lang</strong> parameter defines the language of the dataset. The default value is <strong>en</strong> (English); <strong>it</strong> (Italian) can also be selected.
If the dataset contains a column with full names, that is, first and last name together (e.g. John Smith), split it into two columns called <strong>first_name</strong> and <strong>last_name</strong> with the `split_name()` function before creating a recognizer.
```python
from nerpii.named_entity_recognizer import split_name
# Replace the second argument with the name of the column that holds full names
df = split_name('./csv_path.csv', 'name_of_column_to_split')
```
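To make the expected result of the split concrete, here is a minimal standard-library sketch of the same idea. It is not the implementation of `split_name()`: the helper names `split_full_name` and `split_name_column` are hypothetical, rows are plain dictionaries rather than a DataFrame, and the rule "everything after the first space is the last name" is an assumption made for illustration.

```python
def split_full_name(full_name: str) -> tuple[str, str]:
    """Split 'John Smith' into ('John', 'Smith').

    Assumption: the first whitespace-separated token is the first name
    and everything after it is the last name.
    """
    parts = full_name.strip().split(maxsplit=1)
    first = parts[0] if parts else ""
    last = parts[1] if len(parts) > 1 else ""
    return first, last


def split_name_column(rows: list[dict], column: str) -> list[dict]:
    """Return new rows where `column` is replaced by first_name and last_name."""
    result = []
    for row in rows:
        first, last = split_full_name(row[column])
        # Copy every field except the original full-name column
        new_row = {key: value for key, value in row.items() if key != column}
        new_row["first_name"] = first
        new_row["last_name"] = last
        result.append(new_row)
    return result


rows = [
    {"name": "John Smith", "city": "Turin"},
    {"name": "Maria Rossi Bianchi", "city": "Rome"},
]
split_rows = split_name_column(rows, "name")
for row in split_rows:
    print(row)
```

After the split, the dataset no longer has a single full-name column, which is the layout the recognizer expects according to the note above.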
The NamedEntityRecognizer class contains three methods to perform NER on a dataset:
```python
recognizer.assign_entities_with_presidio()
```
which assigns the Presidio entities listed [here](https://microsoft.github.io/presidio/supported_entities/),
```python
recognizer.assign_entities_manually()
```
which manually assigns the ZIPCODE and CREDIT_CARD_NUMBER entities, and
```python
recognizer.assign_organization_entity_with_model()
```
which assigns the ORGANIZATION entity using an [NLP model](https://huggingface.co/dslim/bert-base-NER) available on HuggingFace.
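The README does not describe how the manual step decides that a column holds zip codes or credit card numbers. As an illustration of what rule-based column tagging can look like, the sketch below uses a regular expression for US-style ZIP codes and the Luhn checksum for card numbers. Everything in it (`ZIP_PATTERN`, `luhn_valid`, `guess_entity`, the 0.8 threshold) is hypothetical and not taken from Nerpii.

```python
import re

# Assumption: US-style ZIP codes, 5 digits with an optional 4-digit suffix
ZIP_PATTERN = re.compile(r"^\d{5}(-\d{4})?$")


def luhn_valid(number: str) -> bool:
    """Check a card number with the Luhn checksum, ignoring spaces and dashes."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    if not 13 <= len(digits) <= 19:  # typical card number lengths
        return False
    total = 0
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:  # double every second digit from the right
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


def guess_entity(values, threshold=0.8):
    """Return (entity, share_of_matching_values) or None for a column of strings."""
    cleaned = [value.strip() for value in values if value and value.strip()]
    if not cleaned:
        return None
    card_share = sum(luhn_valid(value) for value in cleaned) / len(cleaned)
    zip_share = sum(bool(ZIP_PATTERN.match(value)) for value in cleaned) / len(cleaned)
    if card_share >= threshold:
        return ("CREDIT_CARD_NUMBER", card_share)
    if zip_share >= threshold:
        return ("ZIPCODE", zip_share)
    return None


print(guess_entity(["10121", "00184", "20121-1234"]))
print(guess_entity(["4111 1111 1111 1111", "5500000000000004"]))
print(guess_entity(["hello", "world"]))
```

Using the share of matching values as a score mirrors the idea of attaching a confidence to each assigned entity, although the way Nerpii computes its own score may differ.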
To perform NER, run these three methods in sequence, as shown below:
```python
recognizer.assign_entities_with_presidio()
recognizer.assign_entities_manually()
recognizer.assign_organization_entity_with_model()
```
The final output is a dictionary whose keys are the column names and whose values contain the entity assigned to each column together with a confidence score.
This dictionary can be accessed using
```python
recognizer.dict_global_entities
```
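One common use of such a dictionary is to keep only the columns whose entity was assigned with enough confidence. The sketch below assumes a hypothetical layout in which each value is either `None` or a mapping with `entity` and `confidence_score` keys; the README does not specify the exact structure, so inspect `recognizer.dict_global_entities` on your own data before relying on these key names.

```python
# Hypothetical example of the dictionary's shape (key names are an assumption)
dict_global_entities = {
    "first_name": {"entity": "PERSON", "confidence_score": 0.95},
    "email": {"entity": "EMAIL_ADDRESS", "confidence_score": 1.0},
    "notes": {"entity": "LOCATION", "confidence_score": 0.35},
    "amount": None,  # no entity assigned to this column
}


def confident_columns(entities: dict, min_score: float = 0.5) -> dict:
    """Map column name -> entity for columns at or above the minimum score."""
    selected = {}
    for column, info in entities.items():
        if info is None:
            continue  # skip columns without an assigned entity
        if info["confidence_score"] >= min_score:
            selected[column] = info["entity"]
    return selected


pii_columns = confident_columns(dict_global_entities)
print(pii_columns)
```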
### PII generation
After performing NER on a dataset, you can generate new PII using Faker.
You can import the FakerGenerator using
```python
from nerpii.faker_generator import FakerGenerator
```
You can create a generator using
```python
generator = FakerGenerator(dataset, recognizer.dict_global_entities)
```
To generate Italian PII, pass `lang="it"` as an additional argument when creating the generator (default: `lang="en"`).
To generate new PII you can run
```python
generator.get_faker_generation()
```
The method above can generate the following PII:
* address
* phone number
* email address
* first name
* last name
* city
* state
* url
* zipcode
* credit card
* ssn
* country
## Examples
You can find an example notebook in the [notebooks](https://github.com/Clearbox-AI/nerpii/tree/main/notebooks) folder.
Raw data
{
"_id": null,
"home_page": "https://github.com/Clearbox-AI/nerpii",
"name": "nerpii",
"maintainer": null,
"docs_url": null,
"requires_python": "<4.0,>=3.9",
"maintainer_email": null,
"keywords": null,
"author": "Clearbox AI",
"author_email": "info@clearbox.ai",
"download_url": "https://files.pythonhosted.org/packages/7a/23/99d0ba7152419daeda652596723136ad38cc6bd0f167c1bb334b9736f3c9/nerpii-0.2.3.tar.gz",
"platform": null,
"description": "# Nerpii \nNerpii is a Python library developed to perform Named Entity Recognition (NER) on structured datasets and synthesize Personal Identifiable Information (PII).\n\nNER is performed with [Presidio](https://github.com/microsoft/presidio) and with a [NLP model](https://huggingface.co/dslim/bert-base-NER) available on HuggingFace, while the PII generation is based on [Faker](https://faker.readthedocs.io/en/master/).\n\n## Installation\nYou can install Nerpii by using pip: \n\n```python\npip install nerpii\n```\n## Quickstart\n### Named Entity Recognition\nYou can import the NamedEntityRecognizer using\n```python\nfrom nerpii.named_entity_recognizer import NamedEntityRecognizer\n```\nYou can create a recognizer passing as parameter a path to a csv file or a Pandas Dataframe\n\n```python\nrecognizer = NamedEntityRecognizer('./csv_path.csv', lang)\n```\nThe <strong>lang</strong> parameter is used to define the language of the dataset. The deafult value is <strong>en</strong> (english), but it can be also selelcted <strong>it</strong> (italian).\n\nPlease note that if there are columns in the dataset containing names of people consisting of first and last names (e.g. 
John Smith), before creating a recognizer, it is necessary to split the name into two different columns called <strong>first_name</strong> and <strong>last_name</strong> using the function `split_name()`.\n\n```python\nfrom nerpii.named_entity_recognizer import split_name\n\ndf = split_name('./csv_path.csv', name_of_column_to_split)\n```\nThe NamedEntityRecognizer class contains three methods to perform NER on a dataset:\n\n```python\nrecognizer.assign_entities_with_presidio()\n```\nwhich assigns Presidio entities, listed [here](https://microsoft.github.io/presidio/supported_entities/)\n\n```python\nrecognizer.assign_entities_manually()\n```\nwhich assigns manually ZIPCODE and CREDIT_CARD_NUMBER entities \n\n```python\nrecognizer.assign_organization_entity_with_model()\n```\nwhich assigns ORGANIZATION entity using a [NLP model](https://huggingface.co/dslim/bert-base-NER) available on HuggingFace.\n\nTo perform NER, you have to run these three methods sequentially, as reported below:\n\n```python\nrecognizer.assign_entities_with_presidio()\nrecognizer.assign_entities_manually()\nrecognizer.assign_organization_entity_with_model()\n```\n\nThe final output is a dictionary in which column names are given as keys and assigned entities and a confidence score as values.\n\nThis dictionary can be accessed using\n\n```python\nrecognizer.dict_global_entities\n```\n\n### PII generation \n\nAfter performing NER on a dataset, you can generate new PII using Faker. 
\n\nYou can import the FakerGenerator using \n\n```python\nfrom nerpii.faker_generator import FakerGenerator\n```\n\nYou can create a generator using\n\n```python\ngenerator = FakerGenerator(dataset, recognizer.dict_global_entities)\n```\nIf you want to generate Italian PII, add ```lang = \"it\"``` as parameter to the previous object (default: ```lang = \"en\"```)\n\nTo generate new PII you can run\n\n```python\ngenerator.get_faker_generation()\n```\nThe method above can generate the following PII:\n* address\n* phone number\n* email naddress\n* first name\n* last name\n* city\n* state\n* url\n* zipcode\n* credit card\n* ssn\n* country\n\n## Examples\n\nYou can find a notebook example in the [notebook](https://github.com/Clearbox-AI/nerpii/tree/main/notebooks) folder. \n\n\n",
"bugtrack_url": null,
"license": "GPL",
"summary": "A python library to perform NER on structured data and generate PII with Faker",
"version": "0.2.3",
"project_urls": {
"Homepage": "https://github.com/Clearbox-AI/nerpii",
"Repository": "https://github.com/Clearbox-AI/nerpii"
},
"split_keywords": [],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "437d070b58f56bc2c5cfdbfecc725dd0323fb935460ddc1a09c65ab798982c03",
"md5": "e6dd35154349542110d914313b72d934",
"sha256": "4061b8b6204d9e3e0230e573392e5533634f9b7d89f6bdcf060f391dea17c745"
},
"downloads": -1,
"filename": "nerpii-0.2.3-py3-none-any.whl",
"has_sig": false,
"md5_digest": "e6dd35154349542110d914313b72d934",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": "<4.0,>=3.9",
"size": 9783,
"upload_time": "2024-05-03T10:23:28",
"upload_time_iso_8601": "2024-05-03T10:23:28.801589Z",
"url": "https://files.pythonhosted.org/packages/43/7d/070b58f56bc2c5cfdbfecc725dd0323fb935460ddc1a09c65ab798982c03/nerpii-0.2.3-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "7a2399d0ba7152419daeda652596723136ad38cc6bd0f167c1bb334b9736f3c9",
"md5": "77ce2bd96623fa7d0840862a3987820e",
"sha256": "11b9b8ab98d7939abfc790425de0fa614f77e38fa7107a243ed55d7b2ea2bb59"
},
"downloads": -1,
"filename": "nerpii-0.2.3.tar.gz",
"has_sig": false,
"md5_digest": "77ce2bd96623fa7d0840862a3987820e",
"packagetype": "sdist",
"python_version": "source",
"requires_python": "<4.0,>=3.9",
"size": 9503,
"upload_time": "2024-05-03T10:23:30",
"upload_time_iso_8601": "2024-05-03T10:23:30.416559Z",
"url": "https://files.pythonhosted.org/packages/7a/23/99d0ba7152419daeda652596723136ad38cc6bd0f167c1bb334b9736f3c9/nerpii-0.2.3.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-05-03 10:23:30",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "Clearbox-AI",
"github_project": "nerpii",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"lcname": "nerpii"
}