nerpii


Name: nerpii
Version: 0.2.3
home_page: https://github.com/Clearbox-AI/nerpii
Summary: A python library to perform NER on structured data and generate PII with Faker
upload_time: 2024-05-03 10:23:30
maintainer: None
docs_url: None
author: Clearbox AI
requires_python: <4.0,>=3.9
license: GPL
requirements: No requirements were recorded.
Travis-CI: No Travis.
coveralls test coverage: No coveralls.

# Nerpii
Nerpii is a Python library that performs Named Entity Recognition (NER) on structured datasets and synthesizes Personally Identifiable Information (PII).

NER is performed with [Presidio](https://github.com/microsoft/presidio) and with an [NLP model](https://huggingface.co/dslim/bert-base-NER) available on HuggingFace, while PII generation is based on [Faker](https://faker.readthedocs.io/en/master/).

## Installation
You can install Nerpii with pip:

```shell
pip install nerpii
```
## Quickstart
### Named Entity Recognition
You can import the NamedEntityRecognizer using
```python
from nerpii.named_entity_recognizer import NamedEntityRecognizer
```
You can create a recognizer by passing either the path to a CSV file or a pandas DataFrame:

```python
lang = "en"  # language of the dataset: "en" (default) or "it"
recognizer = NamedEntityRecognizer('./csv_path.csv', lang)
```
The <strong>lang</strong> parameter sets the language of the dataset. The default value is <strong>en</strong> (English); <strong>it</strong> (Italian) is also supported.

Please note that if the dataset has a column holding full names (first and last name together, e.g. John Smith), you must split it into two columns called <strong>first_name</strong> and <strong>last_name</strong> with the function `split_name()` before creating a recognizer.

```python
from nerpii.named_entity_recognizer import split_name

name_of_column_to_split = "name"  # hypothetical name of the full-name column
df = split_name('./csv_path.csv', name_of_column_to_split)
```
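
To make the expected result concrete, here is a minimal, library-free sketch of what splitting a full-name column amounts to. It is not Nerpii's implementation of `split_name()`: the helper name `split_full_name`, the sample rows, and the rule of splitting on the first space are all assumptions made for illustration.

```python
def split_full_name(rows, column):
    """Return new rows with `column` replaced by first_name and last_name.

    Assumption: the first space-separated token is the first name and
    everything after it is the last name.
    """
    result = []
    for row in rows:
        first, _, last = row[column].strip().partition(" ")
        # Copy every other column unchanged and drop the full-name column.
        new_row = {key: value for key, value in row.items() if key != column}
        new_row["first_name"] = first
        new_row["last_name"] = last.strip()
        result.append(new_row)
    return result


# Hypothetical sample data
rows = [
    {"name": "John Smith", "city": "Boston"},
    {"name": "Maria De Luca", "city": "Turin"},
]
split_rows = split_full_name(rows, "name")
print(split_rows)
```

Splitting on the first space keeps multi-word surnames together, but it will mis-handle multi-word first names, so real data may need a more careful rule.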
The NamedEntityRecognizer class provides three methods to perform NER on a dataset:

```python
recognizer.assign_entities_with_presidio()
```
which assigns the Presidio entities listed [here](https://microsoft.github.io/presidio/supported_entities/),

```python
recognizer.assign_entities_manually()
```
which manually assigns the ZIPCODE and CREDIT_CARD_NUMBER entities, and

```python
recognizer.assign_organization_entity_with_model()
```
which assigns the ORGANIZATION entity using an [NLP model](https://huggingface.co/dslim/bert-base-NER) available on HuggingFace.

To perform NER, run these three methods in the following order:

```python
recognizer.assign_entities_with_presidio()
recognizer.assign_entities_manually()
recognizer.assign_organization_entity_with_model()
```

The final output is a dictionary whose keys are the column names and whose values hold the assigned entity and its confidence score.

This dictionary can be accessed using

```python
recognizer.dict_global_entities
```
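
The steps above can be wrapped in a small helper, and the resulting dictionary filtered by confidence. This is only a sketch: `run_ner` and `confident_columns` are hypothetical helper names, and the value layout (`"entity"` and `"confidence_score"` keys) is an assumption about the dictionary, so check the actual content of `recognizer.dict_global_entities` before relying on it.

```python
def run_ner(recognizer):
    """Run the three assignment steps in the documented order."""
    recognizer.assign_entities_with_presidio()
    recognizer.assign_entities_manually()
    recognizer.assign_organization_entity_with_model()
    return recognizer.dict_global_entities


def confident_columns(entities, threshold=0.8):
    """Map column name -> entity for columns at or above `threshold`.

    Assumption: each value looks like
    {"entity": "<ENTITY>", "confidence_score": <float>}.
    """
    return {
        column: info["entity"]
        for column, info in entities.items()
        if info["confidence_score"] >= threshold
    }


# Hypothetical output, shaped as assumed above
example_entities = {
    "first_name": {"entity": "PERSON", "confidence_score": 0.95},
    "zip": {"entity": "ZIPCODE", "confidence_score": 0.60},
}
selected = confident_columns(example_entities, threshold=0.8)
print(selected)
```

With a real recognizer, `confident_columns(run_ner(recognizer))` would chain the two helpers, provided the assumed dictionary layout holds.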

### PII generation 

After performing NER on a dataset, you can generate new PII using Faker. 

You can import the FakerGenerator using 

```python
from nerpii.faker_generator import FakerGenerator
```

You can create a generator using

```python
generator = FakerGenerator(dataset, recognizer.dict_global_entities)
```
If you want to generate Italian PII, pass ```lang = "it"``` as a parameter when creating the generator (default: ```lang = "en"```).

To generate new PII you can run

```python
generator.get_faker_generation()
```
The method above can generate the following PII:
* address
* phone number
* email address
* first name
* last name
* city
* state
* url
* zipcode
* credit card
* ssn
* country
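
Conceptually, generation replaces the values of each recognized column with freshly generated ones. The sketch below illustrates that idea without depending on Nerpii or Faker. It is not how `get_faker_generation()` is implemented: the function `replace_pii`, the simplified column-to-entity mapping, and the stand-in generators are all assumptions for illustration.

```python
import random


def replace_pii(rows, entities, generators):
    """Return copies of `rows` with recognized PII columns regenerated.

    `entities` maps column name -> entity label; `generators` maps
    entity label -> zero-argument callable producing a fake value.
    Columns without a matching generator are left untouched.
    """
    new_rows = []
    for row in rows:
        new_row = dict(row)  # copy so the input rows are not modified
        for column, entity in entities.items():
            make_value = generators.get(entity)
            if column in new_row and make_value is not None:
                new_row[column] = make_value()
        new_rows.append(new_row)
    return new_rows


# Stand-in generators; with Faker installed one could pass, for example,
# {"EMAIL_ADDRESS": Faker().email, "ZIPCODE": Faker().zipcode} instead.
rng = random.Random(0)
generators = {
    "EMAIL_ADDRESS": lambda: f"user{rng.randint(1000, 9999)}@example.com",
    "ZIPCODE": lambda: f"{rng.randint(0, 99999):05d}",
}

# Hypothetical sample data and entity assignment
rows = [{"email": "john@corp.com", "zip": "02108", "plan": "basic"}]
entities = {"email": "EMAIL_ADDRESS", "zip": "ZIPCODE"}
print(replace_pii(rows, entities, generators))
```

Passing the generators in as plain callables keeps the replacement logic independent of the library that produces the fake values.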

## Examples

You can find an example notebook in the [notebooks](https://github.com/Clearbox-AI/nerpii/tree/main/notebooks) folder.



            

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/Clearbox-AI/nerpii",
    "name": "nerpii",
    "maintainer": null,
    "docs_url": null,
    "requires_python": "<4.0,>=3.9",
    "maintainer_email": null,
    "keywords": null,
    "author": "Clearbox AI",
    "author_email": "info@clearbox.ai",
    "download_url": "https://files.pythonhosted.org/packages/7a/23/99d0ba7152419daeda652596723136ad38cc6bd0f167c1bb334b9736f3c9/nerpii-0.2.3.tar.gz",
    "platform": null,
    "description": "# Nerpii \nNerpii is a Python library developed to perform Named Entity Recognition (NER) on structured datasets and synthesize Personal Identifiable Information (PII).\n\nNER is performed with [Presidio](https://github.com/microsoft/presidio) and with a [NLP model](https://huggingface.co/dslim/bert-base-NER) available on HuggingFace, while the PII generation is based on [Faker](https://faker.readthedocs.io/en/master/).\n\n## Installation\nYou can install Nerpii by using pip: \n\n```python\npip install nerpii\n```\n## Quickstart\n### Named Entity Recognition\nYou can import the NamedEntityRecognizer using\n```python\nfrom nerpii.named_entity_recognizer import NamedEntityRecognizer\n```\nYou can create a recognizer passing as parameter a path to a csv file or a Pandas Dataframe\n\n```python\nrecognizer = NamedEntityRecognizer('./csv_path.csv', lang)\n```\nThe <strong>lang</strong> parameter is used to define the language of the dataset. The deafult value is <strong>en</strong> (english), but it can be also selelcted <strong>it</strong> (italian).\n\nPlease note that if there are columns in the dataset containing names of people consisting of first and last names (e.g. 
John Smith), before creating a recognizer, it is necessary to split the name into two different columns called <strong>first_name</strong> and <strong>last_name</strong> using the function `split_name()`.\n\n```python\nfrom nerpii.named_entity_recognizer import split_name\n\ndf = split_name('./csv_path.csv', name_of_column_to_split)\n```\nThe NamedEntityRecognizer class contains three methods to perform NER on a dataset:\n\n```python\nrecognizer.assign_entities_with_presidio()\n```\nwhich assigns Presidio entities, listed [here](https://microsoft.github.io/presidio/supported_entities/)\n\n```python\nrecognizer.assign_entities_manually()\n```\nwhich assigns manually ZIPCODE and CREDIT_CARD_NUMBER entities \n\n```python\nrecognizer.assign_organization_entity_with_model()\n```\nwhich assigns ORGANIZATION entity using a [NLP model](https://huggingface.co/dslim/bert-base-NER) available on HuggingFace.\n\nTo perform NER, you have to run these three methods sequentially, as reported below:\n\n```python\nrecognizer.assign_entities_with_presidio()\nrecognizer.assign_entities_manually()\nrecognizer.assign_organization_entity_with_model()\n```\n\nThe final output is a dictionary in which column names are given as keys and assigned entities and a confidence score as values.\n\nThis dictionary can be accessed using\n\n```python\nrecognizer.dict_global_entities\n```\n\n### PII generation \n\nAfter performing NER on a dataset, you can generate new PII using Faker. 
\n\nYou can import the FakerGenerator using \n\n```python\nfrom nerpii.faker_generator import FakerGenerator\n```\n\nYou can create a generator using\n\n```python\ngenerator = FakerGenerator(dataset, recognizer.dict_global_entities)\n```\nIf you want to generate Italian PII, add ```lang = \"it\"``` as parameter to the previous object (default: ```lang = \"en\"```)\n\nTo generate new PII you can run\n\n```python\ngenerator.get_faker_generation()\n```\nThe method above can generate the following PII:\n* address\n* phone number\n* email naddress\n* first name\n* last name\n* city\n* state\n* url\n* zipcode\n* credit card\n* ssn\n* country\n\n## Examples\n\nYou can find a notebook example in the [notebook](https://github.com/Clearbox-AI/nerpii/tree/main/notebooks) folder. \n\n\n",
    "bugtrack_url": null,
    "license": "GPL",
    "summary": "A python library to perform NER on structured data and generate PII with Faker",
    "version": "0.2.3",
    "project_urls": {
        "Homepage": "https://github.com/Clearbox-AI/nerpii",
        "Repository": "https://github.com/Clearbox-AI/nerpii"
    },
    "split_keywords": [],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "437d070b58f56bc2c5cfdbfecc725dd0323fb935460ddc1a09c65ab798982c03",
                "md5": "e6dd35154349542110d914313b72d934",
                "sha256": "4061b8b6204d9e3e0230e573392e5533634f9b7d89f6bdcf060f391dea17c745"
            },
            "downloads": -1,
            "filename": "nerpii-0.2.3-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "e6dd35154349542110d914313b72d934",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": "<4.0,>=3.9",
            "size": 9783,
            "upload_time": "2024-05-03T10:23:28",
            "upload_time_iso_8601": "2024-05-03T10:23:28.801589Z",
            "url": "https://files.pythonhosted.org/packages/43/7d/070b58f56bc2c5cfdbfecc725dd0323fb935460ddc1a09c65ab798982c03/nerpii-0.2.3-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "7a2399d0ba7152419daeda652596723136ad38cc6bd0f167c1bb334b9736f3c9",
                "md5": "77ce2bd96623fa7d0840862a3987820e",
                "sha256": "11b9b8ab98d7939abfc790425de0fa614f77e38fa7107a243ed55d7b2ea2bb59"
            },
            "downloads": -1,
            "filename": "nerpii-0.2.3.tar.gz",
            "has_sig": false,
            "md5_digest": "77ce2bd96623fa7d0840862a3987820e",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": "<4.0,>=3.9",
            "size": 9503,
            "upload_time": "2024-05-03T10:23:30",
            "upload_time_iso_8601": "2024-05-03T10:23:30.416559Z",
            "url": "https://files.pythonhosted.org/packages/7a/23/99d0ba7152419daeda652596723136ad38cc6bd0f167c1bb334b9736f3c9/nerpii-0.2.3.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-05-03 10:23:30",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "Clearbox-AI",
    "github_project": "nerpii",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "lcname": "nerpii"
}
        