hufr


| Field | Value |
| --- | --- |
| Name | hufr |
| Version | 2.0.1 |
| Home page | https://github.com/robertsonwang/hufr |
| Summary | Redact Text with HuggingFace Models |
| Upload time | 2024-01-30 07:42:07 |
| Author | Robertson Wang |
| Requires Python | >=3.9,<4.0 |
| License | Apache 2.0 |
| Keywords | huggingface, pii, ner, onnx, nlp, redactions |

# 🤗 Redactions

[HuggingFace Redactions](https://github.com/robertsonwang/hufr) (`hufr`) redacts personally identifiable information (PII) from free text using pretrained language models from the HuggingFace model repository. The package wraps token classification models to streamline this redaction. This project is not associated with the official HuggingFace organization; it is a fun side project by an individual contributor.

# Installation

To install this package, run `pip install hufr`. The package requires Python 3.9 or later (and below 4.0).

# Usage

The snippet below loads a specific token classification model from the HuggingFace model zoo and uses it to redact person names:

```python
from hufr.models import TokenClassificationTransformer
from hufr.redact import redact_text

# Load the model and its tokenizer from the same checkpoint.
model_path = "dslim/bert-base-NER"
model = TokenClassificationTransformer(
    model=model_path,
    tokenizer=model_path
)

text = "Hello! My name is Rob"
# Replace every word predicted as a person ('PER') with '<PERSON>'.
redact_text(
    text,
    redaction_map={'PER': '<PERSON>'},
    model=model
)

# > "Hello! My name is <PERSON>"
```

If you don't want to instantiate and supply a specific token classification model, you can rely on the package defaults for a quick redaction:

```python
from hufr.redact import redact_text

text = "Hello! My name is Rob"
redact_text(text)
```

To get the predicted entity for each word in the original text:

```python
from hufr.redact import redact_text

text = "Hello! My name is Rob"
redact_text(text, return_preds=True)

# > "Hello! My name is <PERSON>", ['O', 'O', 'O', 'O', 'PER']
```
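The per-word predictions can be paired back with the original words, for instance to audit what was redacted. The sketch below assumes the predictions align one-to-one with whitespace-separated words, which matches the example above but is not a documented guarantee of the package. `flagged_words` is a hypothetical helper, and the `preds` list is copied from the example output rather than produced by a model.

```python
def flagged_words(text, preds, outside_label="O"):
    """Return (word, label) pairs for every word not tagged with the outside label.

    Assumes one prediction per whitespace-separated word. This alignment is an
    assumption made for illustration, not documented hufr behavior.
    """
    words = text.split()
    if len(words) != len(preds):
        raise ValueError(
            f"expected one prediction per word, got {len(words)} words "
            f"and {len(preds)} predictions"
        )
    return [(word, label) for word, label in zip(words, preds) if label != outside_label]


text = "Hello! My name is Rob"
preds = ['O', 'O', 'O', 'O', 'PER']  # copied from the example output above
print(flagged_words(text, preds))
# > [('Rob', 'PER')]
```

Raising on a length mismatch keeps a silent misalignment from pairing words with the wrong labels.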

By default, personally identifiable information is predicted by the [dslim/bert-base-NER](https://huggingface.co/dslim/bert-base-NER) model, and the predicted entities are mapped to redactions using the following mapping table:

```python
{
    'PER': '<PERSON>',
    'MIS': '<OTHER>',
    'ORG': '<ORGANIZATION>',
    'LOC': '<LOCATION>'
}
```
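
To make the role of the mapping table concrete, here is a minimal sketch of how a redaction map could be applied to per-word predictions. It is an illustration under stated assumptions, not the package's actual implementation: `apply_redaction_map` and `DEFAULT_REDACTION_MAP` are hypothetical names, words are assumed to be whitespace-separated with one label each, and labels missing from the map leave the word untouched.

```python
# Copy of the mapping table above; the constant name is hypothetical.
DEFAULT_REDACTION_MAP = {
    'PER': '<PERSON>',
    'MIS': '<OTHER>',
    'ORG': '<ORGANIZATION>',
    'LOC': '<LOCATION>',
}


def apply_redaction_map(text, preds, redaction_map=None):
    """Replace each word whose predicted label appears in the redaction map.

    Illustrative only: assumes one label per whitespace-separated word.
    Joining with single spaces also normalizes the original whitespace.
    """
    if redaction_map is None:
        redaction_map = DEFAULT_REDACTION_MAP
    words = text.split()
    if len(words) != len(preds):
        raise ValueError("expected exactly one prediction per word")
    # Keep the word itself when its label has no entry in the map.
    redacted = [redaction_map.get(label, word) for word, label in zip(words, preds)]
    return " ".join(redacted)


text = "Rob works at Acme in Paris"
preds = ['PER', 'O', 'O', 'ORG', 'O', 'LOC']  # hand-written labels, not model output

print(apply_redaction_map(text, preds))
# > <PERSON> works at <ORGANIZATION> in <LOCATION>

print(apply_redaction_map(text, preds, redaction_map={'PER': '<PERSON>'}))
# > <PERSON> works at Acme in Paris
```

Passing a smaller map, as in the second call, restricts redaction to the listed entity types, which is the effect of the `redaction_map={'PER': '<PERSON>'}` argument in the first usage example.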


            

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/robertsonwang/hufr",
    "name": "hufr",
    "maintainer": "",
    "docs_url": null,
    "requires_python": ">=3.9,<4.0",
    "maintainer_email": "",
    "keywords": "huggingface,pii,ner,ONNX,NLP,redactions",
    "author": "Robertson Wang",
    "author_email": "robertsonwang@gmail.com",
    "download_url": "https://files.pythonhosted.org/packages/a6/15/dbd64cac250f4575c069481c98a839ebf83bc271f9140c27e645a0cad477/hufr-2.0.1.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "Apache 2.0",
    "summary": "Redact Text with HuggingFace Models",
    "version": "2.0.1",
    "project_urls": {
        "Homepage": "https://github.com/robertsonwang/hufr",
        "Repository": "https://github.com/robertsonwang/hufr"
    },
    "split_keywords": [
        "huggingface",
        "pii",
        "ner",
        "onnx",
        "nlp",
        "redactions"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "cf855cc65e9777dc6e80261f638d163709455ff4a55a136e2407f3aaa5ec25a4",
                "md5": "37a0c40e67704ee364c6c8abc9c8922c",
                "sha256": "b90b52a1d14063eb97186ac4522a628a0cafe23283762d7de8fb4fd5cd870936"
            },
            "downloads": -1,
            "filename": "hufr-2.0.1-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "37a0c40e67704ee364c6c8abc9c8922c",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.9,<4.0",
            "size": 14019,
            "upload_time": "2024-01-30T07:42:06",
            "upload_time_iso_8601": "2024-01-30T07:42:06.546624Z",
            "url": "https://files.pythonhosted.org/packages/cf/85/5cc65e9777dc6e80261f638d163709455ff4a55a136e2407f3aaa5ec25a4/hufr-2.0.1-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "a615dbd64cac250f4575c069481c98a839ebf83bc271f9140c27e645a0cad477",
                "md5": "c9e7b523e602b3f25122c8abd770ed7c",
                "sha256": "ac4b1a781db5bce0446162ba0bd94cd8cf9a4e54cdcfdd4e5a72260c689372a5"
            },
            "downloads": -1,
            "filename": "hufr-2.0.1.tar.gz",
            "has_sig": false,
            "md5_digest": "c9e7b523e602b3f25122c8abd770ed7c",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.9,<4.0",
            "size": 11114,
            "upload_time": "2024-01-30T07:42:07",
            "upload_time_iso_8601": "2024-01-30T07:42:07.707151Z",
            "url": "https://files.pythonhosted.org/packages/a6/15/dbd64cac250f4575c069481c98a839ebf83bc271f9140c27e645a0cad477/hufr-2.0.1.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-01-30 07:42:07",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "robertsonwang",
    "github_project": "hufr",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "lcname": "hufr"
}
        