# PII Extractor plugin: Transformers
This repository builds a Python package that installs a pii-extract-base
plugin to perform PII detection for text data using the Hugging Face
Transformers Python library. It will download and use trained token
classification models running on that library.
The name of the plugin entry point is `piisa-detectors-transformers`.
## Requirements
The package needs
* at least Python 3.8
* the pii-data and the pii-extract-base base packages
* the Transformers package
* PyTorch, either GPU or CPU
* an NLP engine model for the desired language (will be downloaded on demand,
based on the configuration)
## Installation
* Install the package: `pip install pii-extract-plg-transformers` (it will
automatically install its dependencies, *except* PyTorch)
* Install PyTorch: either the CPU PyTorch package or the GPU package
appropriate for your GPU
* If necessary, define the cache directory for models (see
below)
### Cache directory
The Transformers library downloads models on the fly from the Hugging Face
Hub. It keeps them in a cache in a local folder, to avoid repeated downloads.
The `pii-extract-plg-transformers` package defines this local folder as
follows:
1. If the `HUGGINGFACE_HUB_CACHE` environment variable is defined, its value
is used
2. Else, if the configuration file for the package contains a `cachedir` field
inside the `task_config` section, the value of that field is used
3. As a special case of the previous rule, if that field contains a `false`
value, then **no** specific cache directory is defined (so the Hugging Face
internal default is used)
4. Else (no environment variable and no `cachedir` field), a default is chosen:
the `var/piisa/hf-cache` subfolder in the virtualenv that holds the package
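
The precedence rules above can be sketched in a few lines of Python. This is only an illustration of the described order, not the package's actual code: the function name `resolve_cache_dir` is hypothetical, and using `sys.prefix` to locate the virtualenv is an assumption.

```python
import os
import sys
from pathlib import Path
from typing import Mapping, Optional


def resolve_cache_dir(task_config: Mapping,
                      environ: Mapping = os.environ) -> Optional[str]:
    """Return the cache folder to use, or None to keep the Hugging Face default.

    Hypothetical helper: it mirrors the rules listed in this README only.
    """
    # Rule 1: the environment variable takes precedence
    env_value = environ.get("HUGGINGFACE_HUB_CACHE")
    if env_value:
        return env_value

    # Rules 2 and 3: the `cachedir` field of the `task_config` section
    cachedir = task_config.get("cachedir")
    if cachedir is False:
        return None              # explicit `false`: define no specific folder
    if cachedir:
        return str(cachedir)

    # Rule 4: default subfolder inside the virtualenv
    # (assumption: sys.prefix points to the virtualenv root)
    return str(Path(sys.prefix) / "var" / "piisa" / "hf-cache")
```

For instance, `resolve_cache_dir({"cachedir": False}, {})` returns `None`, which stands for "let the Hugging Face library pick its own default".
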
## Usage
The package does not have any user-facing entry points (except for two auxiliary
console scripts, see below). Instead, upon installation it
defines a plugin entry point. This plugin is automatically picked up by the
scripts and classes in pii-extract-base, and thus its functionality is exposed
to them.
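
To check that the plugin entry point is visible after installation, one option is to query the installed package metadata with the standard library. The sketch below uses only `importlib.metadata`; the helper name `find_entry_points` is hypothetical, and the entry point group is deliberately not assumed (the search is by name across all groups).

```python
from importlib import metadata


def find_entry_points(name):
    """Return (group, entry point) pairs whose entry point name matches `name`."""
    eps = metadata.entry_points()
    if hasattr(eps, "select"):
        # Python 3.10 and later: selectable interface
        return [(ep.group, ep) for ep in eps.select(name=name)]
    # Python 3.8 / 3.9: a mapping of group name -> entry points
    return [(group, ep)
            for group, items in eps.items()
            for ep in items
            if ep.name == name]


if __name__ == "__main__":
    # Name taken from this README; the list is empty if the plugin is not installed
    for group, ep in find_entry_points("piisa-detectors-transformers"):
        print(group, ep.name, ep.value)
```
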
The task created by the plugin is a standard PII task object, using the
`pii_extract.build.task.MultiPiiTask` class definition. Like all PII task
objects, it is called with a `DocumentChunk` object containing the data to
analyze. The chunk **must** contain a language specification in its metadata, so
that the plugin knows which language to use. The only exception is a plugin task
built with *only one* language: in that case, if the chunk does not contain
a language specification, the task uses that single language.
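
The language rule can be summarised with a small sketch. The function `select_language` is hypothetical and only restates the behaviour described above; in particular, raising an error when no language can be determined (or when the chunk language is not one the task was built for) is an assumption, since this README does not say what the plugin does in those cases.

```python
from typing import Iterable, Optional


def select_language(chunk_lang: Optional[str], task_langs: Iterable[str]) -> str:
    """Pick the language used to process a chunk (illustrative only)."""
    langs = list(task_langs)
    if chunk_lang:
        # Assumption: an unsupported chunk language is treated as an error
        if chunk_lang not in langs:
            raise ValueError(f"language not available in this task: {chunk_lang}")
        return chunk_lang
    if len(langs) == 1:
        # Single-language task: the chunk may omit its language
        return langs[0]
    # Assumption: a missing language with several candidates is an error
    raise ValueError("chunk has no language and the task handles several")
```
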
## Configuration
Runtime behaviour is governed by a PIISA configuration file, which sets up which
models from the Hugging Face Hub will be downloaded and used. The configuration
defines the total set of languages available for detection, but it is also
possible to initialize the plugin with a _subset_ of the configured languages.
The default configuration file defines detection for `Person` and `Location`
PII instances for English, Spanish and French, using the WikiNEuRal multilingual
NER model available in the Hugging Face Hub.
However, a configuration file can also define a different model per language, a
different set of PII to detect for each model, and a different aggregation
strategy to merge the output of each model. There is another example available.
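
To give an idea of what a model and an aggregation strategy mean at the Transformers level, here is a rough sketch that calls the library directly, outside of the plugin. Several things are assumptions: the Hub identifier `Babelscape/wikineural-multilingual-ner` for the WikiNEuRal model, the `PER`/`LOC` labels it emits, and the `LABEL_TO_PII` mapping and helper names, which are hypothetical and not taken from the plugin configuration.

```python
# Assumed mapping from model labels to the PII types named above
LABEL_TO_PII = {"PER": "Person", "LOC": "Location"}


def build_ner(model_id="Babelscape/wikineural-multilingual-ner", strategy="simple"):
    """Create a token classification pipeline (downloads the model on first use)."""
    from transformers import pipeline  # imported lazily: heavy dependency
    return pipeline("token-classification", model=model_id,
                    aggregation_strategy=strategy)


def to_pii(results, mapping=LABEL_TO_PII):
    """Keep only the aggregated entities that map to a wanted PII type."""
    return [
        {"type": mapping[r["entity_group"]], "value": r["word"],
         "start": r["start"], "end": r["end"]}
        for r in results
        if r["entity_group"] in mapping
    ]


if __name__ == "__main__":
    ner = build_ner()
    print(to_pii(ner("Marie Curie was born in Warsaw")))
```

With an aggregation strategy set, the pipeline merges sub-word tokens into whole entities and reports them under the `entity_group` key, which is what `to_pii` relies on. The plugin itself applies its own translation, driven by the PIISA configuration.
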
## Auxiliary scripts
### Information
`pii-extract-transformers-info` is a command-line script which provides
information about the plugin capabilities:
* `version`: installed package versions
* `models`: list of configured Transformers models
* `model-entities`: the total list of entities each configured model can
generate
* `pii-entities`: the PIISA tasks that this plugin will create, by translating
from the entities detected by the models (this depends on the PIISA config
used)
### Testing
`pii-extract-transformers-detect` is a command-line script to do initial
testing: it performs PII detection by processing a text chunk through one of the
models defined in the plugin configuration.
Note that this script instantiates the plugin task directly, i.e. it does *not*
go through the standard PIISA software stack (which would execute the task via
plugin loading into the pii-extract framework). For the same reason, it *only*
executes this detection task, ignoring any other pii-extract plugins that
might be available.
## Building
The provided Makefile can be used to build, test and install the package:
* `make pkg` will build the Python package, creating a file that can be
installed with `pip`
* `make unit` will launch all unit tests (using pytest, so pytest must be
available)
* `make install` will install the package in a Python virtualenv. The
  virtualenv is chosen in this order:
  - the one defined in the `VENV` environment variable, if it is defined
  - otherwise, the virtualenv currently activated in the shell, if there is one
  - otherwise, the default `/opt/venv/pii` (it will be created if it does not
    exist)
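
The virtualenv selection order can be expressed as a short Python sketch. It is not the Makefile logic itself: the function name `choose_venv` is hypothetical, and detecting an activated virtualenv through the `VIRTUAL_ENV` variable (which the standard `activate` scripts set) is an assumption about how the Makefile does it.

```python
import os
from typing import Mapping

DEFAULT_VENV = "/opt/venv/pii"  # default location stated in this README


def choose_venv(environ: Mapping = os.environ) -> str:
    """Return the virtualenv path that `make install` would target."""
    # 1. explicit VENV variable
    # 2. currently activated virtualenv (assumed: detected via VIRTUAL_ENV)
    # 3. default location
    return environ.get("VENV") or environ.get("VIRTUAL_ENV") or DEFAULT_VENV
```
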
Raw data
{
"_id": null,
"home_page": "https://github.com/piisa/pii-extract-plg-transformers",
"name": "pii-extract-plg-transformers",
"maintainer": "",
"docs_url": null,
"requires_python": ">=3.8",
"maintainer_email": "",
"keywords": "PIISA, PII",
"author": "Paulo Villegas",
"author_email": "paulo.vllgs@gmail.com",
"download_url": "https://files.pythonhosted.org/packages/e9/85/85d02074e9525022ccd85e2b99f56854564eb2823de5f54314aaa77e2769/pii-extract-plg-transformers-0.1.3.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "Apache",
"summary": "Transformers plugin for PII detection",
"version": "0.1.3",
"project_urls": {
"Download": "https://github.com/piisa/pii-extract-plg-transformers/tarball/v0.1.3",
"Homepage": "https://github.com/piisa/pii-extract-plg-transformers"
},
"split_keywords": [
"piisa",
" pii"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "e98585d02074e9525022ccd85e2b99f56854564eb2823de5f54314aaa77e2769",
"md5": "a1039ffe0446526f583ae2d45a8f19f0",
"sha256": "bbdffb097a6a31b0471ea87735e3b4527d16d9f7a1c9e76b210df9e3fff2fa15"
},
"downloads": -1,
"filename": "pii-extract-plg-transformers-0.1.3.tar.gz",
"has_sig": false,
"md5_digest": "a1039ffe0446526f583ae2d45a8f19f0",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.8",
"size": 19999,
"upload_time": "2024-01-21T22:21:25",
"upload_time_iso_8601": "2024-01-21T22:21:25.532750Z",
"url": "https://files.pythonhosted.org/packages/e9/85/85d02074e9525022ccd85e2b99f56854564eb2823de5f54314aaa77e2769/pii-extract-plg-transformers-0.1.3.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-01-21 22:21:25",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "piisa",
"github_project": "pii-extract-plg-transformers",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"requirements": [],
"lcname": "pii-extract-plg-transformers"
}