# Pii Extractor plugin: Presidio
This repository builds a Python package that installs a pii-extract-base
plugin to perform PII detection for text data using the Microsoft Presidio
Python library.
The name of the plugin entry point is `piisa-detectors-presidio`.
## Requirements
The package needs:
* at least Python 3.8
* the pii-data and the pii-extract-base base packages
* the presidio-analyzer package
* an NLP engine model for the desired language
## Installation
* Install the package: `pip install pii-extract-plg-presidio` (it will
automatically install its dependencies, including `presidio-analyzer`)
* Download the recognition model for the desired language(s), as instructed by
  the presidio-analyzer installation instructions. The default plugin
  configuration file defines three spaCy models:
  - English model: `python -m spacy download en_core_web_lg`
  - Spanish model: `python -m spacy download es_core_news_md`
  - Italian model: `python -m spacy download it_core_news_md`
* For additional information on model specification, see customizing NLP
models in the Presidio documentation. If custom models are used, the
`nlp_config` element in the plugin configuration file must be
adjusted accordingly.
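
Before running detection, it can help to confirm that the models are actually installed. The following is a minimal sketch, not part of the plugin: it relies on the fact that spaCy models are installed as importable Python packages, and it assumes the language codes `en`, `es` and `it` for the three default models. The `missing_models` helper and the `DEFAULT_MODELS` mapping are hypothetical names introduced for illustration.

```python
from importlib.util import find_spec

# Models named in the default plugin configuration.
# The language-code keys are an assumption made for this sketch.
DEFAULT_MODELS = {
    "en": "en_core_web_lg",
    "es": "es_core_news_md",
    "it": "it_core_news_md",
}


def missing_models(languages, models=DEFAULT_MODELS):
    """Return the model package names for `languages` that cannot be imported."""
    return [models[lang] for lang in languages if find_spec(models[lang]) is None]


if __name__ == "__main__":
    for name in missing_models(["en", "es", "it"]):
        print(f"missing model, install with: python -m spacy download {name}")
```

If custom models are configured through `nlp_config`, the mapping passed to the helper would need to list those model names instead.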
## Usage
The package does not have any user-facing entry points (except for one console
information script, see below). Instead, upon installation it
defines a plugin entry point. This plugin is automatically picked up by the
scripts and classes in pii-extract-base, and thus its functionality is exposed
to them.
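
To check that the entry point has been registered after installation, the standard library can list the entry points installed in the current environment. This is only a sketch: this README does not state the entry point group, so the hypothetical `find_entry_points` helper below scans all groups and matches on the name alone.

```python
from importlib.metadata import entry_points


def find_entry_points(name):
    """Return all installed entry points called `name`, across every group."""
    eps = entry_points()
    if hasattr(eps, "select"):
        # Python 3.10+: entry points can be filtered directly by name
        return list(eps.select(name=name))
    # Python 3.8/3.9: a mapping of group name -> sequence of entry points
    return [ep for group in eps.values() for ep in group if ep.name == name]


if __name__ == "__main__":
    for ep in find_entry_points("piisa-detectors-presidio"):
        print(ep.group, ep.name, ep.value)
```

An empty result would suggest that the package is not installed in the active environment.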
Runtime behaviour is governed by a configuration file, which determines which
Presidio recognizers are instantiated and used. The configuration also defines
which languages are available for detection, although the plugin can be
initialized with a _subset_ of those languages.
The task created from the plugin is a standard PII task object, using the
`pii_extract.build.task.MultiPiiTask` class definition. Like all PII task
objects, it is called with a `DocumentChunk` object containing the data to
analyze. The chunk **must** contain a language specification in its metadata,
so that Presidio knows which language to use. The only exception is a plugin
task built with *only one* language: in that case, a chunk without a language
specification is processed using that single language.
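
The language selection rule described above can be sketched as a small function. This is an illustration of the rule, not the plugin's code: `resolve_language` is a hypothetical name, and raising an error for a chunk language outside the configured set is an assumption, since the text does not describe that case.

```python
def resolve_language(chunk_lang, task_langs):
    """Pick the language to use for a chunk.

    chunk_lang: language found in the chunk metadata, or None if absent
    task_langs: languages the plugin task was built with
    """
    task_langs = list(task_langs)
    if chunk_lang is not None:
        # Assumption: a language the task was not built with is an error
        if chunk_lang not in task_langs:
            raise ValueError(f"language not available in task: {chunk_lang}")
        return chunk_lang
    if len(task_langs) == 1:
        # Single-language task: fall back to that language
        return task_langs[0]
    raise ValueError("chunk has no language and the task has several")


if __name__ == "__main__":
    print(resolve_language("es", ["en", "es"]))
    print(resolve_language(None, ["en"]))
```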
## Info script
`pii-extract-presidio-info` is a command-line script which provides
information about the plugin capabilities:
* `version`: installed package versions
* `presidio-recognizers`: list of recognizers in Presidio
* `presidio-entities`: the total list of entities Presidio can generate
* `pii-entities`: the PIISA tasks that this plugin will create, by translating
from the entities detected by Presidio (this depends on the PIISA config
used)
## Building
The provided Makefile can be used to build, test and install the package:
* `make pkg` will build the Python package, creating a file that can be
installed with `pip`
* `make unit` will launch all unit tests (using pytest, so pytest must be
available)
* `make install` will install the package in a Python virtualenv. The
  virtualenv is chosen in this order:
  - the one defined in the `VENV` environment variable, if it is defined
  - otherwise, the virtualenv currently activated in the shell, if any
  - otherwise, the default `/opt/venv/pii` (it will be created if it does not
    exist)
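
The selection order for the virtualenv can be expressed in a few lines. The sketch below is not taken from the Makefile: `choose_venv` is a hypothetical helper, it assumes that an activated virtualenv is detected through the `VIRTUAL_ENV` environment variable (which standard activation scripts set), and it treats an empty variable as undefined.

```python
import os

DEFAULT_VENV = "/opt/venv/pii"


def choose_venv(environ=None):
    """Return the virtualenv path following the order described above."""
    if environ is None:
        environ = os.environ
    # 1. explicit VENV variable, 2. activated virtualenv, 3. default path
    return environ.get("VENV") or environ.get("VIRTUAL_ENV") or DEFAULT_VENV


if __name__ == "__main__":
    print(choose_venv())
```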
## Raw data
{
"_id": null,
"home_page": "https://github.com/piisa/pii-extract-plg-presidio",
"name": "pii-extract-plg-presidio",
"maintainer": "",
"docs_url": null,
"requires_python": ">=3.8",
"maintainer_email": "",
"keywords": "PIISA, PII",
"author": "Paulo Villegas",
"author_email": "paulo.vllgs@gmail.com",
"download_url": "https://files.pythonhosted.org/packages/09/73/13aa7636a3130d1fea064f296879287e1f206e896b0f26ce0dac7674879a/pii-extract-plg-presidio-0.3.3.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "Apache",
"summary": "Presidio plugin for PII detection",
"version": "0.3.3",
"project_urls": {
"Download": "https://github.com/piisa/pii-extract-plg-presidio/tarball/v0.3.3",
"Homepage": "https://github.com/piisa/pii-extract-plg-presidio"
},
"split_keywords": [
"piisa",
" pii"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "097313aa7636a3130d1fea064f296879287e1f206e896b0f26ce0dac7674879a",
"md5": "55d0938f43f2ba3e5e628b1c3163423c",
"sha256": "8fb5cc3b7df12881c6c7e5dc3e09eb9ad0ce37559eaa37954b928d8db4b43b57"
},
"downloads": -1,
"filename": "pii-extract-plg-presidio-0.3.3.tar.gz",
"has_sig": false,
"md5_digest": "55d0938f43f2ba3e5e628b1c3163423c",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.8",
"size": 17748,
"upload_time": "2023-10-30T21:04:41",
"upload_time_iso_8601": "2023-10-30T21:04:41.796555Z",
"url": "https://files.pythonhosted.org/packages/09/73/13aa7636a3130d1fea064f296879287e1f206e896b0f26ce0dac7674879a/pii-extract-plg-presidio-0.3.3.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2023-10-30 21:04:41",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "piisa",
"github_project": "pii-extract-plg-presidio",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"requirements": [],
"lcname": "pii-extract-plg-presidio"
}