# docdeid
[![tests](https://github.com/vmenger/docdeid/actions/workflows/test.yml/badge.svg)](https://github.com/vmenger/docdeid/actions/workflows/test.yml)
[![build](https://github.com/vmenger/docdeid/actions/workflows/build.yml/badge.svg)](https://github.com/vmenger/docdeid/actions/workflows/build.yml)
[![Documentation Status](https://readthedocs.org/projects/docdeid/badge/?version=latest)](https://docdeid.readthedocs.io/en/latest/)
[![pypi version](https://img.shields.io/pypi/v/docdeid)](https://pypi.org/project/docdeid/)
[![python versions](https://img.shields.io/pypi/pyversions/docdeid)](https://pypi.org/project/docdeid/)
[![license](https://img.shields.io/github/license/vmenger/docdeid)](https://github.com/vmenger/docdeid/blob/main/LICENSE.md)
[![black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)
[Installation](#installation) - [Getting started](#getting-started) - [Features](#features) - [Documentation](#documentation) - [Development and contributing](#development-and-contributing) - [Authors](#authors) - [License](#license)
<!-- start include in docs -->
Create your own document de-identifier using `docdeid`, a simple framework independent of language or domain.
> Note that `docdeid` is still on version 0.x.x, and breaking changes might occur. If you plan to do extensive work involving `docdeid`, feel free to get in touch to coordinate.
## Installation
Grab the latest version from PyPI:
```bash
pip install docdeid
```
## Getting started
```python
import re

from docdeid import DocDeid
from docdeid.tokenize import WordBoundaryTokenizer
from docdeid.process import SingleTokenLookupAnnotator, RegexpAnnotator, SimpleRedactor

deidentifier = DocDeid()

# Register the default tokenizer.
deidentifier.tokenizers["default"] = WordBoundaryTokenizer()

# Tag tokens that appear in a fixed list of names.
deidentifier.processors.add_processor(
    "name_lookup",
    SingleTokenLookupAnnotator(lookup_values=["John", "Mary"], tag="name"),
)

# Tag any capitalized word as a name.
deidentifier.processors.add_processor(
    "name_regexp",
    RegexpAnnotator(regexp_pattern=re.compile(r"[A-Z]\w+"), tag="name"),
)

# Replace the annotated spans with placeholders.
deidentifier.processors.add_processor(
    "redactor",
    SimpleRedactor(),
)

text = "John loves Mary, but Mary loves William."
doc = deidentifier.deidentify(text)
```
Find the relevant info in the `Document` object:
```python
print(doc.annotations)

AnnotationSet({
    Annotation(text='John', start_char=0, end_char=4, tag='name', length=4),
    Annotation(text='Mary', start_char=11, end_char=15, tag='name', length=4),
    Annotation(text='Mary', start_char=21, end_char=25, tag='name', length=4),
    Annotation(text='William', start_char=32, end_char=39, tag='name', length=7)
})
```
```python
print(doc.deidentified_text)
'[NAME-1] loves [NAME-2], but [NAME-2] loves [NAME-3].'
```
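To make the numbering in the output above concrete, here is a minimal standard-library sketch of the same idea: collect name spans by lookup and by regular expression, then replace each span with a numbered placeholder so that repeated names share a number. This is only an illustration and not how `docdeid` is implemented; the helper names `find_names` and `redact` are hypothetical and are not part of the `docdeid` API.

```python
import re


def find_names(text, lookup_values, pattern):
    """Return sorted, de-duplicated (start, end, matched_text) spans."""
    spans = set()
    # Lookup step: keep words that appear in the fixed list of names.
    for match in re.finditer(r"\w+", text):
        if match.group() in lookup_values:
            spans.add((match.start(), match.end(), match.group()))
    # Pattern step: keep everything the regular expression matches.
    for match in pattern.finditer(text):
        spans.add((match.start(), match.end(), match.group()))
    return sorted(spans)


def redact(text, spans, tag="name"):
    """Replace each span with [TAG-n]; identical texts get the same n.

    Assumes the spans are sorted and do not partially overlap.
    """
    numbering = {}
    pieces = []
    cursor = 0
    for start, end, matched in spans:
        number = numbering.setdefault(matched, len(numbering) + 1)
        pieces.append(text[cursor:start])
        pieces.append(f"[{tag.upper()}-{number}]")
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)


text = "John loves Mary, but Mary loves William."
spans = find_names(text, {"John", "Mary"}, re.compile(r"[A-Z]\w+"))
redacted = redact(text, spans)
print(redacted)
```

Both steps find "John" and "Mary", so the spans are stored in a set to avoid duplicates before redaction.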
## Features
Additionally, `docdeid` features:
- Ability to create your own `Annotator`, `AnnotationProcessor`, `Redactor` and `Tokenizer` components
- Some basic re-usable components included (e.g. regexp, token lookup, token patterns)
- Callable from one interface (`DocDeid.deidentify()`)
- String processing and filtering
- Fast lookup based on sets or tries
- Anything you add! PRs welcome.
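
As a rough illustration of why a trie helps with lookups, the sketch below matches multi-token phrases against a token list and always prefers the longest phrase. It is a plain-Python sketch under stated assumptions, not the data structure shipped with `docdeid`; the `TokenTrie` class and its methods are hypothetical names chosen for this example. For single tokens, a plain `set` membership test is usually enough.

```python
class TokenTrie:
    """A trie keyed on whole tokens rather than characters."""

    def __init__(self):
        self.children = {}
        self.is_terminal = False

    def add(self, tokens):
        """Store one phrase, given as a list of tokens."""
        node = self
        for token in tokens:
            node = node.children.setdefault(token, TokenTrie())
        node.is_terminal = True

    def longest_match(self, tokens, start):
        """Return the length of the longest stored phrase at tokens[start:]."""
        node = self
        best = 0
        for offset, token in enumerate(tokens[start:], start=1):
            node = node.children.get(token)
            if node is None:
                break
            if node.is_terminal:
                best = offset
        return best


trie = TokenTrie()
trie.add(["New", "York"])
trie.add(["New", "York", "City"])
trie.add(["Amsterdam"])

tokens = ["She", "moved", "from", "Amsterdam", "to", "New", "York", "City"]
matches = []
i = 0
while i < len(tokens):
    length = trie.longest_match(tokens, i)
    if length:
        matches.append(" ".join(tokens[i:i + length]))
        i += length  # skip past the matched phrase
    else:
        i += 1
print(matches)  # ['Amsterdam', 'New York City']
```

Each lookup walks at most as many nodes as the longest stored phrase, regardless of how many phrases the trie holds.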
For a more in-depth tutorial, see: [docs/tutorial](https://docdeid.readthedocs.io/en/latest/tutorial.html)
<!-- end include in docs -->
## Documentation
For full documentation and API, see: [https://docdeid.readthedocs.io/en/latest/](https://docdeid.readthedocs.io/en/latest/)
## Development and contributing
For setting up a development environment, see: [docs/environment](https://docdeid.readthedocs.io/en/latest/environment.html)
For contributing, see: [docs/contributing](https://docdeid.readthedocs.io/en/latest/contributing.html)
## Authors
Vincent Menger - *Author, maintainer*
## License
This project is licensed under the MIT license - see the [LICENSE.md](LICENSE.md) file for details.
Raw data
{
"_id": null,
"home_page": "",
"name": "docdeid",
"maintainer": "",
"docs_url": null,
"requires_python": ">=3.9,<4.0",
"maintainer_email": "",
"keywords": "python,document de-identification,de-identification,document de-identifier,de-identifier",
"author": "Vincent Menger",
"author_email": "vmenger@protonmail.com",
"download_url": "https://files.pythonhosted.org/packages/00/1e/a725d1d012bcc14dd671c6e456f12cadc9c37cd6ca11ffd0d02bcaaa6a58/docdeid-1.0.0.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "MIT",
"summary": "Create your own document de-identifier using docdeid, a simple framework independent of language or domain.",
"version": "1.0.0",
"project_urls": null,
"split_keywords": [
"python",
"document de-identification",
"de-identification",
"document de-identifier",
"de-identifier"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "1f3e33aec857ccd7739c2b8f8744384bf450f3ca6c498eadcacd4a79e2acc6b8",
"md5": "5ee59fff68ea632243e5891585f4dc23",
"sha256": "d5d93ec3fbd8557a9cd41b56ec3774bc3a86575d8dc6a3becd486cdf2190993b"
},
"downloads": -1,
"filename": "docdeid-1.0.0-py3-none-any.whl",
"has_sig": false,
"md5_digest": "5ee59fff68ea632243e5891585f4dc23",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.9,<4.0",
"size": 26263,
"upload_time": "2023-12-20T10:05:06",
"upload_time_iso_8601": "2023-12-20T10:05:06.586144Z",
"url": "https://files.pythonhosted.org/packages/1f/3e/33aec857ccd7739c2b8f8744384bf450f3ca6c498eadcacd4a79e2acc6b8/docdeid-1.0.0-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "001ea725d1d012bcc14dd671c6e456f12cadc9c37cd6ca11ffd0d02bcaaa6a58",
"md5": "59b825f349f551f2f2339a95a3dbe89c",
"sha256": "fea630e1dff140eb939c6474df8fcebe428c28c94eed5a5b9ae5c218205b0948"
},
"downloads": -1,
"filename": "docdeid-1.0.0.tar.gz",
"has_sig": false,
"md5_digest": "59b825f349f551f2f2339a95a3dbe89c",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.9,<4.0",
"size": 21179,
"upload_time": "2023-12-20T10:05:08",
"upload_time_iso_8601": "2023-12-20T10:05:08.503512Z",
"url": "https://files.pythonhosted.org/packages/00/1e/a725d1d012bcc14dd671c6e456f12cadc9c37cd6ca11ffd0d02bcaaa6a58/docdeid-1.0.0.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2023-12-20 10:05:08",
"github": false,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"lcname": "docdeid"
}