presidio-image-redactor

- Name: presidio-image-redactor
- Version: 0.0.52
- Home page: https://github.com/Microsoft/presidio
- Summary: Presidio image redactor package
- Upload time: 2024-03-29 13:48:13
- Requires Python: >=3.5
- License: MIT
- Keywords: presidio_image_redactor

# Presidio Image Redactor

***Please note: this package is still in alpha and is not production-ready.***

## Description

The Presidio Image Redactor is a Python-based module for detecting and redacting PII text entities in images.

## Deploy Presidio image redactor to Azure

Use the following button to deploy the Presidio image redactor to your Azure subscription.

[![Deploy to Azure](https://aka.ms/deploytoazurebutton)](https://portal.azure.com/#create/Microsoft.Template/uri/https%3A%2F%2Fraw.githubusercontent.com%2Fmicrosoft%2Fpresidio%2Fmain%2Fpresidio-image-redactor%2Fdeploytoazure.json)

Process for standard images:

![Image Redactor Design](../docs/assets/image-redactor-design.png)

Process for DICOM files:

![DICOM image Redactor Design](../docs/assets/dicom-image-redactor-design.png)

## Installation

Pre-requisites:

- Install [Tesseract OCR](https://github.com/tesseract-ocr/tesseract) by following the
  instructions on how to install it for your operating system.

  For best performance, please use the most up-to-date version of Tesseract OCR. Presidio was tested with **v5.2.0**.

### As package

To get started with Presidio-image-redactor, run the following:

```sh
pip install presidio-image-redactor
```

Once installed, run the following command to download the default spaCy model needed for
Presidio Analyzer:

```sh
python -m spacy download en_core_web_lg
```

## Getting started (standard image types)

The engine's `redact` method takes two parameters:

1. The image to redact.
2. The color fill to redact with. It can be either an int or a tuple such as `(0, 0, 0)`;
   the default is black.

```python
from PIL import Image
from presidio_image_redactor import ImageRedactorEngine

# Get the image to redact using PIL lib (pillow)
image = Image.open("presidio-image-redactor/tests/integration/resources/ocr_test.png")

# Initialize the engine
engine = ImageRedactorEngine()

# Redact the image with pink color
redacted_image = engine.redact(image, (255, 192, 203))

# save the redacted image 
redacted_image.save("new_image.png")
# uncomment to open the image for viewing
# redacted_image.show()
```
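Because the fill can be either an int or a tuple, it can be convenient to validate it before calling the engine. The following is a minimal sketch, not part of Presidio: `normalize_fill` and `redact_file` are hypothetical helpers, and the 0 to 255 range check is an assumption based on common 8-bit image modes. Only `Image.open`, `ImageRedactorEngine().redact` and `save` come from the example above.

```python
def normalize_fill(fill):
    """Return the fill as an int or a 3-tuple of ints, or raise ValueError.

    Assumption: each channel is an 8-bit value in the range 0..255.
    """
    if isinstance(fill, bool):
        raise ValueError("fill must be an int or a sequence of three ints")
    if isinstance(fill, int):
        channels = (fill,)
    else:
        channels = tuple(fill)
        if len(channels) != 3:
            raise ValueError("a tuple fill needs exactly three channels")
    for channel in channels:
        if isinstance(channel, bool) or not isinstance(channel, int):
            raise ValueError("fill channels must be ints")
        if not 0 <= channel <= 255:
            raise ValueError("fill channels must be in the range 0..255")
    return fill if isinstance(fill, int) else channels


def redact_file(src_path, dst_path, fill=0):
    """Redact one image file and save the result (hypothetical wrapper)."""
    # Imported lazily so the validation helper works without the OCR stack.
    from PIL import Image
    from presidio_image_redactor import ImageRedactorEngine

    image = Image.open(src_path)
    redacted = ImageRedactorEngine().redact(image, normalize_fill(fill))
    redacted.save(dst_path)
```

For example, `redact_file("ocr_test.png", "new_image.png", (255, 192, 203))` would mirror the pink-fill example above.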

### As docker service

In the folder `presidio/presidio-image-redactor`, run:

```
docker-compose up -d
```

### HTTP API

### redact

Receives an image and an optional color fill (default is black). Redacts the PII text in
the image and returns a new redacted image.

```
POST /redact
```

Payload:

Sent as multipart form data. Contains the image file and a `data` field with the desired color fill.

```json
{
  "data": "{'color_fill':'0,0,0'}"
}
```

Result:

```
200 OK
```

curl example:

```
# use ocr_test.png as the image to redact, and 255 as the color fill. 
# out.png is the new redacted image received from the server.
curl -XPOST "http://localhost:3000/redact" -H "content-type: multipart/form-data" -F "image=@ocr_test.png" -F "data=\"{'color_fill':'255'}\"" > out.png
```

A Python script example can be found under:
/presidio/e2e-tests/tests/test_image_redactor.py
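
The curl call above can also be made from Python. The following is a minimal sketch using only the standard library. The helper names (`color_fill_field`, `build_multipart`, `redact_via_http`) are hypothetical and not part of Presidio, and the sketch assumes the service is listening on `http://localhost:3000` as in the curl example. The `data` field mirrors the single-quoted format shown in the payload above.

```python
import os
import urllib.request
import uuid


def color_fill_field(color_fill):
    """Format the fill like the documented payload, e.g. {'color_fill':'0,0,0'}."""
    if isinstance(color_fill, int):
        value = str(color_fill)
    else:
        value = ",".join(str(channel) for channel in color_fill)
    return "{'color_fill':'%s'}" % value


def build_multipart(image_name, image_bytes, color_fill, boundary=None):
    """Build a multipart/form-data body with an `image` file and a `data` field."""
    boundary = boundary or uuid.uuid4().hex
    parts = [
        f"--{boundary}".encode(),
        f'Content-Disposition: form-data; name="image"; filename="{image_name}"'.encode(),
        b"Content-Type: application/octet-stream",
        b"",
        image_bytes,
        f"--{boundary}".encode(),
        b'Content-Disposition: form-data; name="data"',
        b"",
        color_fill_field(color_fill).encode(),
        f"--{boundary}--".encode(),
        b"",
    ]
    return b"\r\n".join(parts), f"multipart/form-data; boundary={boundary}"


def redact_via_http(image_path, out_path, color_fill=0,
                    url="http://localhost:3000/redact"):
    """POST the image to the service and write the redacted image to out_path."""
    with open(image_path, "rb") as image_file:
        image_bytes = image_file.read()
    body, content_type = build_multipart(
        os.path.basename(image_path), image_bytes, color_fill
    )
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": content_type}, method="POST"
    )
    with urllib.request.urlopen(request) as response, open(out_path, "wb") as out:
        out.write(response.read())
```

With the docker service running, `redact_via_http("ocr_test.png", "out.png", 255)` would correspond to the curl command above.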

## Getting started (DICOM images)

This module only redacts pixel data and does not scrub text PHI which may exist in the DICOM metadata.

We highly recommend using the DICOM image redactor engine to redact text from images **before** scrubbing metadata PHI. To redact sensitive information from metadata, consider using another package such as the [Tools for Health Data Anonymization](https://github.com/microsoft/Tools-for-Health-Data-Anonymization).

To redact burnt-in text PHI in DICOM images, see the sample code below:

```python
import pydicom
from presidio_image_redactor import DicomImageRedactorEngine

# Set input and output paths
input_path = "path/to/your/dicom/file.dcm"
output_dir = "./output"

# Initialize the engine
engine = DicomImageRedactorEngine()

# Option 1: Redact from a loaded DICOM image
dicom_image = pydicom.dcmread(input_path)
redacted_dicom_image = engine.redact(dicom_image, fill="contrast")

# Option 2: Redact from a loaded DICOM image and return redacted regions
redacted_dicom_image, bboxes = engine.redact_and_return_bbox(dicom_image, fill="contrast")

# Option 3: Redact from DICOM file and save redacted regions as json file
engine.redact_from_file(input_path, output_dir, padding_width=25, fill="contrast", save_bboxes=True)

# Option 4: Redact from directory and save redacted regions as json files
ocr_kwargs = {"ocr_threshold": 50}
engine.redact_from_directory("path/to/your/dicom", output_dir, fill="background", save_bboxes=True, ocr_kwargs=ocr_kwargs)
```

See the example notebook for more details and visual confirmation of the output: [docs/samples/python/example_dicom_image_redactor.ipynb](../docs/samples/python/example_dicom_image_redactor.ipynb).
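
The bounding boxes returned by Option 2 can be inspected programmatically, for instance to log how much of an image was redacted. The following is a sketch under a stated assumption: it treats each box as a dict with `top`, `left`, `width` and `height` keys in pixels. That layout is not confirmed by this README, so check the actual output before relying on it. `summarize_bboxes` and `box_within` are hypothetical helpers.

```python
def summarize_bboxes(bboxes):
    """Return (number of boxes, summed box area in pixels).

    Assumption: every box is a dict with "width" and "height" keys.
    Overlapping boxes are counted twice in the summed area.
    """
    total_area = sum(box["width"] * box["height"] for box in bboxes)
    return len(bboxes), total_area


def box_within(box, rows, columns):
    """Check that a box lies fully inside an image of rows x columns pixels.

    Assumption: "top"/"left" give the upper-left corner of the box.
    """
    return (
        box["top"] >= 0
        and box["left"] >= 0
        and box["top"] + box["height"] <= rows
        and box["left"] + box["width"] <= columns
    )
```

Given `bboxes` from `redact_and_return_bbox`, `summarize_bboxes(bboxes)` would give a quick count and area, and `box_within(box, dicom_image.Rows, dicom_image.Columns)` could flag boxes that fall outside the pixel array.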

### Side note for Windows

If you are using a Windows machine, you may run into issues if file paths are too long. Unfortunately, this is not rare when working with DICOM images, which are often nested in directories with descriptive names.

To avoid errors where the code does not recognize an existing path because the file path is too long, please [enable long paths on your system](https://learn.microsoft.com/en-us/answers/questions/293227/longpathsenabled.html).
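
Before a batch run, it can help to see which files are at risk. The following is a minimal sketch, assuming the classic Windows `MAX_PATH` limit of 260 characters (including the terminating null). `find_long_paths` is a hypothetical helper and not part of Presidio.

```python
import os

# Classic Windows MAX_PATH limit; paths of this length or longer may fail
# unless long paths are enabled on the system.
WINDOWS_MAX_PATH = 260


def find_long_paths(root, limit=WINDOWS_MAX_PATH):
    """Return the absolute paths under root whose length reaches the limit."""
    long_paths = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full_path = os.path.abspath(os.path.join(dirpath, name))
            if len(full_path) >= limit:
                long_paths.append(full_path)
    return sorted(long_paths)
```

Running `find_long_paths("path/to/your/dicom")` before `redact_from_directory` lists the files that may not be found on a system without long paths enabled.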

### DICOM Data Citation

The DICOM data used for unit and integration testing for `DicomImageRedactorEngine` are stored in this repository with permission from the original dataset owners. Please see the dataset information as follows:

> Rutherford, M., Mun, S.K., Levine, B., Bennett, W.C., Smith, K., Farmer, P., Jarosz, J., Wagner, U., Farahani, K., Prior, F. (2021). A DICOM dataset for evaluation of medical image de-identification (Pseudo-PHI-DICOM-Data) [Data set]. The Cancer Imaging Archive. DOI: <https://doi.org/10.7937/s17z-r072>

            
