lx-anonymizer

Name	lx-anonymizer
Version	0.8.8.6
home_page	None
Summary	OCR-driven anonymization pipeline for medical reports and endoscopy frames
upload_time	2025-10-30 17:34:12
maintainer	None
docs_url	None
author	None
requires_python	>=3.12
license	MIT License Copyright (c) 2025 WG Lux Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
keywords	anonymization medical nlp ocr privacy
requirements	No requirements were recorded.

# LX Anonymizer

LX Anonymizer is a toolkit for de-identifying endoscopy frames and medical reports. It combines OCR pipelines, spaCy-based NER, heuristic sanitizers, and report-specific rules to redact or pseudonymize sensitive information while preserving clinical context.

## Highlights
- **End-to-end anonymization** of PDFs and frame sequences using OCR, NER, and pseudonymization helpers.
- **Modular pipeline** that lets you choose between Tesseract, TrOCR, ensemble OCR, and multiple metadata extractors.
- **Human-in-the-loop ready** outputs: original/anonymized text side by side, metadata JSON, and validation artefacts.
- **Extensible ruleset** covering device-specific renderers, fuzzy name matching, and language-specific replacements.

## Requirements
- Python 3.12+
- Linux or macOS (Windows support is experimental)
- NVIDIA GPU recommended for real-time video anonymization (CUDA 12.x). CPU-only processing works but is slower.
- Optional extras:
  - spaCy `de_core_news_lg` model (download after installation)
  - Torch vision/audio for video OCR workloads
  - Ollama-compatible LLMs for advanced metadata extraction
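
The interpreter and platform requirements above can be checked before installing. The following is a minimal sketch that uses only the standard library; the function name `check_environment` is illustrative and not part of lx-anonymizer.

```python
import platform
import sys

MIN_PYTHON = (3, 12)  # minimum version from the Requirements list


def check_environment(version_info=sys.version_info, system=None):
    """Return a list of human-readable problems; an empty list means OK."""
    system = system or platform.system()
    problems = []
    if tuple(version_info[:2]) < MIN_PYTHON:
        problems.append(
            f"Python {MIN_PYTHON[0]}.{MIN_PYTHON[1]}+ required, "
            f"found {version_info[0]}.{version_info[1]}"
        )
    if system == "Windows":
        # The README describes Windows support as experimental.
        problems.append("Windows support is experimental")
    return problems


if __name__ == "__main__":
    for problem in check_environment():
        print("warning:", problem)
```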

## Installation

### From PyPI
```bash
pip install lx-anonymizer
```

Install extras to tailor the footprint:
```bash
pip install "lx-anonymizer[gpu,ocr,llm,dev]"
```
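
To confirm which version ended up in the active environment, the distribution metadata can be queried with the standard library. This is a small sketch; `installed_version` is a hypothetical helper, not something the package provides.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(distribution="lx-anonymizer"):
    """Return the installed version string, or None if it is not installed."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    print(installed_version() or "lx-anonymizer is not installed")
```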

### From source
```bash
git clone https://github.com/wg-lux/lx-anonymizer.git
cd lx-anonymizer
uv sync
```

### Nix development shell
```bash
direnv allow
nix develop
```
This loads GPU, OCR, and tooling dependencies declared in `devenv.nix`.

## Model downloads
After installation, fetch the German spaCy model:
```bash
python -m spacy download de_core_news_lg
```
The first CLI run also downloads OCR checkpoints (EAST, TrOCR, etc.). For air-gapped deployments, download the archives listed in [`lx_anonymizer/settings.py`](lx_anonymizer/settings.py) in advance and place them in `~/.cache/lx-anonymizer`.
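
Before an offline run it can help to check that the spaCy model is importable and that the cache directory is populated. The sketch below assumes the model is installed as a regular Python package named `de_core_news_lg` (which is how `spacy download` installs it) and that checkpoints sit directly under `~/.cache/lx-anonymizer`; the helper names are illustrative, and the expected file names are not listed here because they are defined in `settings.py`.

```python
import importlib.util
from pathlib import Path

SPACY_MODEL = "de_core_news_lg"
CACHE_DIR = Path.home() / ".cache" / "lx-anonymizer"  # location named above


def model_installed(name=SPACY_MODEL):
    """True if the spaCy model package can be imported."""
    return importlib.util.find_spec(name) is not None


def cached_files(cache_dir=CACHE_DIR):
    """Return sorted relative paths of files under the cache directory.

    Returns an empty list if the directory does not exist.
    """
    cache_dir = Path(cache_dir)
    if not cache_dir.is_dir():
        return []
    return sorted(
        str(p.relative_to(cache_dir)) for p in cache_dir.rglob("*") if p.is_file()
    )


if __name__ == "__main__":
    print("spaCy model installed:", model_installed())
    print("cached files:", cached_files() or "none")
```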

## Quickstart

### CLI
```bash
python -m cli.report_reader process report.pdf --ensemble --output-dir ./anonymized
```
Useful options:
- `--llm-extractor {deepseek,medllama,llama3}` for LLM-powered metadata extraction.
- `--use-ocr` and `--ensemble` to switch OCR strategies.
- `batch` and `extract` sub-commands for folder processing or metadata-only runs.
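
When the CLI is driven from another script, the argument list can be assembled programmatically. This sketch only uses the `process` sub-command and the flags shown above; the flag order and the idea that `--ensemble` and `--llm-extractor` can be combined are assumptions, and `build_process_command` is a hypothetical helper.

```python
import shlex
import sys

LLM_EXTRACTORS = {"deepseek", "medllama", "llama3"}  # choices listed above


def build_process_command(pdf, output_dir, ensemble=False, llm_extractor=None):
    """Build the argv list for `python -m cli.report_reader process ...`."""
    cmd = [sys.executable, "-m", "cli.report_reader", "process", str(pdf)]
    if ensemble:
        cmd.append("--ensemble")
    if llm_extractor is not None:
        if llm_extractor not in LLM_EXTRACTORS:
            raise ValueError(f"unknown extractor: {llm_extractor}")
        cmd += ["--llm-extractor", llm_extractor]
    cmd += ["--output-dir", str(output_dir)]
    return cmd


if __name__ == "__main__":
    # Print the command instead of running it; pass the list to
    # subprocess.run(cmd, check=True) to execute it.
    print(shlex.join(build_process_command("report.pdf", "./anonymized", ensemble=True)))
```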

### Python API
```python
from lx_anonymizer import ReportReader

reader = ReportReader(locale="de_DE")
original, anonymized, meta = reader.process_report(
    pdf_path="/path/to/report.pdf",
    use_ensemble=True,
    use_llm_extractor="deepseek",
)
```
See [`tests/test_cli_integration.py`](tests/test_cli_integration.py) for more examples.
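
The three return values can be persisted side by side for later human review. The sketch below assumes that `original` and `anonymized` are strings and that `meta` is a dictionary; the README does not state the exact types, so `default=str` is used as a fallback for values that are not JSON-serializable. `save_outputs` and the file naming are illustrative.

```python
import json
from pathlib import Path


def save_outputs(original, anonymized, meta, out_dir, stem="report"):
    """Write original text, anonymized text, and metadata next to each other."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = {
        "original": out_dir / f"{stem}.original.txt",
        "anonymized": out_dir / f"{stem}.anonymized.txt",
        "meta": out_dir / f"{stem}.meta.json",
    }
    paths["original"].write_text(original, encoding="utf-8")
    paths["anonymized"].write_text(anonymized, encoding="utf-8")
    paths["meta"].write_text(
        json.dumps(meta, ensure_ascii=False, indent=2, default=str),
        encoding="utf-8",
    )
    return paths
```

Keep in mind that the `*.original.txt` file still contains identifying data and should be stored accordingly.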

## Data directories
By default, outputs live in `~/etc/lx-anonymizer/{data,temp}`. Adjust them in [`lx_anonymizer/directory_setup.py`](lx_anonymizer/directory_setup.py). Clean `temp` regularly to avoid large intermediate artefacts.
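
Regular cleanup of `temp` can be scripted. This is a minimal sketch that removes files older than a given age; the seven-day default and the `clean_temp` helper are assumptions, and the path simply mirrors the default location named above. It defaults to a dry run so nothing is deleted by accident.

```python
import time
from pathlib import Path

TEMP_DIR = Path.home() / "etc" / "lx-anonymizer" / "temp"  # default named above


def clean_temp(temp_dir=TEMP_DIR, max_age_days=7, now=None, dry_run=True):
    """Return files older than max_age_days; delete them unless dry_run is set."""
    temp_dir = Path(temp_dir)
    if not temp_dir.is_dir():
        return []
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    stale = []
    for path in sorted(temp_dir.rglob("*")):
        if path.is_file() and path.stat().st_mtime < cutoff:
            stale.append(path)
            if not dry_run:
                path.unlink()
    return stale


if __name__ == "__main__":
    for path in clean_temp():
        print("would delete:", path)
```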

## Development workflow
- Lint: `uv run flake8`
- Tests (CPU friendly): `uv run pytest -m "not gpu"`
  - GPU tests are marked and can be run with `-m gpu`
- Build wheel for release: `uv run python -m build`
- Full local check helper: `scripts/run_checks.sh`

## Project roadmap
1. Publish CPU-only wheel to TestPyPI.
2. Add optional extras for GPU/LLM workloads and slim default install.
3. Automate release workflow (wheel + sdist upload, GitHub release notes).
4. Expose REST/gRPC service with validation UI.

## Contributing
See [`CONTRIBUTING.md`](CONTRIBUTING.md) for contribution guidelines, testing instructions, and communication channels.

## License
Released under the [MIT License](LICENSE).

## Contact
Questions? Email lux@coloreg.de.

Raw data

{
    "_id": null,
    "home_page": null,
    "name": "lx-anonymizer",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.12",
    "maintainer_email": "WG Lux <lux@coloreg.de>",
    "keywords": "anonymization, medical, nlp, ocr, privacy",
    "author": null,
    "author_email": "Max Hild <maxhild10@gmail.com>, \"Thomas J. Lux\" <lux_t1@ukw.de>",
    "download_url": "https://files.pythonhosted.org/packages/db/d9/9fce6465b41968c19f20647ac2940db445d8442d7b665605e56e633e1336/lx_anonymizer-0.8.8.6.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "MIT License\n        \n        Copyright (c) 2025 WG Lux\n        \n        Permission is hereby granted, free of charge, to any person obtaining a copy\n        of this software and associated documentation files (the \"Software\"), to deal\n        in the Software without restriction, including without limitation the rights\n        to use, copy, modify, merge, publish, distribute, sublicense, and/or sell\n        copies of the Software, and to permit persons to whom the Software is\n        furnished to do so, subject to the following conditions:\n        \n        The above copyright notice and this permission notice shall be included in all\n        copies or substantial portions of the Software.\n        \n        THE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR\n        IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,\n        FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE\n        AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER\n        LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,\n        OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE\n        SOFTWARE.",
    "summary": "OCR-driven anonymization pipeline for medical reports and endoscopy frames",
    "version": "0.8.8.6",
    "project_urls": {
        "Changelog": "https://github.com/wg-lux/lx-anonymizer/blob/main/CHANGELOG.md",
        "Documentation": "https://github.com/wg-lux/lx-anonymizer#readme",
        "Homepage": "https://github.com/wg-lux/lx-anonymizer",
        "Issue Tracker": "https://github.com/wg-lux/lx-anonymizer/issues"
    },
    "split_keywords": [
        "anonymization",
        " medical",
        " nlp",
        " ocr",
        " privacy"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "bfcb7412cdf663e1a456f3c724c7ba97c826d428f1b8916c56500ec80c1a8c74",
                "md5": "3b7add762a279d136fa1f3ddf81f2367",
                "sha256": "f15797eb9d587150e2f78d9288a7af78373bea73d98e895c35a53e2e6435509e"
            },
            "downloads": -1,
            "filename": "lx_anonymizer-0.8.8.6-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "3b7add762a279d136fa1f3ddf81f2367",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.12",
            "size": 728734,
            "upload_time": "2025-10-30T17:34:09",
            "upload_time_iso_8601": "2025-10-30T17:34:09.955855Z",
            "url": "https://files.pythonhosted.org/packages/bf/cb/7412cdf663e1a456f3c724c7ba97c826d428f1b8916c56500ec80c1a8c74/lx_anonymizer-0.8.8.6-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "dbd99fce6465b41968c19f20647ac2940db445d8442d7b665605e56e633e1336",
                "md5": "34072677522962d7db693139f0c755f5",
                "sha256": "a9344eedab372cba38c790c49b7d6145948d30e5c7c020e3c26ed8d264e4b3cc"
            },
            "downloads": -1,
            "filename": "lx_anonymizer-0.8.8.6.tar.gz",
            "has_sig": false,
            "md5_digest": "34072677522962d7db693139f0c755f5",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.12",
            "size": 566661,
            "upload_time": "2025-10-30T17:34:12",
            "upload_time_iso_8601": "2025-10-30T17:34:12.124918Z",
            "url": "https://files.pythonhosted.org/packages/db/d9/9fce6465b41968c19f20647ac2940db445d8442d7b665605e56e633e1336/lx_anonymizer-0.8.8.6.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-10-30 17:34:12",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "wg-lux",
    "github_project": "lx-anonymizer",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "lcname": "lx-anonymizer"
}
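
The `sha256` digests in the record above can be used to verify a downloaded artefact before installing it offline. This is a sketch using only `hashlib`; the helper names are illustrative.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Compute the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected_sha256):
    """True if the file's SHA-256 digest matches the expected hex string."""
    return sha256_of(path) == expected_sha256.strip().lower()
```

For the sdist listed above, the call would be `verify_download("lx_anonymizer-0.8.8.6.tar.gz", "a9344eedab372cba38c790c49b7d6145948d30e5c7c020e3c26ed8d264e4b3cc")`.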
