SpectrePDF


- **Name**: SpectrePDF
- **Version**: 0.2.1
- **Summary**: A tool for processing and redacting PDFs based on target words using OCR.
- **Upload time**: 2025-07-14 17:57:14
- **Requires Python**: >=3.11
- **Keywords**: pdf, redaction, ocr, security, privacy, text extraction
- **Requirements**: none recorded

# SpectrePDF: Advanced PDF Redaction and Annotation Tool

Welcome to **SpectrePDF**, a powerful Python library designed for detecting, annotating, and redacting sensitive information in PDF documents using OCR (Optical Character Recognition). Whether you're handling confidential reports, legal documents, or personal files, SpectrePDF makes it easy to identify specific words or phrases and either highlight them with boxes or redact them securely—replacing them with custom text or blacking them out entirely.

Built with efficiency and accuracy in mind, SpectrePDF leverages high-resolution rendering and intelligent word grouping to ensure precise results, even on complex layouts. It's ideal for privacy compliance (e.g., GDPR, HIPAA), data anonymization, or simply reviewing PDFs for key terms.

## Why Choose SpectrePDF?
- **Accurate OCR Detection**: Uses Tesseract OCR to scan rendered PDF pages at customizable DPI for reliable word recognition.
- **Flexible Redaction Options**: Redact with custom replacement text, black boxes, or just draw outlines for review.
- **Smart Grouping**: Automatically groups words into lines and merges adjacent targets for clean, professional results.
- **High Performance**: Processes multi-page PDFs with progress tracking via `tqdm`.
- **Open-Source and Extensible**: Easy to integrate into your workflows, with verbose logging for debugging.

SpectrePDF is perfect for developers, data privacy officers, researchers, and anyone needing robust PDF manipulation tools.

## Features
- **Word Detection**: Scan PDFs for specific target words or phrases using OCR.
- **Box Annotation**:
  - Draw colored outlines around detected words (blue for targets, red for others).
  - Option to show boxes around all words or only targets.
- **Redaction Modes**:
  - Replace targets with custom text from a JSON dictionary (e.g., anonymize names like "John Doe" to "[REDACTED]").
  - Blackout targets completely for irreversible redaction.
  - Whiteout and overlay replacement text with auto-scaled font sizing.
- **Customization**:
  - Adjustable DPI for rendering (higher for better accuracy, e.g., 500 DPI).
  - Verbosity levels for silent, basic, or detailed output.
  - Support for custom Tesseract executable path.
- **Efficiency Tools**: Progress bars, timing metrics, and error handling for seamless processing.
- **Output**: Generates a new PDF with modifications, preserving original layout.

## Installation
SpectrePDF is a Python package that requires Python 3.11 or later (per the package metadata). Install it from PyPI with pip, or clone the repository and install the dependencies manually.

```bash
pip install spectrepdf  # Install the published package from PyPI
# Or clone and install:
git clone https://github.com/udiram/spectrePDF.git
cd spectrePDF
pip install -r requirements.txt
```

### Dependencies
SpectrePDF relies on the following libraries (install via `pip`):
- `pymupdf` (PyMuPDF for PDF handling)
- `Pillow` (PIL for image manipulation)
- `pytesseract` (Tesseract OCR wrapper)
- `img2pdf` (Image to PDF conversion)
- `tqdm` (Progress bars)
- `statistics` (standard library, used for median calculations; no installation needed)

Additionally, install Tesseract OCR on your system:
- Windows: Download from [Tesseract at UB Mannheim](https://github.com/UB-Mannheim/tesseract/wiki) and add to PATH.
- macOS: `brew install tesseract`
- Linux: `sudo apt install tesseract-ocr`

No internet access is required during runtime—everything runs locally.
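
Before a first run, it can help to confirm that the Python dependencies listed above are importable. The following is a minimal sketch using only the standard library; the mapping from distribution names to import names is an assumption (PyMuPDF has historically been imported as `fitz`, and newer releases also expose `pymupdf`), and `missing_dependencies` is a hypothetical helper, not part of SpectrePDF.

```python
import importlib.util

# Distribution name -> candidate import names (assumed mapping, see note above).
REQUIRED = {
    "pymupdf": ["pymupdf", "fitz"],
    "Pillow": ["PIL"],
    "pytesseract": ["pytesseract"],
    "img2pdf": ["img2pdf"],
    "tqdm": ["tqdm"],
}


def missing_dependencies(required=REQUIRED):
    """Return the distribution names for which no candidate module is found."""
    missing = []
    for dist, modules in required.items():
        # find_spec only locates the module; it does not import it.
        if not any(importlib.util.find_spec(name) is not None for name in modules):
            missing.append(dist)
    return missing


print("Missing packages:", missing_dependencies() or "none")
```

Anything reported as missing can be installed with `pip install <name>`.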

## Quick Start
1. **Prepare Your Redaction Dictionary**: Create a `redaction.json` file mapping target words to replacements. Example:
   ```json
   {
       "confidential": "[REDACTED]",
       "secret": "[CLASSIFIED]",
       "john doe": "[ANONYMIZED]"
   }
   ```
   Keys are case-insensitive during detection.

2. **Run the Processor**: Use the `process_pdf` function from `SpectrePDF.anonymizer`.

   Here's a basic example script:

   ```python
   from SpectrePDF.anonymizer import process_pdf
   import json

   # Load targets from redaction dict keys
   with open("redaction.json", "r") as f:
       redaction_dict = json.load(f)
   target_words = list(redaction_dict.keys())

   # Process the PDF (redact with replacements)
   process_pdf(
       input_pdf="input.pdf",
       output_pdf="output_redacted.pdf",
       target_words=target_words,
       redaction_json_path="redaction.json",
       redact_targets=True,
       black_redaction=False,  # False for text replacement; True for black boxes
       dpi=500,
       tesseract_cmd="/usr/local/bin/tesseract",  # Adjust to your Tesseract path
       verbosity=1  # 0: silent, 1: progress, 2: detailed
   )
   ```

   This will scan `input.pdf` for words like "confidential" or "john doe", redact them with replacements from the JSON, and save the result as `output_redacted.pdf`.
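
Because keys are matched case-insensitively, it can be convenient to validate and lowercase the dictionary once before deriving `target_words` from it. The sketch below is one way to do that; `parse_redaction_json` is a hypothetical helper written for illustration and is not part of SpectrePDF.

```python
import json


def parse_redaction_json(text):
    """Parse redaction JSON text and return a dict with lowercased keys.

    Raises ValueError if the JSON is not an object or a replacement
    is not a string.
    """
    data = json.loads(text)
    if not isinstance(data, dict):
        raise ValueError("redaction JSON must be an object mapping targets to replacements")
    normalized = {}
    for key, value in data.items():
        if not isinstance(value, str):
            raise ValueError(f"replacement for {key!r} must be a string")
        normalized[key.strip().lower()] = value
    return normalized


example = '{"Confidential": "[REDACTED]", "John Doe": "[ANONYMIZED]"}'
redaction_dict = parse_redaction_json(example)
target_words = list(redaction_dict)  # ["confidential", "john doe"]
```

In a script, the text would typically come from `open("redaction.json", encoding="utf-8").read()`.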

## Understanding the `process_pdf` Function

The `process_pdf` function is the core entry point of the SpectrePDF library. It enables users to load a PDF document, perform OCR-based detection of specific words or phrases, and then either annotate the document by drawing boxes around detected elements or redact sensitive information. The process involves rendering PDF pages as high-resolution images, running OCR to extract text data, grouping words logically, and applying modifications before saving a new PDF.

This function is highly configurable, allowing for various modes of operation such as simple detection with annotations, targeted redaction with replacements, or secure blackouts. Below, I'll break down its purpose, workflow, and every parameter in detail to ensure you can use it effectively. All parameters are designed to balance flexibility, accuracy, and performance.

#### Function Signature and Overview
```python
def process_pdf(
    input_pdf: str,
    output_pdf: str,
    target_words: list,
    redaction_json_path: str,
    show_all_boxes: bool = False,
    only_target_boxes: bool = False,
    redact_targets: bool = False,
    black_redaction: bool = False,
    tesseract_cmd: str = None,
    dpi: int = 500,
    verbosity: int = 1
):
    """
    Process a PDF to detect, optionally draw boxes around, or redact and replace specific target words using OCR.
    """
```
- **Core Workflow**:
  1. Open the input PDF using PyMuPDF.
  2. Render each page as an image at the specified DPI.
  3. Perform OCR on each image using Tesseract to extract words, positions, and bounding boxes.
  4. Collect and group words into lines based on vertical alignment.
  5. Identify target words (case-insensitive partial matches) and merge adjacent targets if needed.
  6. Depending on parameters, either draw boxes for annotation or apply redaction.
  7. Convert modified images back to a PDF and save it.
  8. Handle errors gracefully (e.g., file not found, invalid JSON) and provide timing metrics.

The function raises exceptions for invalid inputs (e.g., non-list `target_words`, missing files) to prevent silent failures. It supports multi-page PDFs and uses progress bars (`tqdm`) for user-friendly feedback.
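
To make step 4 of the workflow more concrete, here is a rough sketch of grouping OCR words into lines by vertical alignment. It is an illustration under stated assumptions, not SpectrePDF's actual code: the word dictionaries use the `left`, `top`, `width`, and `height` fields that Tesseract reports per word, and the tolerance rule (a fraction of the median word height) is a guess at how the median calculation might be used.

```python
from statistics import median


def group_into_lines(words, tolerance=0.5):
    """Group word boxes into lines of text.

    `words` is a list of dicts with keys text, left, top, width, height
    (pixels). A word joins the current line when its vertical center is
    within `tolerance` * median word height of that line's center.
    """
    if not words:
        return []
    threshold = tolerance * median(w["height"] for w in words)
    lines = []
    for word in sorted(words, key=lambda w: w["top"] + w["height"] / 2):
        center = word["top"] + word["height"] / 2
        if lines and abs(center - lines[-1]["center"]) < threshold:
            line = lines[-1]
            line["words"].append(word)
            # Update the running mean of the line's vertical center.
            line["center"] += (center - line["center"]) / len(line["words"])
        else:
            lines.append({"center": center, "words": [word]})
    # Within each line, return words in reading order (left to right).
    return [sorted(line["words"], key=lambda w: w["left"]) for line in lines]


sample = [
    {"text": "world", "left": 60, "top": 11, "width": 40, "height": 20},
    {"text": "Hello", "left": 10, "top": 10, "width": 40, "height": 20},
    {"text": "Next", "left": 10, "top": 50, "width": 40, "height": 20},
]
grouped = [[w["text"] for w in line] for line in group_into_lines(sample)]
```

With the sample above, `grouped` comes out as `[["Hello", "world"], ["Next"]]`.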

Now, let's discuss **all the options (parameters)** in detail, including their types, defaults, interactions, and best practices.

#### 1. `input_pdf` (Required, Type: str)
   - **Description**: The full path to the input PDF file you want to process. This can be any valid PDF, including scanned documents or those with embedded text (though OCR is always used for consistency).
   - **Usage Notes**: Ensure the file exists and is readable. Relative or absolute paths are fine. If the path is invalid, a `FileNotFoundError` is raised.
   - **Example**: `"documents/confidential_report.pdf"`
   - **Interactions**: None specific, but higher-complexity PDFs (e.g., with images or tables) may benefit from increased `dpi` for better OCR results.
   - **Best Practice**: Use absolute paths to avoid working directory issues in scripts.

#### 2. `output_pdf` (Required, Type: str)
   - **Description**: The full path where the processed PDF will be saved. This will be a new file containing the annotated or redacted pages.
   - **Usage Notes**: The output is always a PDF generated from modified images, preserving the original layout as closely as possible. Make sure the target directory exists before running, since a plain file write does not create missing parent directories.
   - **Example**: `"output/annotated_report.pdf"`
   - **Interactions**: Overwrites existing files without warning, so choose a unique name.
   - **Best Practice**: Include descriptive suffixes like "_redacted" or "_highlighted" to distinguish outputs.
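
Since existing files are overwritten without warning, a small guard before calling `process_pdf` can help. The sketch below creates the parent directory and refuses to reuse an existing path unless asked to; `prepare_output_path` is a hypothetical helper for illustration, not part of SpectrePDF.

```python
from pathlib import Path


def prepare_output_path(output_pdf, overwrite=False):
    """Create the parent directory and guard against accidental overwrites."""
    out = Path(output_pdf)
    # Create any missing parent directories; harmless if they already exist.
    out.parent.mkdir(parents=True, exist_ok=True)
    if out.exists() and not overwrite:
        raise FileExistsError(f"{out} already exists; pass overwrite=True to replace it")
    return str(out)
```

The returned string can be passed directly as `output_pdf`.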

#### 3. `target_words` (Required, Type: list)
   - **Description**: A list of strings representing the words or phrases to detect as "targets." Detection is case-insensitive and matches if any target substring appears in a word (e.g., "conf" would match "confidential").
   - **Usage Notes**: Must be a list; single strings will raise a `ValueError`. Empty list is allowed but results in no targets (useful for modes like `show_all_boxes=True`).
   - **Example**: `["confidential", "secret", "john doe"]`
   - **Interactions**: Targets are derived from or used alongside `redaction_json_path`. In redaction modes, only these words trigger changes. Combine with `only_target_boxes` to ignore non-targets.
   - **Best Practice**: Use lowercase for consistency, as matching is lowercased. For phrases, include spaces (e.g., "top secret").
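
The matching rule described above (case-insensitive substring matching, with multi-word phrases spanning adjacent words) can be sketched roughly as follows. This is an illustration of the idea, not the library's implementation: `find_targets` and the `(text, box)` tuple layout are assumptions made for the example.

```python
def find_targets(line, targets):
    """Return (target, merged_box) pairs found in one line of OCR words.

    `line` is a left-to-right list of (text, (x0, y0, x1, y1)) tuples.
    A multi-word target matches a run of consecutive words when each
    target token is a substring of the corresponding lowercased word.
    """
    hits = []
    for target in targets:
        tokens = target.lower().split()
        if not tokens:
            continue
        for start in range(len(line) - len(tokens) + 1):
            window = line[start:start + len(tokens)]
            if all(tok in text.lower() for tok, (text, _) in zip(tokens, window)):
                boxes = [box for _, box in window]
                # Merge adjacent word boxes into one box covering the phrase.
                merged = (
                    min(b[0] for b in boxes),
                    min(b[1] for b in boxes),
                    max(b[2] for b in boxes),
                    max(b[3] for b in boxes),
                )
                hits.append((target, merged))
    return hits


line = [
    ("Mr.", (0, 0, 30, 20)),
    ("John", (35, 0, 80, 20)),
    ("Doe,", (85, 2, 120, 22)),
    ("Confidential", (130, 0, 250, 20)),
]
hits = find_targets(line, ["john doe", "conf"])
```

Here `"john doe"` yields a single merged box `(35, 0, 120, 22)`, and `"conf"` matches inside "Confidential". Note that substring matching is broad: a short target such as `"conf"` would also match "conference".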

#### 4. `redaction_json_path` (Required, Type: str)
   - **Description**: Path to a JSON file containing a dictionary that maps target words (keys) to their replacement strings (values). This is used during redaction to overlay custom text.
   - **Usage Notes**: The JSON must be a valid dict; invalid formats raise a `ValueError`. Keys should match `target_words` (case-insensitive). Even if not redacting, this param is required but ignored in annotation modes.
   - **Example**: `"config/redaction.json"` with content: `{"confidential": "[REDACTED]", "john doe": "[ANONYMIZED]"}`
   - **Interactions**: Essential for `redact_targets=True` (unless `black_redaction=True`, where replacements are skipped). If a target lacks a mapping, it's redacted without replacement (fallback behavior).
   - **Best Practice**: Keep keys in sync with `target_words`. Use placeholders like "[REDACTED]" for compliance.

#### 5. `show_all_boxes` (Optional, Default: False, Type: bool)
   - **Description**: If True, draws black outline boxes around **all** detected words (targets and non-targets) for comprehensive annotation. Useful for debugging OCR accuracy or visualizing text extraction.
   - **Usage Notes**: Only active when `redact_targets=False` (annotation mode). Overrides color distinctions—everything gets black boxes.
   - **Example**: Set to `True` for a full word-level markup.
   - **Interactions**: Conflicts with redaction; if `redact_targets=True`, this is ignored. Combine with `only_target_boxes=False` for maximum coverage.
   - **Best Practice**: Use for initial tests to verify word detection before applying targeted changes.

#### 6. `only_target_boxes` (Optional, Default: False, Type: bool)
   - **Description**: If True, processes (draws or redacts) only boxes containing target words, ignoring non-targets. This focuses operations on sensitive areas.
   - **Usage Notes**: Applies in both annotation and redaction modes. Reduces processing overhead on large documents.
   - **Example**: Set to `True` when you only care about highlighting/redacting specific terms.
   - **Interactions**: In annotation mode without `show_all_boxes`, targets get blue boxes, non-targets are skipped. In redaction, only targets are modified.
   - **Best Practice**: Enable for efficiency in targeted workflows; disable for broader analysis.

#### 7. `redact_targets` (Optional, Default: False, Type: bool)
   - **Description**: If True, switches to redaction mode: targets are covered (whited out or blacked out) and optionally replaced with text from the redaction dict. If False, defaults to annotation mode (drawing boxes).
   - **Usage Notes**: Primary mode toggle. In redaction, boxes are filled; no outlines are drawn.
   - **Example**: Set to `True` for privacy-focused tasks like anonymization.
   - **Interactions**: Enables `black_redaction`. If True, `show_all_boxes` is ignored. Uses `redaction_json_path` for replacements.
   - **Best Practice**: Combine with `black_redaction=False` for readable redactions (e.g., legal docs) or True for irreversible hiding.

#### 8. `black_redaction` (Optional, Default: False, Type: bool)
   - **Description**: If True **and** `redact_targets=True`, fills target boxes with solid black without any replacement text. This is for permanent, non-reversible redaction.
   - **Usage Notes**: Ignores the redaction dict's replacements. Only affects targets.
   - **Example**: Set to `True` for high-security scenarios where even placeholders are too revealing.
   - **Interactions**: Requires `redact_targets=True`; otherwise ignored. No font sizing or text overlay occurs.
   - **Best Practice**: Use sparingly, as it can disrupt document flow; test with lower DPI first.

#### 9. `tesseract_cmd` (Optional, Default: None, Type: str)
   - **Description**: Optional path to the Tesseract OCR executable. If None, assumes Tesseract is in your system's PATH.
   - **Usage Notes**: Useful for custom installations or environments without PATH setup. Validates the path exists.
   - **Example**: `"C:\\Program Files\\Tesseract-OCR\\tesseract.exe"` on Windows.
   - **Interactions**: Affects OCR performance. If invalid, raises `FileNotFoundError`.
   - **Best Practice**: Provide if Tesseract isn't globally available; otherwise, leave as None for simplicity.
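
If you want to fail early with a clear message, the executable can be located before calling `process_pdf`. The following is a sketch using only the standard library; `resolve_tesseract` is a hypothetical helper, and the assumption that the executable is named `tesseract` on PATH matches the install commands above but may not hold for every setup.

```python
import shutil
from pathlib import Path


def resolve_tesseract(tesseract_cmd=None):
    """Return a usable Tesseract path or raise FileNotFoundError."""
    if tesseract_cmd is not None:
        # An explicit path must point at an existing file.
        if not Path(tesseract_cmd).is_file():
            raise FileNotFoundError(f"Tesseract executable not found: {tesseract_cmd}")
        return tesseract_cmd
    # Otherwise fall back to searching PATH.
    found = shutil.which("tesseract")
    if found is None:
        raise FileNotFoundError("Tesseract is not on PATH; pass tesseract_cmd explicitly")
    return found
```

The resolved path can then be passed as `tesseract_cmd`.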

#### 10. `dpi` (Optional, Default: 500, Type: int)
   - **Description**: The resolution (dots per inch) for rendering PDF pages as images before OCR. Higher values improve accuracy but increase processing time and memory use.
   - **Usage Notes**: Minimum practical is ~200; 500 is a good balance for most documents. Affects image size and OCR quality.
   - **Example**: `300` for faster runs, `600` for high-precision needs.
   - **Interactions**: Directly impacts OCR reliability—low DPI may miss small text. Scales with page complexity.
   - **Best Practice**: Start at 500; adjust based on document font size and trial runs.
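
To get a feel for the memory cost, recall that PDF page sizes are measured in points (72 per inch), so the rendered image grows with the square of the DPI. The helper below is a back-of-the-envelope sketch; `rendered_size` is a hypothetical name, and the estimate assumes an uncompressed 8-bit RGB image.

```python
def rendered_size(width_pt, height_pt, dpi, channels=3):
    """Return (width_px, height_px, raw_bytes) for a page rendered at `dpi`."""
    width_px = round(width_pt * dpi / 72)   # 72 points per inch
    height_px = round(height_pt * dpi / 72)
    return width_px, height_px, width_px * height_px * channels


# US Letter is 612 x 792 points (8.5 x 11 inches).
for dpi in (300, 500):
    w, h, raw = rendered_size(612, 792, dpi)
    print(f"{dpi} DPI: {w} x {h} px, about {raw / 1e6:.0f} MB uncompressed")
```

A Letter page is 2550 x 3300 pixels at 300 DPI and 4250 x 5500 pixels at 500 DPI, so going from 300 to 500 DPI nearly triples the raw image size per page.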

#### 11. `verbosity` (Optional, Default: 1, Type: int)
   - **Description**: Controls output logging level: 0 (silent, no prints), 1 (basic progress bars and timings), 2 (detailed logs like word counts, phrases).
   - **Usage Notes**: Uses `print` for output and `tqdm` for progress (disabled at 0). Helps with monitoring long processes.
   - **Example**: `2` for debugging, `0` for background scripts.
   - **Interactions**: Affects all steps—e.g., at 2, you'll see per-page details.
   - **Best Practice**: Use 1 for most cases; 2 when troubleshooting OCR or grouping issues.

#### Key Interactions and Modes Summary
- **Annotation Mode** (`redact_targets=False`): Draws boxes. Use `show_all_boxes` for all words, `only_target_boxes` to focus.
- **Redaction Mode** (`redact_targets=True`): Modifies targets. Toggle `black_redaction` for style.
- **Common Pitfalls**: Ensure `target_words` matches redaction keys. Test with small PDFs first.
- **Performance**: Processing time scales with page count, DPI, and document size. For large files, lower the DPI, and keep `verbosity=1` so the progress bar shows how far along the run is.
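
The flag interactions above can be summarized as a small decision function. This is a reading of the documented behavior, not code from the library: `describe_mode` and its return labels are made up for illustration, and the precedence of `show_all_boxes` over `only_target_boxes` in annotation mode is an assumption inferred from the parameter descriptions.

```python
def describe_mode(show_all_boxes=False, only_target_boxes=False,
                  redact_targets=False, black_redaction=False):
    """Return a label for the mode these flags would select."""
    if redact_targets:
        # show_all_boxes is ignored in redaction mode.
        return "blackout" if black_redaction else "replace"
    if show_all_boxes:
        return "annotate-all"       # outline every detected word
    if only_target_boxes:
        return "annotate-targets"   # outline targets only
    return "annotate-colored"       # targets and non-targets in different colors


print(describe_mode(redact_targets=True, black_redaction=True))
```

Note that `black_redaction=True` on its own has no effect; without `redact_targets=True` the function stays in an annotation mode.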

## Usage Examples
SpectrePDF offers versatile modes. Here are some scenarios with code snippets:

### 1. Detect and Highlight Target Words (Draw Blue Boxes)
Review a PDF by outlining targets in blue.

```python
process_pdf(
    input_pdf="report.pdf",
    output_pdf="highlighted.pdf",
    target_words=["sensitive", "private"],
    redaction_json_path="redaction.json",  # Still needed, even if not redacting
    show_all_boxes=False,  # Only draw around targets
    redact_targets=False,  # Don't redact, just annotate
    dpi=300,
    verbosity=2
)
```

**What Happens**: Pages are rendered, OCR detects words, blue boxes are drawn around matches like "sensitive", and the PDF is rebuilt.

### 2. Redact with Custom Text Replacements
Anonymize names or terms while keeping the document readable.

```python
process_pdf(
    input_pdf="patient_records.pdf",
    output_pdf="anonymized.pdf",
    target_words=["patient name", "ssn"],
    redaction_json_path="redaction.json",  # e.g., {"patient name": "[PATIENT]", "ssn": "[REDACTED]"}
    redact_targets=True,
    black_redaction=False,
    only_target_boxes=True,  # Focus only on targets
    dpi=500,
    verbosity=1
)
```

**What Happens**: Targets are whited out, replaced with fitting text (auto-sized font), and non-targets remain untouched.
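
The auto-sizing step can be pictured as a search for the largest font size whose rendered text still fits the whited-out box. The sketch below shows that idea with a binary search; it is not SpectrePDF's implementation. `fit_font_size` is a hypothetical helper, and `approx_measure` is a crude stand-in (half an em per character) for a real text-measurement call from an imaging library.

```python
def fit_font_size(text, box_width, box_height, measure, min_size=6, max_size=200):
    """Return the largest size in [min_size, max_size] whose text fits the box.

    `measure(text, size)` must return (width, height) and grow with size.
    If even min_size does not fit, min_size is returned.
    """
    best = min_size
    lo, hi = min_size, max_size
    while lo <= hi:
        mid = (lo + hi) // 2
        width, height = measure(text, mid)
        if width <= box_width and height <= box_height:
            best = mid      # fits: try a larger size
            lo = mid + 1
        else:
            hi = mid - 1    # too big: try a smaller size
    return best


def approx_measure(text, size):
    """Rough estimate: each character is half an em wide, one em tall."""
    return 0.5 * size * len(text), size


size = fit_font_size("[REDACTED]", box_width=100, box_height=30, measure=approx_measure)
```

With this approximate measure, the ten-character placeholder fits a 100 x 30 pixel box at size 20. In practice the measurement function would wrap real font metrics rather than this estimate.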

### 3. Blackout Redaction for Maximum Security
Irreversibly hide sensitive info with black boxes.

```python
process_pdf(
    input_pdf="classified_doc.pdf",
    output_pdf="blacked_out.pdf",
    target_words=["top secret"],
    redaction_json_path="redaction.json",  # Replacement dict ignored in black mode
    redact_targets=True,
    black_redaction=True,
    dpi=400,
    verbosity=1
)
```

**What Happens**: Matching words/phrases are filled with solid black—no text overlay.

### 4. Debug Mode: Draw Boxes Around All Words
Inspect OCR accuracy by outlining every detected word (black for all).

```python
process_pdf(
    input_pdf="test.pdf",
    output_pdf="debug_all_boxes.pdf",
    target_words=[],  # No targets needed for this mode
    redaction_json_path="redaction.json",
    show_all_boxes=True,
    redact_targets=False,
    verbosity=2
)
```

**What Happens**: Every word gets a black outline, helping you verify layout and detection.

### Advanced Tips
- **Multi-Page PDFs**: Handles large documents efficiently with progress tracking.
- **Font Handling**: Automatically scales replacement text to fit boxes; falls back to default font if "arial.ttf" is unavailable.
- **Error Handling**: Raises informative exceptions for missing files, invalid inputs, or OCR failures.
- **Performance**: Higher DPI improves accuracy but increases processing time. Start with 300-500 DPI.

## Contributing
We welcome contributions! Fork the repo, create a branch, and submit a pull request. Ideas for improvements:
- Support for more OCR languages.
- Batch processing multiple PDFs.
- GUI integration.

Report issues on GitHub.

## License
MIT License. See [LICENSE](LICENSE) for details.

## Acknowledgments
- Powered by PyMuPDF, Tesseract, and Pillow.
- Inspired by real-world needs for secure document handling.

Get started with SpectrePDF today and take control of your PDF privacy! If you have questions, check the code or open an issue. 🚀

            

Raw data

            {
    "_id": null,
    "home_page": null,
    "name": "SpectrePDF",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.11",
    "maintainer_email": null,
    "keywords": "pdf, redaction, ocr, security, privacy, text extraction",
    "author": null,
    "author_email": "Udbhav Ram <udbhavram41@gmail.com>",
    "download_url": "https://files.pythonhosted.org/packages/7c/31/3bb6ea554b61b5dc48f2012d6f2db33c274d2363f07627adfc9c22f53b56/spectrepdf-0.2.1.tar.gz",
    "platform": null,
    "description": "# SpectrePDF: Advanced PDF Redaction and Annotation Tool\r\n\r\nWelcome to **SpectrePDF**, a powerful Python library designed for detecting, annotating, and redacting sensitive information in PDF documents using OCR (Optical Character Recognition). Whether you're handling confidential reports, legal documents, or personal files, SpectrePDF makes it easy to identify specific words or phrases and either highlight them with boxes or redact them securely\u2014replacing them with custom text or blacking them out entirely.\r\n\r\nBuilt with efficiency and accuracy in mind, SpectrePDF leverages high-resolution rendering and intelligent word grouping to ensure precise results, even on complex layouts. It's ideal for privacy compliance (e.g., GDPR, HIPAA), data anonymization, or simply reviewing PDFs for key terms.\r\n\r\n## Why Choose SpectrePDF?\r\n- **Accurate OCR Detection**: Uses Tesseract OCR to scan rendered PDF pages at customizable DPI for reliable word recognition.\r\n- **Flexible Redaction Options**: Redact with custom replacement text, black boxes, or just draw outlines for review.\r\n- **Smart Grouping**: Automatically groups words into lines and merges adjacent targets for clean, professional results.\r\n- **High Performance**: Processes multi-page PDFs with progress tracking via `tqdm`.\r\n- **Open-Source and Extensible**: Easy to integrate into your workflows, with verbose logging for debugging.\r\n\r\nSpectrePDF is perfect for developers, data privacy officers, researchers, and anyone needing robust PDF manipulation tools.\r\n\r\n## Features\r\n- **Word Detection**: Scan PDFs for specific target words or phrases using OCR.\r\n- **Box Annotation**:\r\n  - Draw colored outlines around detected words (blue for targets, red for others).\r\n  - Option to show boxes around all words or only targets.\r\n- **Redaction Modes**:\r\n  - Replace targets with custom text from a JSON dictionary (e.g., anonymize names like \"John Doe\" to 
\"[REDACTED]\").\r\n  - Blackout targets completely for irreversible redaction.\r\n  - Whiteout and overlay replacement text with auto-scaled font sizing.\r\n- **Customization**:\r\n  - Adjustable DPI for rendering (higher for better accuracy, e.g., 500 DPI).\r\n  - Verbosity levels for silent, basic, or detailed output.\r\n  - Support for custom Tesseract executable path.\r\n- **Efficiency Tools**: Progress bars, timing metrics, and error handling for seamless processing.\r\n- **Output**: Generates a new PDF with modifications, preserving original layout.\r\n\r\n## Installation\r\nSpectrePDF is a Python package that requires Python 3.6+. Install it via pip (assuming you've packaged it; if not, clone the repo and install dependencies manually).\r\n\r\n```bash\r\npip install spectrepdf  # If published on PyPI; otherwise, use git\r\n# Or clone and install:\r\ngit clone https://github.com/udiram/spectrePDF.git\r\ncd spectrePDF\r\npip install -r requirements.txt\r\n```\r\n\r\n### Dependencies\r\nSpectrePDF relies on the following libraries (install via `pip`):\r\n- `pymupdf` (PyMuPDF for PDF handling)\r\n- `Pillow` (PIL for image manipulation)\r\n- `pytesseract` (Tesseract OCR wrapper)\r\n- `img2pdf` (Image to PDF conversion)\r\n- `tqdm` (Progress bars)\r\n- `statistics` (Standard library, for median calculations)\r\n\r\nAdditionally, install Tesseract OCR on your system:\r\n- Windows: Download from [Tesseract at UB Mannheim](https://github.com/UB-Mannheim/tesseract/wiki) and add to PATH.\r\n- macOS: `brew install tesseract`\r\n- Linux: `sudo apt install tesseract-ocr`\r\n\r\nNo internet access is required during runtime\u2014everything runs locally.\r\n\r\n## Quick Start\r\n1. **Prepare Your Redaction Dictionary**: Create a `redaction.json` file mapping target words to replacements. 
Example:\r\n   ```json\r\n   {\r\n       \"confidential\": \"[REDACTED]\",\r\n       \"secret\": \"[CLASSIFIED]\",\r\n       \"john doe\": \"[ANONYMIZED]\"\r\n   }\r\n   ```\r\n   Keys are case-insensitive during detection.\r\n\r\n2. **Run the Processor**: Use the `process_pdf` function from `SpectrePDF.anonymizer`.\r\n\r\n   Here's a basic example script:\r\n\r\n   ```python\r\n   from SpectrePDF.anonymizer import process_pdf\r\n   import json\r\n\r\n   # Load targets from redaction dict keys\r\n   with open(\"redaction.json\", \"r\") as f:\r\n       redaction_dict = json.load(f)\r\n   target_words = list(redaction_dict.keys())\r\n\r\n   # Process the PDF (redact with replacements)\r\n   process_pdf(\r\n       input_pdf=\"input.pdf\",\r\n       output_pdf=\"output_redacted.pdf\",\r\n       target_words=target_words,\r\n       redaction_json_path=\"redaction.json\",\r\n       redact_targets=True,\r\n       black_redaction=False,  # False for text replacement; True for black boxes\r\n       dpi=500,\r\n       tesseract_cmd=\"/usr/local/bin/tesseract\",  # Adjust to your Tesseract path\r\n       verbosity=1  # 0: silent, 1: progress, 2: detailed\r\n   )\r\n   ```\r\n\r\n   This will scan `input.pdf` for words like \"confidential\" or \"john doe\", redact them with replacements from the JSON, and save the result as `output_redacted.pdf`.\r\n\r\n## Understanding the `process_pdf` Function\r\n\r\nThe `process_pdf` function is the core entry point of the SpectrePDF library. It enables users to load a PDF document, perform OCR-based detection of specific words or phrases, and then either annotate the document by drawing boxes around detected elements or redact sensitive information. 
The process involves rendering PDF pages as high-resolution images, running OCR to extract text data, grouping words logically, and applying modifications before saving a new PDF.\r\n\r\nThis function is highly configurable, allowing for various modes of operation such as simple detection with annotations, targeted redaction with replacements, or secure blackouts. Below, I'll break down its purpose, workflow, and every parameter in detail to ensure you can use it effectively. All parameters are designed to balance flexibility, accuracy, and performance.\r\n\r\n#### Function Signature and Overview\r\n```python\r\ndef process_pdf(\r\n    input_pdf: str,\r\n    output_pdf: str,\r\n    target_words: list,\r\n    redaction_json_path: str,\r\n    show_all_boxes: bool = False,\r\n    only_target_boxes: bool = False,\r\n    redact_targets: bool = False,\r\n    black_redaction: bool = False,\r\n    tesseract_cmd: str = None,\r\n    dpi: int = 500,\r\n    verbosity: int = 1\r\n):\r\n    \"\"\"\r\n    Process a PDF to detect, optionally draw boxes around, or redact and replace specific target words using OCR.\r\n    \"\"\"\r\n```\r\n- **Core Workflow**:\r\n  1. Open the input PDF using PyMuPDF.\r\n  2. Render each page as an image at the specified DPI.\r\n  3. Perform OCR on each image using Tesseract to extract words, positions, and bounding boxes.\r\n  4. Collect and group words into lines based on vertical alignment.\r\n  5. Identify target words (case-insensitive partial matches) and merge adjacent targets if needed.\r\n  6. Depending on parameters, either draw boxes for annotation or apply redaction.\r\n  7. Convert modified images back to a PDF and save it.\r\n  8. Handle errors gracefully (e.g., file not found, invalid JSON) and provide timing metrics.\r\n\r\nThe function raises exceptions for invalid inputs (e.g., non-list `target_words`, missing files) to prevent silent failures. 
It supports multi-page PDFs and uses progress bars (`tqdm`) for user-friendly feedback.\r\n\r\nNow, let's discuss **all the options (parameters)** in detail, including their types, defaults, interactions, and best practices.\r\n\r\n#### 1. `input_pdf` (Required, Type: str)\r\n   - **Description**: The full path to the input PDF file you want to process. This can be any valid PDF, including scanned documents or those with embedded text (though OCR is always used for consistency).\r\n   - **Usage Notes**: Ensure the file exists and is readable. Relative or absolute paths are fine. If the path is invalid, a `FileNotFoundError` is raised.\r\n   - **Example**: `\"documents/confidential_report.pdf\"`\r\n   - **Interactions**: None specific, but higher-complexity PDFs (e.g., with images or tables) may benefit from increased `dpi` for better OCR results.\r\n   - **Best Practice**: Use absolute paths to avoid working directory issues in scripts.\r\n\r\n#### 2. `output_pdf` (Required, Type: str)\r\n   - **Description**: The full path where the processed PDF will be saved. This will be a new file containing the annotated or redacted pages.\r\n   - **Usage Notes**: The output is always a PDF generated from modified images, preserving the original layout as closely as possible. If the path's directory doesn't exist, it will be created implicitly by the file write operation.\r\n   - **Example**: `\"output/annotated_report.pdf\"`\r\n   - **Interactions**: Overwrites existing files without warning, so choose a unique name.\r\n   - **Best Practice**: Include descriptive suffixes like \"_redacted\" or \"_highlighted\" to distinguish outputs.\r\n\r\n#### 3. 
`target_words` (Required, Type: list)\r\n   - **Description**: A list of strings representing the words or phrases to detect as \"targets.\" Detection is case-insensitive and matches if any target substring appears in a word (e.g., \"conf\" would match \"confidential\").\r\n   - **Usage Notes**: Must be a list; single strings will raise a `ValueError`. Empty list is allowed but results in no targets (useful for modes like `show_all_boxes=True`).\r\n   - **Example**: `[\"confidential\", \"secret\", \"john doe\"]`\r\n   - **Interactions**: Targets are derived from or used alongside `redaction_json_path`. In redaction modes, only these words trigger changes. Combine with `only_target_boxes` to ignore non-targets.\r\n   - **Best Practice**: Use lowercase for consistency, as matching is lowercased. For phrases, include spaces (e.g., \"top secret\").\r\n\r\n#### 4. `redaction_json_path` (Required, Type: str)\r\n   - **Description**: Path to a JSON file containing a dictionary that maps target words (keys) to their replacement strings (values). This is used during redaction to overlay custom text.\r\n   - **Usage Notes**: The JSON must be a valid dict; invalid formats raise a `ValueError`. Keys should match `target_words` (case-insensitive). Even if not redacting, this param is required but ignored in annotation modes.\r\n   - **Example**: `\"config/redaction.json\"` with content: `{\"confidential\": \"[REDACTED]\", \"john doe\": \"[ANONYMIZED]\"}`\r\n   - **Interactions**: Essential for `redact_targets=True` (unless `black_redaction=True`, where replacements are skipped). If a target lacks a mapping, it's redacted without replacement (fallback behavior).\r\n   - **Best Practice**: Keep keys in sync with `target_words`. Use placeholders like \"[REDACTED]\" for compliance.\r\n\r\n#### 5. 
`show_all_boxes` (Optional, Default: False, Type: bool)\r\n   - **Description**: If True, draws black outline boxes around **all** detected words (targets and non-targets) for comprehensive annotation. Useful for debugging OCR accuracy or visualizing text extraction.\r\n   - **Usage Notes**: Only active when `redact_targets=False` (annotation mode). Overrides color distinctions\u2014everything gets black boxes.\r\n   - **Example**: Set to `True` for a full word-level markup.\r\n   - **Interactions**: Conflicts with redaction; if `redact_targets=True`, this is ignored. Combine with `only_target_boxes=False` for maximum coverage.\r\n   - **Best Practice**: Use for initial tests to verify word detection before applying targeted changes.\r\n\r\n#### 6. `only_target_boxes` (Optional, Default: False, Type: bool)\r\n   - **Description**: If True, processes (draws or redacts) only boxes containing target words, ignoring non-targets. This focuses operations on sensitive areas.\r\n   - **Usage Notes**: Applies in both annotation and redaction modes. Reduces processing overhead on large documents.\r\n   - **Example**: Set to `True` when you only care about highlighting/redacting specific terms.\r\n   - **Interactions**: In annotation mode without `show_all_boxes`, targets get blue boxes, non-targets are skipped. In redaction, only targets are modified.\r\n   - **Best Practice**: Enable for efficiency in targeted workflows; disable for broader analysis.\r\n\r\n#### 7. `redact_targets` (Optional, Default: False, Type: bool)\r\n   - **Description**: If True, switches to redaction mode: targets are covered (whited out or blacked out) and optionally replaced with text from the redaction dict. If False, defaults to annotation mode (drawing boxes).\r\n   - **Usage Notes**: Primary mode toggle. In redaction, boxes are filled; no outlines are drawn.\r\n   - **Example**: Set to `True` for privacy-focused tasks like anonymization.\r\n   - **Interactions**: Enables `black_redaction`. 
If True, `show_all_boxes` is ignored. Uses `redaction_json_path` for replacements.\r\n   - **Best Practice**: Combine with `black_redaction=False` for readable redactions (e.g., legal docs) or `black_redaction=True` for irreversible hiding.\r\n\r\n#### 8. `black_redaction` (Optional, Default: False, Type: bool)\r\n   - **Description**: If True **and** `redact_targets=True`, fills target boxes with solid black without any replacement text. This is for permanent, non-reversible redaction.\r\n   - **Usage Notes**: Ignores the redaction dict's replacements. Only affects targets.\r\n   - **Example**: Set to `True` for high-security scenarios where even placeholders are too revealing.\r\n   - **Interactions**: Requires `redact_targets=True`; otherwise ignored. No font sizing or text overlay occurs.\r\n   - **Best Practice**: Use sparingly, as it can disrupt document flow; test with lower DPI first.\r\n\r\n#### 9. `tesseract_cmd` (Optional, Default: None, Type: str)\r\n   - **Description**: Optional path to the Tesseract OCR executable. If None, assumes Tesseract is in your system's PATH.\r\n   - **Usage Notes**: Useful for custom installations or environments without PATH setup. Validates the path exists.\r\n   - **Example**: `\"C:\\\\Program Files\\\\Tesseract-OCR\\\\tesseract.exe\"` on Windows.\r\n   - **Interactions**: Affects OCR performance. If invalid, raises `FileNotFoundError`.\r\n   - **Best Practice**: Provide if Tesseract isn't globally available; otherwise, leave as None for simplicity.\r\n\r\n#### 10. `dpi` (Optional, Default: 500, Type: int)\r\n   - **Description**: The resolution (dots per inch) for rendering PDF pages as images before OCR. Higher values improve accuracy but increase processing time and memory use.\r\n   - **Usage Notes**: Minimum practical is ~200; 500 is a good balance for most documents. 
Affects image size and OCR quality.\r\n   - **Example**: `300` for faster runs, `600` for high-precision needs.\r\n   - **Interactions**: Directly impacts OCR reliability\u2014low DPI may miss small text. Scales with page complexity.\r\n   - **Best Practice**: Start at 500; adjust based on document font size and trial runs.\r\n\r\n#### 11. `verbosity` (Optional, Default: 1, Type: int)\r\n   - **Description**: Controls output logging level: 0 (silent, no prints), 1 (basic progress bars and timings), 2 (detailed logs like word counts, phrases).\r\n   - **Usage Notes**: Uses `print` for output and `tqdm` for progress (disabled at 0). Helps with monitoring long processes.\r\n   - **Example**: `2` for debugging, `0` for background scripts.\r\n   - **Interactions**: Affects all steps\u2014e.g., at 2, you'll see per-page details.\r\n   - **Best Practice**: Use 1 for most cases; 2 when troubleshooting OCR or grouping issues.\r\n\r\n#### Key Interactions and Modes Summary\r\n- **Annotation Mode** (`redact_targets=False`): Draws boxes. Use `show_all_boxes` for all words, `only_target_boxes` to focus.\r\n- **Redaction Mode** (`redact_targets=True`): Modifies targets. Toggle `black_redaction` for style.\r\n- **Common Pitfalls**: Ensure `target_words` matches redaction keys. Test with small PDFs first.\r\n- **Performance**: Processing time scales with page count, DPI, and document size. For large files, lower DPI or use verbosity=1.\r\n\r\n## Usage Examples\r\nSpectrePDF offers versatile modes. Here are some scenarios with code snippets:\r\n\r\n### 1. 
Detect and Highlight Target Words (Draw Blue Boxes)\r\nReview a PDF by outlining targets in blue.\r\n\r\n```python\r\nprocess_pdf(\r\n    input_pdf=\"report.pdf\",\r\n    output_pdf=\"highlighted.pdf\",\r\n    target_words=[\"sensitive\", \"private\"],\r\n    redaction_json_path=\"redaction.json\",  # Still needed, even if not redacting\r\n    show_all_boxes=False,  # Only draw around targets\r\n    redact_targets=False,  # Don't redact, just annotate\r\n    dpi=300,\r\n    verbosity=2\r\n)\r\n```\r\n\r\n**What Happens**: Pages are rendered, OCR detects words, blue boxes are drawn around matches like \"sensitive\", and the PDF is rebuilt.\r\n\r\n### 2. Redact with Custom Text Replacements\r\nAnonymize names or terms while keeping the document readable.\r\n\r\n```python\r\nprocess_pdf(\r\n    input_pdf=\"patient_records.pdf\",\r\n    output_pdf=\"anonymized.pdf\",\r\n    target_words=[\"patient name\", \"ssn\"],\r\n    redaction_json_path=\"redaction.json\",  # e.g., {\"patient name\": \"[PATIENT]\", \"ssn\": \"[REDACTED]\"}\r\n    redact_targets=True,\r\n    black_redaction=False,\r\n    only_target_boxes=True,  # Focus only on targets\r\n    dpi=500,\r\n    verbosity=1\r\n)\r\n```\r\n\r\n**What Happens**: Targets are whited out, replaced with fitting text (auto-sized font), and non-targets remain untouched.\r\n\r\n### 3. Blackout Redaction for Maximum Security\r\nIrreversibly hide sensitive info with black boxes.\r\n\r\n```python\r\nprocess_pdf(\r\n    input_pdf=\"classified_doc.pdf\",\r\n    output_pdf=\"blacked_out.pdf\",\r\n    target_words=[\"top secret\"],\r\n    redaction_json_path=\"redaction.json\",  # Replacement dict ignored in black mode\r\n    redact_targets=True,\r\n    black_redaction=True,\r\n    dpi=400,\r\n    verbosity=1\r\n)\r\n```\r\n\r\n**What Happens**: Matching words/phrases are filled with solid black\u2014no text overlay.\r\n\r\n### 4. 
Debug Mode: Draw Boxes Around All Words\r\nInspect OCR accuracy by outlining every detected word (black for all).\r\n\r\n```python\r\nprocess_pdf(\r\n    input_pdf=\"test.pdf\",\r\n    output_pdf=\"debug_all_boxes.pdf\",\r\n    target_words=[],  # No targets needed for this mode\r\n    redaction_json_path=\"redaction.json\",\r\n    show_all_boxes=True,\r\n    redact_targets=False,\r\n    verbosity=2\r\n)\r\n```\r\n\r\n**What Happens**: Every word gets a black outline, helping you verify layout and detection.\r\n\r\n### Advanced Tips\r\n- **Multi-Page PDFs**: Handles large documents efficiently with progress tracking.\r\n- **Font Handling**: Automatically scales replacement text to fit boxes; falls back to default font if \"arial.ttf\" is unavailable.\r\n- **Error Handling**: Raises informative exceptions for missing files, invalid inputs, or OCR failures.\r\n- **Performance**: Higher DPI improves accuracy but increases processing time. Start with 300-500 DPI.\r\n\r\n## Contributing\r\nWe welcome contributions! Fork the repo, create a branch, and submit a pull request. Ideas for improvements:\r\n- Support for more OCR languages.\r\n- Batch processing multiple PDFs.\r\n- GUI integration.\r\n\r\nReport issues on GitHub.\r\n\r\n## License\r\nMIT License. See [LICENSE](LICENSE) for details.\r\n\r\n## Acknowledgments\r\n- Powered by PyMuPDF, Tesseract, and Pillow.\r\n- Inspired by real-world needs for secure document handling.\r\n\r\nGet started with SpectrePDF today and take control of your PDF privacy! If you have questions, check the code or open an issue. \ud83d\ude80\r\n",
    "bugtrack_url": null,
    "license": null,
    "summary": "A tool for processing and redacting PDFs based on target words using OCR.",
    "version": "0.2.1",
    "project_urls": {
        "Homepage": "https://github.com/udiram/SpectrePDF/",
        "Issues": "https://github.com/udiram/SpectrePDF/issues"
    },
    "split_keywords": [
        "pdf",
        " redaction",
        " ocr",
        " security",
        " privacy",
        " text extraction"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "80ba59c0db36d6d6269f1d972788a08fd83af4554864ef6000f2e9798d93e15f",
                "md5": "8298aafc3b74404a2f367f3413f306b6",
                "sha256": "51b3e4ea260984bdf69a673e3a17f47b32c67c7a81fcbb12ea7b3f830e8762ce"
            },
            "downloads": -1,
            "filename": "spectrepdf-0.2.1-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "8298aafc3b74404a2f367f3413f306b6",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.11",
            "size": 12667,
            "upload_time": "2025-07-14T17:57:13",
            "upload_time_iso_8601": "2025-07-14T17:57:13.199050Z",
            "url": "https://files.pythonhosted.org/packages/80/ba/59c0db36d6d6269f1d972788a08fd83af4554864ef6000f2e9798d93e15f/spectrepdf-0.2.1-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "7c313bb6ea554b61b5dc48f2012d6f2db33c274d2363f07627adfc9c22f53b56",
                "md5": "cd4433ed7bc64395f7866e6b673d994d",
                "sha256": "8f185369dbfa38874f48b5e472a37dcfea9b873d45f5a0c12438f5cc18293a15"
            },
            "downloads": -1,
            "filename": "spectrepdf-0.2.1.tar.gz",
            "has_sig": false,
            "md5_digest": "cd4433ed7bc64395f7866e6b673d994d",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.11",
            "size": 12957,
            "upload_time": "2025-07-14T17:57:14",
            "upload_time_iso_8601": "2025-07-14T17:57:14.690212Z",
            "url": "https://files.pythonhosted.org/packages/7c/31/3bb6ea554b61b5dc48f2012d6f2db33c274d2363f07627adfc9c22f53b56/spectrepdf-0.2.1.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-07-14 17:57:14",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "udiram",
    "github_project": "SpectrePDF",
    "github_not_found": true,
    "lcname": "spectrepdf"
}
        