allyanonimiser

Name	allyanonimiser
Version	2.4.0
home_page	https://github.com/srepho/Allyanonimiser
Summary	Australian-focused PII detection and anonymization for the insurance industry
upload_time	2025-07-25 06:53:26
maintainer	None
docs_url	None
author	Stephen Oates
requires_python	>=3.10
license	MIT License Copyright (c) 2025 Stephen Oates Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
keywords	pii anonymization privacy insurance australia
requirements	spacy presidio-analyzer presidio-anonymizer pandas tqdm pytest pytest-cov black isort flake8 mypy

# Allyanonimiser

[![PyPI version](https://img.shields.io/badge/pypi-v2.4.0-blue)](https://pypi.org/project/allyanonimiser/2.4.0/)
[![Python Versions](https://img.shields.io/pypi/pyversions/allyanonimiser.svg)](https://pypi.org/project/allyanonimiser/)
[![Tests](https://github.com/srepho/Allyanonimiser/actions/workflows/tests.yml/badge.svg)](https://github.com/srepho/Allyanonimiser/actions/workflows/tests.yml)
[![Coverage](https://codecov.io/gh/srepho/Allyanonimiser/branch/main/graph/badge.svg)](https://codecov.io/gh/srepho/Allyanonimiser)
[![Package](https://github.com/srepho/Allyanonimiser/actions/workflows/package.yml/badge.svg)](https://github.com/srepho/Allyanonimiser/actions/workflows/package.yml)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![Documentation](https://img.shields.io/badge/docs-online-brightgreen.svg)](https://srepho.github.io/Allyanonimiser/)

Australian-focused PII detection and anonymization for the insurance industry with support for stream processing of very large files.

📖 **[Read the full documentation](https://srepho.github.io/Allyanonimiser/)**

## Version 2.4.0 - Enhanced spaCy Integration & Setup Verification

### What's New in v2.4.0
- **Enhanced spaCy Status Reporting**: Clear visual feedback when loading spaCy models with installation guidance
- **New `check_spacy_status()` Method**: Programmatically check spaCy configuration and get recommendations
- **Setup Verification Script**: New `verify_setup.py` script to check all dependencies and configurations
- **Improved Documentation**: Clearer guidance on spaCy requirements and their impact on functionality
- **Better Error Messages**: More helpful feedback when spaCy models are missing or misconfigured

### Previous Version (v2.3.0)
- **Enhanced False Positive Filtering**: Comprehensive filtering for PERSON and LOCATION entities eliminates common misdetections
- **Improved Pattern Detection**: Fixed BSB/Account Number detection, enhanced Organization detection for Pty Ltd companies
- **NAME_CONSULTANT Pattern**: New pattern for detecting consultant/agent names with proper boundary detection
- **Refined Vehicle Registration**: More accurate detection with reduced false positives from all-caps text
- **Better Medicare Support**: Fixed Medicare number detection and validation
- **Enhanced Date Handling**: Improved date validation to avoid false positives (e.g., "NSW 2000" no longer detected as DATE)
- **Service Number Detection**: Added support for Australian service numbers (1300, 1800, 13xx)
- **Context-Aware Detection**: New context analyzer improves entity detection accuracy
- **Multiple Entity Masking**: Robust support for masking multiple entity types simultaneously

## Installation

```bash
# Basic installation
pip install allyanonimiser==2.4.0

# With stream processing support for large files
pip install "allyanonimiser[stream]==2.4.0"

# With LLM integration for advanced pattern generation
pip install "allyanonimiser[llm]==2.4.0"

# Complete installation with all optional dependencies
pip install "allyanonimiser[stream,llm]==2.4.0"
```

**Prerequisites:**
- Python 3.10 or higher
- A spaCy language model for Named Entity Recognition (NER):
  
  ```bash
  # Recommended - Best accuracy (788 MB)
  python -m spacy download en_core_web_lg
  
  # Alternative - Smaller size (44 MB)
  python -m spacy download en_core_web_sm
  ```
  
  > **Note**: spaCy models enable detection of PERSON, ORGANIZATION, LOCATION, and DATE entities. 
  > Without a spaCy model, pattern-based detection (emails, phones, IDs, etc.) will still work.

### Verify Your Setup

After installation, verify that everything is properly configured:

```bash
python verify_setup.py
```

Or check spaCy status programmatically:

```python
from allyanonimiser import create_allyanonimiser

ally = create_allyanonimiser()
status = ally.check_spacy_status()

if not status['is_loaded']:
    print(status['recommendation'])
else:
    print(f"✓ spaCy model loaded: {status['model_name']}")
```
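Models fetched with `python -m spacy download` are installed as ordinary Python packages, so their presence can also be checked without loading anything. The following is a minimal sketch using only the standard library. The helper name `installed_spacy_models` is hypothetical and not part of allyanonimiser, and it only looks for the two model names mentioned above.

```python
import importlib.util

def installed_spacy_models(candidates=("en_core_web_lg", "en_core_web_sm")):
    """Return the candidate model packages that are importable in this environment."""
    # find_spec returns None when a top-level package cannot be found
    return [name for name in candidates if importlib.util.find_spec(name) is not None]

found = installed_spacy_models()
if found:
    print("Installed spaCy models:", ", ".join(found))
else:
    print("No spaCy model found; run: python -m spacy download en_core_web_lg")
```

This does not replace `check_spacy_status()`, which reports what the library itself has loaded.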

## Quick Start

```python
from allyanonimiser import create_allyanonimiser

# Create the Allyanonimiser instance with default settings
ally = create_allyanonimiser()

# Analyze text
results = ally.analyze(
    text="Please reference your policy AU-12345678 for claims related to your vehicle rego XYZ123."
)

# Print results
for result in results:
    print(f"Entity: {result.entity_type}, Text: {result.text}, Score: {result.score}")

# Anonymize text
anonymized = ally.anonymize(
    text="Please reference your policy AU-12345678 for claims related to your vehicle rego XYZ123.",
    operators={
        "POLICY_NUMBER": "mask",  # Replace with asterisks
        "VEHICLE_REGISTRATION": "replace"  # Replace with <VEHICLE_REGISTRATION>
    }
)

print(anonymized["text"])
# Output: "Please reference your policy ********** for claims related to your vehicle rego <VEHICLE_REGISTRATION>."
```

### Adding Custom Patterns

```python
from allyanonimiser import create_allyanonimiser

# Create an Allyanonimiser instance
ally = create_allyanonimiser()

# Add a custom pattern with regex
ally.add_pattern({
    "entity_type": "REFERENCE_CODE",
    "patterns": [r"REF-\d{6}-[A-Z]{2}", r"Reference\s+#\d{6}"],
    "context": ["reference", "code", "ref"],
    "name": "Reference Code"
})

# Generate a pattern from examples
ally.create_pattern_from_examples(
    entity_type="EMPLOYEE_ID",
    examples=["EMP00123", "EMP45678", "EMP98765"],
    context=["employee", "staff", "id"],
    generalization_level="medium"  # Options: none, low, medium, high
)

# Test custom patterns
text = "Employee EMP12345 created REF-123456-AB for the project."
results = ally.analyze(text)
for result in results:
    print(f"Found {result.entity_type}: {result.text}")
```

## Built-in Pattern Reference

### Australian Patterns

| Entity Type | Description | Example Pattern | Pattern File |
|-------------|-------------|----------------|-------------|
| AU_TFN | Australian Tax File Number | `\b\d{3}\s*\d{3}\s*\d{3}\b` | au_patterns.py |
| AU_PHONE | Australian Phone Number | `\b(?:\+?61\|0)4\d{2}\s*\d{3}\s*\d{3}\b` | au_patterns.py |
| AU_MEDICARE | Australian Medicare Number | `\b\d{4}\s*\d{5}\s*\d{1}\b` | au_patterns.py |
| AU_DRIVERS_LICENSE | Australian Driver's License | Various formats including<br>`\b\d{8}\b` (NSW)<br>`\b\d{4}[a-zA-Z]{2}\b` (NSW legacy) | au_patterns.py |
| AU_ADDRESS | Australian Address | Address patterns with street names | au_patterns.py |
| AU_POSTCODE | Australian Postcode | `\b\d{4}\b` | au_patterns.py |
| AU_BSB_ACCOUNT | Australian BSB and Account Number | `\b\d{3}-\d{3}\s*\d{6,10}\b` | au_patterns.py |
| AU_ABN | Australian Business Number | `\b\d{2}\s*\d{3}\s*\d{3}\s*\d{3}\b` | au_patterns.py |
| AU_PASSPORT | Australian Passport Number | `\b[A-Za-z]\d{8}\b` | au_patterns.py |

### General Patterns

| Entity Type | Description | Example Pattern | Pattern File |
|-------------|-------------|----------------|-------------|
| CREDIT_CARD | Credit Card Number | `\b\d{4}[\s-]\d{4}[\s-]\d{4}[\s-]\d{4}\b` | general_patterns.py |
| PERSON | Person Name | Name patterns with context | general_patterns.py |
| EMAIL_ADDRESS | Email Address | `\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z\|a-z]{2,}\b` | general_patterns.py |
| DATE_OF_BIRTH | Date of Birth | `\bDOB:\s*\d{1,2}[/.-]\d{1,2}[/.-]\d{2,4}\b` | general_patterns.py |
| LOCATION | Location | City and location patterns | general_patterns.py |
| DATE | Date | `\b\d{1,2}[/.-]\d{1,2}[/.-]\d{2,4}\b` | general_patterns.py |
| MONETARY_VALUE | Money Amount | `\$\d{1,3}(?:,\d{3})*(?:\.\d{2})?\b` | general_patterns.py |
| ORGANIZATION | Organization | Organization name patterns | general_patterns.py |

### Insurance Patterns

| Entity Type | Description | Example Pattern | Pattern File |
|-------------|-------------|----------------|-------------|
| INSURANCE_POLICY_NUMBER | Insurance Policy Number | `\b(?:POL\|P\|Policy)[- ]?\d{6,9}\b` | insurance_patterns.py |
| INSURANCE_CLAIM_NUMBER | Insurance Claim Number | `\b(?:CL\|C)[- ]?\d{6,9}\b` | insurance_patterns.py |
| INSURANCE_MEMBER_NUMBER | Insurance Member Number | Member ID patterns | insurance_patterns.py |
| INSURANCE_GROUP_NUMBER | Group Policy Number | Group policy patterns | insurance_patterns.py |
| VEHICLE_IDENTIFIER | Vehicle ID (VIN, plates) | `\b[A-HJ-NPR-Z0-9]{17}\b` | insurance_patterns.py |
| CASE_REFERENCE | Case Reference Numbers | Case ID patterns | insurance_patterns.py |
| VEHICLE_DETAILS | Vehicle Details | Make/model patterns | insurance_patterns.py |

## Features

- **Australian-Focused PII Detection**: Specialized patterns for TFNs, Medicare numbers, vehicle registrations, addresses, and more
- **Insurance Industry Specialization**: Detect policy numbers, claim references, and other industry-specific identifiers
- **Multiple Entity Types**: Comprehensive detection of general and specialized PII
- **Flexible Anonymization**: Multiple anonymization operators (replace, mask, redact, hash, and more)
- **Stream Processing**: Memory-efficient processing of large files with chunking support
- **Reporting System**: Comprehensive tracking and visualization of anonymization activities
- **Jupyter Integration**: Rich visualization capabilities in notebook environments
- **DataFrame Support**: Process pandas DataFrames with batch processing and multi-processing support
- **Robust Detection**: Enhanced validation and context-aware detection to reduce false positives
- **Overlapping Entity Handling**: Smart resolution of overlapping entities for clean anonymization
- **Pattern Loading**: Automatic loading of all default patterns (Australian, Insurance, General)
- **Improved Medicare Detection**: Fixed detection and validation for Australian Medicare numbers
- **Multiple Entity Masking**: Simultaneous masking of multiple entity types with different operators
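Overlapping detections arise naturally with regex-based patterns; for example, the last nine digits of an ABN also look like a TFN. The sketch below shows one common resolution strategy (prefer the higher score, then the longer span). It is an assumption for illustration, not a description of how allyanonimiser resolves overlaps internally; `Span` and `resolve_overlaps` are hypothetical names.

```python
from typing import NamedTuple

class Span(NamedTuple):
    start: int
    end: int          # exclusive
    entity_type: str
    score: float

def resolve_overlaps(spans):
    """Keep non-overlapping spans, preferring higher score, then greater length."""
    ranked = sorted(spans, key=lambda s: (-s.score, -(s.end - s.start), s.start))
    kept = []
    for span in ranked:
        # A span is accepted only if it does not intersect anything already kept
        if all(span.end <= k.start or span.start >= k.end for k in kept):
            kept.append(span)
    return sorted(kept, key=lambda s: s.start)

text = "ABN 11 222 333 444 NSW 2000"
candidates = [
    Span(4, 18, "AU_ABN", 0.9),       # "11 222 333 444"
    Span(7, 18, "AU_TFN", 0.9),       # "222 333 444" (inside the ABN)
    Span(23, 27, "AU_POSTCODE", 0.5), # "2000"
]
for span in resolve_overlaps(candidates):
    print(span.entity_type, repr(text[span.start:span.end]))
```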

## Supported Entity Types

Allyanonimiser supports **38 different entity types** across four categories:

### Complete Entity Type Reference

<details>
<summary><b>🇦🇺 Australian-Specific Entities (13 types)</b></summary>

| Entity Type | Description | Example |
|-------------|-------------|---------|
| AU_TFN | Tax File Number | 123 456 789 |
| AU_PHONE | Phone Number | 0412 345 678, (02) 9876 5432 |
| AU_MEDICARE | Medicare Number | 2123 45678 1 |
| AU_DRIVERS_LICENSE | Driver's License | VIC1234567, NSW98765 |
| AU_ADDRESS | Street Address | 123 Collins St, Melbourne VIC 3000 |
| AU_POSTCODE | Postcode | 2000, 3000, 4000 |
| AU_BSB | Bank State Branch | 062-000, 123-456 |
| AU_ACCOUNT_NUMBER | Bank Account | 1234567890 |
| AU_ABN | Business Number | 11 222 333 444 |
| AU_ACN | Company Number | 123 456 789 |
| AU_PASSPORT | Passport Number | PA1234567, AB9876543 |
| AU_CENTRELINK_CRN | Centrelink Reference | 123 456 789A |
| VEHICLE_REGISTRATION | Vehicle Rego | ABC123, 1ABC23 |

</details>

<details>
<summary><b>🏢 Insurance-Specific Entities (8 types)</b></summary>

| Entity Type | Description | Example |
|-------------|-------------|---------|
| INSURANCE_POLICY_NUMBER | Policy Number | POL-12345678, AU-98765432 |
| INSURANCE_CLAIM_NUMBER | Claim Number | CL-23456789, C-987654 |
| VEHICLE_VIN | Vehicle ID Number | 1HGCM82633A123456 |
| INVOICE_NUMBER | Invoice/Quote | INV-2024001, Q-5678 |
| BROKER_CODE | Broker Code | BRK-1234 |
| VEHICLE_DETAILS | Vehicle Description | 2022 Toyota Camry |
| INCIDENT_DATE | Date of Incident | on 15/03/2024 |
| NAME_CONSULTANT | Consultant Name | Assigned To: Sarah Johnson |

</details>

<details>
<summary><b>👤 General PII Entities (8 types)</b></summary>

| Entity Type | Description | Example |
|-------------|-------------|---------|
| CREDIT_CARD | Credit Card | 4111 1111 1111 1111 |
| PERSON | Person Name | John Smith, Dr. Sarah O'Connor |
| EMAIL_ADDRESS | Email | john@example.com |
| DATE_OF_BIRTH | Date of Birth | DOB: 01/01/1990 |
| LOCATION | Location | Sydney, New South Wales |
| DATE | General Date | 15/03/2024, March 15, 2024 |
| MONEY_AMOUNT | Money Amount | $1,234.56 |
| ORGANIZATION | Organization | ABC Pty Ltd, XYZ Limited |

</details>

<details>
<summary><b>🤖 Additional spaCy NER Entities (9 types)</b></summary>

| Entity Type | Description | Example |
|-------------|-------------|---------|
| NUMBER | Numeric Value | 42, 1234 |
| TIME | Time Expression | 3:30 PM, 14:45 |
| PERCENT | Percentage | 25%, 99.9% |
| PRODUCT | Product Name | iPhone 15, Windows 11 |
| EVENT | Event Name | Olympic Games |
| WORK_OF_ART | Title | "The Great Gatsby" |
| LAW | Legal Document | Privacy Act 1988 |
| LANGUAGE | Language | English, Spanish |
| FACILITY | Building/Airport | Sydney Opera House |

</details>

### Quick Entity Detection Test

```python
from collections import Counter

from allyanonimiser import create_allyanonimiser

ally = create_allyanonimiser()

# Test text with multiple entity types
test_text = """
Customer: John Smith (TFN: 123 456 789)
Phone: 0412 345 678
Email: john.smith@example.com
Policy: POL-12345678
Address: 123 Collins St, Melbourne VIC 3000
Payment: $1,234.56 via Credit Card 4111 1111 1111 1111
"""

results = ally.analyze(test_text)

# Display detected entities by type
entity_types = Counter(r.entity_type for r in results)
print(f"Found {len(results)} entities across {len(entity_types)} types:")
for entity_type, count in entity_types.most_common():
    print(f"  {entity_type}: {count}")
```

## Usage Examples

### Analyze Text for PII Entities

```python
from allyanonimiser import create_allyanonimiser

ally = create_allyanonimiser()

# Analyze text
results = ally.analyze(
    text="Customer John Smith (TFN: 123 456 789) reported an incident on 15/06/2023 at his residence in Sydney NSW 2000."
)

# Print detected entities
for result in results:
    print(f"Entity: {result.entity_type}, Text: {result.text}, Score: {result.score}")
```

### Anonymize Text with Different Operators

```python
from allyanonimiser import create_allyanonimiser, AnonymizationConfig

ally = create_allyanonimiser()

# Using configuration object
config = AnonymizationConfig(
    operators={
        "PERSON": "replace",           # Replace with <PERSON>
        "AU_TFN": "hash",              # Replace with SHA-256 hash
        "DATE": "redact",              # Replace with [REDACTED]
        "AU_ADDRESS": "mask",          # Replace with *****
        "DATE_OF_BIRTH": "age_bracket" # Replace with age bracket (e.g., <40-45>)
    },
    age_bracket_size=5  # Size of age brackets
)

# Anonymize text
anonymized = ally.anonymize(
    text="Customer John Smith (TFN: 123 456 789) reported an incident on 15/06/2023. He was born on 05/08/1982 and lives at 42 Main St, Sydney NSW 2000.",
    config=config
)

print(anonymized["text"])
```

### Process Text with Analysis and Anonymization

```python
from allyanonimiser import create_allyanonimiser, AnalysisConfig, AnonymizationConfig

ally = create_allyanonimiser()

# Configure analysis
analysis_config = AnalysisConfig(
    active_entity_types=["PERSON", "EMAIL_ADDRESS", "PHONE_NUMBER", "DATE_OF_BIRTH"],
    min_score_threshold=0.7
)

# Configure anonymization
anonymization_config = AnonymizationConfig(
    operators={
        "PERSON": "replace",
        "EMAIL_ADDRESS": "mask",
        "PHONE_NUMBER": "redact",
        "DATE_OF_BIRTH": "age_bracket"
    }
)

# Process text (analyze + anonymize)
result = ally.process(
    text="Customer Jane Doe (jane.doe@example.com) called on 0412-345-678 regarding her DOB: 22/05/1990.",
    analysis_config=analysis_config,
    anonymization_config=anonymization_config
)

# Access different parts of the result
print("Anonymized text:")
print(result["anonymized"])

print("\nDetected entities:")
for entity in result["analysis"]["entities"]:
    print(f"{entity['entity_type']}: {entity['text']} (score: {entity['score']:.2f})")

print("\nPII-rich segments:")
for segment in result["segments"]:
    print(f"Original: {segment['text']}")
    print(f"Anonymized: {segment['anonymized']}")
```

### Working with DataFrames

```python
import pandas as pd
from allyanonimiser import create_allyanonimiser

# Create DataFrame
df = pd.DataFrame({
    "id": [1, 2, 3],
    "notes": [
        "Customer John Smith (DOB: 15/6/1980) called about policy POL-123456.",
        "Jane Doe (email: jane.doe@example.com) requested a refund.",
        "Alex Johnson from Sydney NSW 2000 reported an incident."
    ]
})

# Create Allyanonimiser
ally = create_allyanonimiser()

# Anonymize a specific column
anonymized_df = ally.process_dataframe(
    df, 
    column="notes", 
    operation="anonymize",
    output_column="anonymized_notes",  # New column for anonymized text
    operators={
        "PERSON": "replace",
        "EMAIL_ADDRESS": "mask",
        "PHONE_NUMBER": "redact"
    }
)

# Display result
print(anonymized_df[["id", "notes", "anonymized_notes"]])
```

## Pattern Generalization Levels

When creating patterns from examples using `create_pattern_from_examples()`, you can specify a generalization level that controls how flexible the resulting pattern will be:

| Level | Description | Example Input | Generated Pattern | Will Match |
|-------|-------------|--------------|------------------|------------|
| `none` | Exact matching only | "EMP12345" | `\bEMP12345\b` | Only "EMP12345" |
| `low` | Basic generalization | "EMP12345" | `\bEMP\d{5}\b` | "EMP12345", "EMP67890", but not "EMP123" |
| `medium` | Moderate flexibility | "EMP12345" | `\bEMP\d+\b` | "EMP12345", "EMP123", "EMP9", but not "EMPLOYEE12345" |
| `high` | Maximum flexibility | "EMP12345" | `\b[A-Z]{3}\d+\b` | "EMP12345", "ABC123", etc. |

Higher generalization levels detect more variants but may increase false positives. Choose the appropriate level based on your needs for precision vs. recall.
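The trade-off can be checked directly by running the four generated patterns from the table against a handful of candidate strings. This sketch uses only the standard `re` module and the regexes shown above; it does not call `create_pattern_from_examples()`, and `matches_by_level` is a hypothetical helper name.

```python
import re

# Generated patterns as listed in the table above
LEVEL_PATTERNS = {
    "none": r"\bEMP12345\b",
    "low": r"\bEMP\d{5}\b",
    "medium": r"\bEMP\d+\b",
    "high": r"\b[A-Z]{3}\d+\b",
}

candidates = ["EMP12345", "EMP67890", "EMP123", "ABC123", "EMPLOYEE12345"]

def matches_by_level(strings):
    """Map each generalization level to the candidate strings its pattern matches."""
    return {
        level: [s for s in strings if re.search(pattern, s)]
        for level, pattern in LEVEL_PATTERNS.items()
    }

for level, matched in matches_by_level(candidates).items():
    print(f"{level:>6}: {matched}")
```

Each level matches a superset of the previous one, which is where the extra recall and the extra false positives both come from.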

## Anonymization Operators

The package supports several anonymization operators that control how detected entities are transformed:

| Operator | Description | Example | Result |
|----------|-------------|---------|--------|
| `replace` | Replace with entity type | "John Smith" | `<PERSON>` |
| `redact` | Fully redact the entity | "john.smith@example.com" | `[REDACTED]` |
| `mask` | Partially mask while preserving structure | "john.smith@example.com" | `j***.s****@e******.com` |
| `hash` | Replace with consistent hash | "John Smith" | `7f9d6a...` (same for all "John Smith") |
| `encrypt` | Encrypt with a key (recoverable) | "John Smith" | `AES256:a7f9c...` |
| `age_bracket` | Convert dates to age brackets | "DOB: 15/03/1980" | `DOB: 40-44` |
| `custom` | User-defined function | (depends on function) | (custom output) |

Example usage:

```python
from allyanonimiser import create_allyanonimiser

ally = create_allyanonimiser()

result = ally.anonymize(
    text="Please contact John Smith at john.smith@example.com",
    operators={
        "PERSON": "replace",       # Replace with entity type
        "EMAIL_ADDRESS": "mask",   # Partially mask the email
        "PHONE_NUMBER": "redact",  # Fully redact phone numbers
        "DATE_OF_BIRTH": "age_bracket"  # Convert DOB to age bracket
    }
)
```
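To make the operator semantics in the table concrete, here is a rough standalone sketch of four of them in plain Python. These functions are assumptions written for illustration and are not the library's implementations: the function names are hypothetical, the hash is assumed to be SHA-256, and the age bracket is computed against an explicitly supplied reference date.

```python
import hashlib
from datetime import date

def replace_op(text, entity_type):
    """Replace the entity with its type label, e.g. <PERSON>."""
    return f"<{entity_type}>"

def redact_op(text, entity_type):
    """Remove the value entirely."""
    return "[REDACTED]"

def hash_op(text, entity_type):
    """Deterministic digest: the same input always yields the same output."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def age_bracket_op(dob, today, bracket_size=5):
    """Convert a date of birth into an inclusive age bracket such as 40-44."""
    # Subtract one year if the birthday has not yet occurred in the reference year
    age = today.year - dob.year - ((today.month, today.day) < (dob.month, dob.day))
    low = (age // bracket_size) * bracket_size
    return f"{low}-{low + bracket_size - 1}"

print(replace_op("John Smith", "PERSON"))
print(redact_op("john.smith@example.com", "EMAIL_ADDRESS"))
print(hash_op("John Smith", "PERSON")[:12], "...")
print(age_bracket_op(date(1980, 3, 15), today=date(2024, 6, 1)))
```

Hashing keeps records linkable (all occurrences of one name share a digest), whereas redaction discards that link; which is preferable depends on whether downstream analysis needs to join on the anonymized value.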

### Custom Operator Example

The custom operator allows you to define your own transformation function:

```python
import hashlib

from allyanonimiser import create_allyanonimiser

# Create a custom transformation function
def randomize_names(entity_text, entity_type):
    """Replace person names with names picked deterministically from a predefined list."""
    if entity_type != "PERSON":
        return entity_text

    # Simple list of replacement names
    replacements = ["Alex Taylor", "Sam Johnson", "Jordan Lee", "Casey Brown"]

    # Use a hash of the original name so the same input always maps to the same replacement
    hash_val = int(hashlib.md5(entity_text.encode()).hexdigest(), 16)
    index = hash_val % len(replacements)

    return replacements[index]

# Create an Allyanonimiser instance
ally = create_allyanonimiser()

# Use the custom operator
result = ally.anonymize(
    text="Customer John Smith sent an email to Mary Johnson about policy POL-123456.",
    operators={
        "PERSON": randomize_names,  # Pass the function directly
        "POLICY_NUMBER": "mask"     # Other operators work as usual
    }
)

print(result["text"])
# Output: "Customer Alex Taylor sent an email to Sam Johnson about policy ***-******."
```

Custom operators are powerful for specialized anonymization needs like:
- Generating synthetic but realistic replacements
- Contextual anonymization based on entity values
- Domain-specific transformations (e.g., preserving data distributions)
- Implementing differential privacy mechanisms

### Pattern Management

```python
from allyanonimiser import create_allyanonimiser

# Create an instance
ally = create_allyanonimiser()

# 1. Adding pattern to an existing group in pattern files
# If you want to contribute a new pattern to the codebase,
# edit the appropriate file in patterns/ directory:
#  - patterns/au_patterns.py: For Australian-specific patterns
#  - patterns/general_patterns.py: For general PII patterns 
#  - patterns/insurance_patterns.py: For insurance-specific patterns

# 2. Using custom patterns without modifying code
# Add a custom pattern with detailed options
ally.add_pattern({
    "entity_type": "COMPANY_PROJECT_ID",
    "patterns": [
        r"PRJ-\d{4}-[A-Z]{3}",       # Format: PRJ-1234-ABC
        r"Project\s+ID:\s*(\d{4})"   # Format: Project ID: 1234
    ],
    "context": ["project", "id", "identifier", "code"],
    "name": "Company Project ID",
    "score": 0.85,                   # Confidence score (0-1)
    "language": "en",                # Language code
    "description": "Internal project identifier format"
})

# 3. Save patterns for reuse
ally.export_config("company_patterns.json")

# 4. Load saved patterns in another session
new_ally = create_allyanonimiser(settings_path="company_patterns.json")

# 5. Pattern testing and validation
from allyanonimiser.validators import test_pattern_against_examples

# Test if a pattern works against examples
results = test_pattern_against_examples(
    pattern=r"PRJ-\d{4}-[A-Z]{3}",
    positive_examples=["PRJ-1234-ABC", "PRJ-5678-XYZ"],
    negative_examples=["PRJ-123-AB", "PROJECT-1234"]
)
print(f"Pattern is valid: {results['is_valid']}")
print(f"Diagnostic info: {results['message']}")
```
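The idea behind step 5 can be approximated with the standard library alone: a pattern is acceptable when it matches every positive example and none of the negative ones. The sketch below is a hypothetical stand-in, not the implementation of `test_pattern_against_examples`; the `check_pattern` name, the use of `re.fullmatch`, and the returned keys other than `is_valid` are assumptions.

```python
import re

def check_pattern(pattern, positive_examples, negative_examples):
    """Report which examples the pattern gets wrong (full-string matching)."""
    compiled = re.compile(pattern)
    missed = [e for e in positive_examples if not compiled.fullmatch(e)]
    false_hits = [e for e in negative_examples if compiled.fullmatch(e)]
    return {
        "is_valid": not missed and not false_hits,
        "missed": missed,          # positives the pattern failed to match
        "false_hits": false_hits,  # negatives the pattern wrongly matched
    }

report = check_pattern(
    r"PRJ-\d{4}-[A-Z]{3}",
    positive_examples=["PRJ-1234-ABC", "PRJ-5678-XYZ"],
    negative_examples=["PRJ-123-AB", "PROJECT-1234"],
)
print(report)
```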

## Advanced Features

### Generating Reports

```python
from allyanonimiser import create_allyanonimiser
import os

# Create output directory
os.makedirs("output", exist_ok=True)

# Create an Allyanonimiser instance
ally = create_allyanonimiser()

# Start a new report session
ally.start_new_report(session_id="example_session")

# Configure anonymization
anonymization_config = {
    "operators": {
        "PERSON": "replace",
        "EMAIL_ADDRESS": "mask",
        "PHONE_NUMBER": "redact",
        "AU_ADDRESS": "replace",
        "DATE_OF_BIRTH": "age_bracket",
        "AU_TFN": "hash",
        "AU_MEDICARE": "mask"
    },
    "age_bracket_size": 10
}

# Process a batch of files
result = ally.process_files(
    file_paths=["data/sample1.txt", "data/sample2.txt", "data/sample3.txt"],
    output_dir="output/anonymized",
    anonymize=True,
    operators=anonymization_config["operators"],
    report=True,
    report_output="output/report.html",
    report_format="html"
)

# Display summary
print(f"Processed {result['total_files']} files")
print(f"Detected {result['report']['total_entities']} entities")
print(f"Average processing time: {result['report']['report']['avg_processing_time']*1000:.2f} ms")
```

### Comprehensive Reporting System

Allyanonimiser includes a comprehensive reporting system that allows you to track, analyze, and visualize anonymization activities.

```python
from allyanonimiser import create_allyanonimiser

# Create instance
ally = create_allyanonimiser()

# Start a new report session
ally.start_new_report("my_session")

# Process multiple texts
texts = [
    "Customer John Smith (DOB: 15/06/1980) called about claim CL-12345.",
    "Jane Doe (email: jane.doe@example.com) requested policy information.",
    "Claims assessor reviewed case for Robert Johnson (ID: 987654321)."
]

for i, text in enumerate(texts):
    ally.anonymize(
        text=text,
        operators={
            "PERSON": "replace",
            "EMAIL_ADDRESS": "mask",
            "DATE_OF_BIRTH": "age_bracket"
        },
        document_id=f"doc_{i+1}"
    )

# Get report summary
report = ally.get_report()
summary = report.get_summary()

print(f"Total documents processed: {summary['total_documents']}")
print(f"Total entities detected: {summary['total_entities']}")
print(f"Entities per document: {summary['entities_per_document']:.2f}")
print(f"Anonymization rate: {summary['anonymization_rate']*100:.2f}%")
print(f"Average processing time: {summary['avg_processing_time']*1000:.2f} ms")

# Export report to different formats
report.export_report("report.html", "html")  # Rich HTML visualization
report.export_report("report.json", "json")  # Detailed JSON data
report.export_report("report.csv", "csv")    # CSV statistics
```

### In Jupyter Notebooks

```python
from allyanonimiser import create_allyanonimiser
import matplotlib.pyplot as plt

# Create an Allyanonimiser instance
ally = create_allyanonimiser()

# Start a report session and process some texts
# ... processing code ...

# Display rich interactive report
ally.display_report_in_notebook()

# Access report data programmatically
report = ally.get_report()
summary = report.get_summary()

# Create custom visualizations
entity_counts = summary['entity_counts']
plt.figure(figsize=(10, 6))
plt.bar(entity_counts.keys(), entity_counts.values())
plt.title('Entity Type Distribution')
plt.xlabel('Entity Type')
plt.ylabel('Count')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
```

## Changes in Version 2.1.0

- Added support for NSW legacy driver's license pattern
- Improved pattern recognition for Australian TFNs
- Enhanced handling of date formats
- Fixed issues with BSB recognition
- Added more comprehensive test suite for Australian patterns
- Performance improvements for large file processing

For older versions and detailed change history, see the [CHANGELOG.md](CHANGELOG.md) file.

## Documentation

For complete documentation, examples, and advanced usage, visit our [documentation site](https://srepho.github.io/Allyanonimiser/):

- [Installation Guide](https://srepho.github.io/Allyanonimiser/getting-started/installation/)
- [Quick Start Tutorial](https://srepho.github.io/Allyanonimiser/getting-started/quick-start/)
- [Pattern Reference](https://srepho.github.io/Allyanonimiser/patterns/overview/)
- [Anonymization Operators](https://srepho.github.io/Allyanonimiser/advanced/anonymization-operators/)
- [API Reference](https://srepho.github.io/Allyanonimiser/api/main/)

## License

This project is licensed under the MIT License - see the LICENSE file for details.

`\\bDOB:\\s*\\d{1,2}[/.-]\\d{1,2}[/.-]\\d{2,4}\\b` | general_patterns.py |\n| LOCATION | Location | City and location patterns | general_patterns.py |\n| DATE | Date | `\\b\\d{1,2}[/.-]\\d{1,2}[/.-]\\d{2,4}\\b` | general_patterns.py |\n| MONETARY_VALUE | Money Amount | `\\$\\d{1,3}(?:,\\d{3})*(?:\\.\\d{2})?\\b` | general_patterns.py |\n| ORGANIZATION | Organization | Organization name patterns | general_patterns.py |\n\n### Insurance Patterns\n\n| Entity Type | Description | Example Pattern | Pattern File |\n|-------------|-------------|----------------|-------------|\n| INSURANCE_POLICY_NUMBER | Insurance Policy Number | `\\b(?:POL\\|P\\|Policy)[- ]?\\d{6,9}\\b` | insurance_patterns.py |\n| INSURANCE_CLAIM_NUMBER | Insurance Claim Number | `\\b(?:CL\\|C)[- ]?\\d{6,9}\\b` | insurance_patterns.py |\n| INSURANCE_MEMBER_NUMBER | Insurance Member Number | Member ID patterns | insurance_patterns.py |\n| INSURANCE_GROUP_NUMBER | Group Policy Number | Group policy patterns | insurance_patterns.py |\n| VEHICLE_IDENTIFIER | Vehicle ID (VIN, plates) | `\\b[A-HJ-NPR-Z0-9]{17}\\b` | insurance_patterns.py |\n| CASE_REFERENCE | Case Reference Numbers | Case ID patterns | insurance_patterns.py |\n| VEHICLE_DETAILS | Vehicle Details | Make/model patterns | insurance_patterns.py |\n\n## Features\n\n- **Australian-Focused PII Detection**: Specialized patterns for TFNs, Medicare numbers, vehicle registrations, addresses, and more\n- **Insurance Industry Specialization**: Detect policy numbers, claim references, and other industry-specific identifiers\n- **Multiple Entity Types**: Comprehensive detection of general and specialized PII\n- **Flexible Anonymization**: Multiple anonymization operators (replace, mask, redact, hash, and more)\n- **Stream Processing**: Memory-efficient processing of large files with chunking support\n- **Reporting System**: Comprehensive tracking and visualization of anonymization activities\n- **Jupyter Integration**: Rich visualization capabilities in 
notebook environments\n- **DataFrame Support**: Process pandas DataFrames with batch processing and multi-processing support\n- **Robust Detection**: Enhanced validation and context-aware detection to reduce false positives\n- **Overlapping Entity Handling**: Smart resolution of overlapping entities for clean anonymization\n- **Pattern Loading**: Automatic loading of all default patterns (Australian, Insurance, General)\n- **Improved Medicare Detection**: Fixed detection and validation for Australian Medicare numbers\n- **Multiple Entity Masking**: Simultaneous masking of multiple entity types with different operators\n\n## Supported Entity Types\n\nAllyanonimiser supports **38 different entity types** across four categories:\n\n### Complete Entity Type Reference\n\n<details>\n<summary><b>\ud83c\udde6\ud83c\uddfa Australian-Specific Entities (13 types)</b></summary>\n\n| Entity Type | Description | Example |\n|-------------|-------------|---------|\n| AU_TFN | Tax File Number | 123 456 789 |\n| AU_PHONE | Phone Number | 0412 345 678, (02) 9876 5432 |\n| AU_MEDICARE | Medicare Number | 2123 45678 1 |\n| AU_DRIVERS_LICENSE | Driver's License | VIC1234567, NSW98765 |\n| AU_ADDRESS | Street Address | 123 Collins St, Melbourne VIC 3000 |\n| AU_POSTCODE | Postcode | 2000, 3000, 4000 |\n| AU_BSB | Bank State Branch | 062-000, 123-456 |\n| AU_ACCOUNT_NUMBER | Bank Account | 1234567890 |\n| AU_ABN | Business Number | 11 222 333 444 |\n| AU_ACN | Company Number | 123 456 789 |\n| AU_PASSPORT | Passport Number | PA1234567, AB9876543 |\n| AU_CENTRELINK_CRN | Centrelink Reference | 123 456 789A |\n| VEHICLE_REGISTRATION | Vehicle Rego | ABC123, 1ABC23 |\n\n</details>\n\n<details>\n<summary><b>\ud83c\udfe2 Insurance-Specific Entities (8 types)</b></summary>\n\n| Entity Type | Description | Example |\n|-------------|-------------|---------|\n| INSURANCE_POLICY_NUMBER | Policy Number | POL-12345678, AU-98765432 |\n| INSURANCE_CLAIM_NUMBER | Claim Number | CL-23456789, C-987654 
|\n| VEHICLE_VIN | Vehicle ID Number | 1HGCM82633A123456 |\n| INVOICE_NUMBER | Invoice/Quote | INV-2024001, Q-5678 |\n| BROKER_CODE | Broker Code | BRK-1234 |\n| VEHICLE_DETAILS | Vehicle Description | 2022 Toyota Camry |\n| INCIDENT_DATE | Date of Incident | on 15/03/2024 |\n| NAME_CONSULTANT | Consultant Name | Assigned To: Sarah Johnson |\n\n</details>\n\n<details>\n<summary><b>\ud83d\udc64 General PII Entities (8 types)</b></summary>\n\n| Entity Type | Description | Example |\n|-------------|-------------|---------|\n| CREDIT_CARD | Credit Card | 4111 1111 1111 1111 |\n| PERSON | Person Name | John Smith, Dr. Sarah O'Connor |\n| EMAIL_ADDRESS | Email | john@example.com |\n| DATE_OF_BIRTH | Date of Birth | DOB: 01/01/1990 |\n| LOCATION | Location | Sydney, New South Wales |\n| DATE | General Date | 15/03/2024, March 15, 2024 |\n| MONEY_AMOUNT | Money Amount | $1,234.56 |\n| ORGANIZATION | Organization | ABC Pty Ltd, XYZ Limited |\n\n</details>\n\n<details>\n<summary><b>\ud83e\udd16 Additional spaCy NER Entities (9 types)</b></summary>\n\n| Entity Type | Description | Example |\n|-------------|-------------|---------|\n| NUMBER | Numeric Value | 42, 1234 |\n| TIME | Time Expression | 3:30 PM, 14:45 |\n| PERCENT | Percentage | 25%, 99.9% |\n| PRODUCT | Product Name | iPhone 15, Windows 11 |\n| EVENT | Event Name | Olympic Games |\n| WORK_OF_ART | Title | \"The Great Gatsby\" |\n| LAW | Legal Document | Privacy Act 1988 |\n| LANGUAGE | Language | English, Spanish |\n| FACILITY | Building/Airport | Sydney Opera House |\n\n</details>\n\n### Quick Entity Detection Test\n\n```python\nfrom allyanonimiser import create_allyanonimiser\n\nally = create_allyanonimiser()\n\n# Test text with multiple entity types\ntest_text = \"\"\"\nCustomer: John Smith (TFN: 123 456 789)\nPhone: 0412 345 678\nEmail: john.smith@example.com\nPolicy: POL-12345678\nAddress: 123 Collins St, Melbourne VIC 3000\nPayment: $1,234.56 via Credit Card 4111 1111 1111 1111\n\"\"\"\n\nresults = 
ally.analyze(test_text)\n\n# Display detected entities by type\nfrom collections import Counter\nentity_types = Counter(r.entity_type for r in results)\nprint(f\"Found {len(results)} entities across {len(entity_types)} types:\")\nfor entity_type, count in entity_types.most_common():\n    print(f\"  {entity_type}: {count}\")\n```\n\n## Usage Examples\n\n### Analyze Text for PII Entities\n\n```python\nfrom allyanonimiser import create_allyanonimiser\n\nally = create_allyanonimiser()\n\n# Analyze text\nresults = ally.analyze(\n    text=\"Customer John Smith (TFN: 123 456 789) reported an incident on 15/06/2023 at his residence in Sydney NSW 2000.\"\n)\n\n# Print detected entities\nfor result in results:\n    print(f\"Entity: {result.entity_type}, Text: {result.text}, Score: {result.score}\")\n```\n\n### Anonymize Text with Different Operators\n\n```python\nfrom allyanonimiser import create_allyanonimiser, AnonymizationConfig\n\nally = create_allyanonimiser()\n\n# Using configuration object\nconfig = AnonymizationConfig(\n    operators={\n        \"PERSON\": \"replace\",           # Replace with <PERSON>\n        \"AU_TFN\": \"hash\",              # Replace with SHA-256 hash\n        \"DATE\": \"redact\",              # Replace with [REDACTED]\n        \"AU_ADDRESS\": \"mask\",          # Replace with *****\n        \"DATE_OF_BIRTH\": \"age_bracket\" # Replace with age bracket (e.g., <40-45>)\n    },\n    age_bracket_size=5  # Size of age brackets\n)\n\n# Anonymize text\nanonymized = ally.anonymize(\n    text=\"Customer John Smith (TFN: 123 456 789) reported an incident on 15/06/2023. 
He was born on 05/08/1982 and lives at 42 Main St, Sydney NSW 2000.\",\n    config=config\n)\n\nprint(anonymized[\"text\"])\n```\n\n### Process Text with Analysis and Anonymization\n\n```python\nfrom allyanonimiser import create_allyanonimiser, AnalysisConfig, AnonymizationConfig\n\nally = create_allyanonimiser()\n\n# Configure analysis\nanalysis_config = AnalysisConfig(\n    active_entity_types=[\"PERSON\", \"EMAIL_ADDRESS\", \"PHONE_NUMBER\", \"DATE_OF_BIRTH\"],\n    min_score_threshold=0.7\n)\n\n# Configure anonymization\nanonymization_config = AnonymizationConfig(\n    operators={\n        \"PERSON\": \"replace\",\n        \"EMAIL_ADDRESS\": \"mask\",\n        \"PHONE_NUMBER\": \"redact\",\n        \"DATE_OF_BIRTH\": \"age_bracket\"\n    }\n)\n\n# Process text (analyze + anonymize)\nresult = ally.process(\n    text=\"Customer Jane Doe (jane.doe@example.com) called on 0412-345-678 regarding her DOB: 22/05/1990.\",\n    analysis_config=analysis_config,\n    anonymization_config=anonymization_config\n)\n\n# Access different parts of the result\nprint(\"Anonymized text:\")\nprint(result[\"anonymized\"])\n\nprint(\"\\nDetected entities:\")\nfor entity in result[\"analysis\"][\"entities\"]:\n    print(f\"{entity['entity_type']}: {entity['text']} (score: {entity['score']:.2f})\")\n\nprint(\"\\nPII-rich segments:\")\nfor segment in result[\"segments\"]:\n    print(f\"Original: {segment['text']}\")\n    print(f\"Anonymized: {segment['anonymized']}\")\n```\n\n### Working with DataFrames\n\n```python\nimport pandas as pd\nfrom allyanonimiser import create_allyanonimiser\n\n# Create DataFrame\ndf = pd.DataFrame({\n    \"id\": [1, 2, 3],\n    \"notes\": [\n        \"Customer John Smith (DOB: 15/6/1980) called about policy POL-123456.\",\n        \"Jane Doe (email: jane.doe@example.com) requested a refund.\",\n        \"Alex Johnson from Sydney NSW 2000 reported an incident.\"\n    ]\n})\n\n# Create Allyanonimiser\nally = create_allyanonimiser()\n\n# Anonymize a specific 
column\nanonymized_df = ally.process_dataframe(\n    df, \n    column=\"notes\", \n    operation=\"anonymize\",\n    output_column=\"anonymized_notes\",  # New column for anonymized text\n    operators={\n        \"PERSON\": \"replace\",\n        \"EMAIL_ADDRESS\": \"mask\",\n        \"PHONE_NUMBER\": \"redact\"\n    }\n)\n\n# Display result\nprint(anonymized_df[[\"id\", \"notes\", \"anonymized_notes\"]])\n```\n\n## Pattern Generalization Levels\n\nWhen creating patterns from examples using `create_pattern_from_examples()`, you can specify a generalization level that controls how flexible the resulting pattern will be:\n\n| Level | Description | Example Input | Generated Pattern | Will Match |\n|-------|-------------|--------------|------------------|------------|\n| `none` | Exact matching only | \"EMP12345\" | `\\bEMP12345\\b` | Only \"EMP12345\" |\n| `low` | Basic generalization | \"EMP12345\" | `\\bEMP\\d{5}\\b` | \"EMP12345\", \"EMP67890\", but not \"EMP123\" |\n| `medium` | Moderate flexibility | \"EMP12345\" | `\\bEMP\\d+\\b` | \"EMP12345\", \"EMP123\", \"EMP9\", but not \"EMPLOYEE12345\" |\n| `high` | Maximum flexibility | \"EMP12345\" | `\\b[A-Z]{3}\\d+\\b` | \"EMP12345\", \"ABC123\", etc. |\n\nHigher generalization levels detect more variants but may increase false positives. Choose the appropriate level based on your needs for precision vs. 
recall.\n\n## Anonymization Operators\n\nThe package supports several anonymization operators that control how detected entities are transformed:\n\n| Operator | Description | Example | Result |\n|----------|-------------|---------|--------|\n| `replace` | Replace with entity type | \"John Smith\" | `<PERSON>` |\n| `redact` | Fully redact the entity | \"john.smith@example.com\" | `[REDACTED]` |\n| `mask` | Partially mask while preserving structure | \"john.smith@example.com\" | `j***.s****@e******.com` |\n| `hash` | Replace with consistent hash | \"John Smith\" | `7f9d6a...` (same for all \"John Smith\") |\n| `encrypt` | Encrypt with a key (recoverable) | \"John Smith\" | `AES256:a7f9c...` |\n| `age_bracket` | Convert dates to age brackets | \"DOB: 15/03/1980\" | `DOB: 40-44` |\n| `custom` | User-defined function | (depends on function) | (custom output) |\n\nExample usage:\n\n```python\nresult = ally.anonymize(\n    text=\"Please contact John Smith at john.smith@example.com\",\n    operators={\n        \"PERSON\": \"replace\",       # Replace with entity type\n        \"EMAIL_ADDRESS\": \"mask\",   # Partially mask the email\n        \"PHONE_NUMBER\": \"redact\",  # Fully redact phone numbers\n        \"DATE_OF_BIRTH\": \"age_bracket\"  # Convert DOB to age bracket\n    }\n)\n```\n\n### Custom Operator Example\n\nThe custom operator allows you to define your own transformation function:\n\n```python\nfrom allyanonimiser import create_allyanonimiser\n\n# Create a custom transformation function\ndef randomize_names(entity_text, entity_type):\n    \"\"\"Replace person names with random names from a predefined list.\"\"\"\n    if entity_type != \"PERSON\":\n        return entity_text\n        \n    # Simple list of random replacement names\n    replacements = [\"Alex Taylor\", \"Sam Johnson\", \"Jordan Lee\", \"Casey Brown\"]\n    \n    # Use hash of original name to consistently select the same replacement\n    import hashlib\n    hash_val = 
int(hashlib.md5(entity_text.encode()).hexdigest(), 16)\n    index = hash_val % len(replacements)\n    \n    return replacements[index]\n\n# Create an Allyanonimiser instance\nally = create_allyanonimiser()\n\n# Use the custom operator\nresult = ally.anonymize(\n    text=\"Customer John Smith sent an email to Mary Johnson about policy POL-123456.\",\n    operators={\n        \"PERSON\": randomize_names,  # Pass the function directly\n        \"POLICY_NUMBER\": \"mask\"     # Other operators work as usual\n    }\n)\n\nprint(result[\"text\"])\n# Example output (which replacement names are picked depends on the MD5 hash of each original name):\n# \"Customer Alex Taylor sent an email to Sam Johnson about policy ***-******.\"\n```\n\nCustom operators are useful for specialized anonymization needs such as:\n- Generating synthetic but realistic replacements\n- Contextual anonymization based on entity values\n- Domain-specific transformations (e.g., preserving data distributions)\n- Implementing differential privacy mechanisms\n\n### Pattern Management\n\n```python\nfrom allyanonimiser import create_allyanonimiser\n\n# Create an instance\nally = create_allyanonimiser()\n\n# 1. Adding a pattern to an existing group in the pattern files\n# If you want to contribute a new pattern to the codebase,\n# edit the appropriate file in the patterns/ directory:\n#  - patterns/au_patterns.py: For Australian-specific patterns\n#  - patterns/general_patterns.py: For general PII patterns\n#  - patterns/insurance_patterns.py: For insurance-specific patterns\n\n# 2. 
Using custom patterns without modifying code\n# Add a custom pattern with detailed options\nally.add_pattern({\n    \"entity_type\": \"COMPANY_PROJECT_ID\",\n    \"patterns\": [\n        r\"PRJ-\\d{4}-[A-Z]{3}\",       # Format: PRJ-1234-ABC\n        r\"Project\\s+ID:\\s*(\\d{4})\"   # Format: Project ID: 1234\n    ],\n    \"context\": [\"project\", \"id\", \"identifier\", \"code\"],\n    \"name\": \"Company Project ID\",\n    \"score\": 0.85,                   # Confidence score (0-1)\n    \"language\": \"en\",                # Language code\n    \"description\": \"Internal project identifier format\"\n})\n\n# 3. Save patterns for reuse\nally.export_config(\"company_patterns.json\")\n\n# 4. Load saved patterns in another session\nnew_ally = create_allyanonimiser(settings_path=\"company_patterns.json\")\n\n# 5. Pattern testing and validation\nfrom allyanonimiser.validators import test_pattern_against_examples\n\n# Test if a pattern works against examples\nresults = test_pattern_against_examples(\n    pattern=r\"PRJ-\\d{4}-[A-Z]{3}\",\n    positive_examples=[\"PRJ-1234-ABC\", \"PRJ-5678-XYZ\"],\n    negative_examples=[\"PRJ-123-AB\", \"PROJECT-1234\"]\n)\nprint(f\"Pattern is valid: {results['is_valid']}\")\nprint(f\"Diagnostic info: {results['message']}\")\n```\n\n## Advanced Features\n\n### Generating Reports\n\n```python\nfrom allyanonimiser import create_allyanonimiser\nimport os\n\n# Create output directory\nos.makedirs(\"output\", exist_ok=True)\n\n# Create an Allyanonimiser instance\nally = create_allyanonimiser()\n\n# Start a new report session\nally.start_new_report(session_id=\"example_session\")\n\n# Configure anonymization\nanonymization_config = {\n    \"operators\": {\n        \"PERSON\": \"replace\",\n        \"EMAIL_ADDRESS\": \"mask\",\n        \"PHONE_NUMBER\": \"redact\",\n        \"AU_ADDRESS\": \"replace\",\n        \"DATE_OF_BIRTH\": \"age_bracket\",\n        \"AU_TFN\": \"hash\",\n        \"AU_MEDICARE\": \"mask\"\n    },\n    
\"age_bracket_size\": 10\n}\n\n# Process a batch of files\nresult = ally.process_files(\n    file_paths=[\"data/sample1.txt\", \"data/sample2.txt\", \"data/sample3.txt\"],\n    output_dir=\"output/anonymized\",\n    anonymize=True,\n    operators=anonymization_config[\"operators\"],\n    report=True,\n    report_output=\"output/report.html\",\n    report_format=\"html\"\n)\n\n# Display summary\nprint(f\"Processed {result['total_files']} files\")\nprint(f\"Detected {result['report']['total_entities']} entities\")\nprint(f\"Average processing time: {result['report']['report']['avg_processing_time']*1000:.2f} ms\")\n```\n\n### Comprehensive Reporting System\n\nAllyanonimiser includes a comprehensive reporting system that allows you to track, analyze, and visualize anonymization activities.\n\n```python\nfrom allyanonimiser import create_allyanonimiser\n\n# Create instance\nally = create_allyanonimiser()\n\n# Start a new report session\nally.start_new_report(\"my_session\")\n\n# Process multiple texts\ntexts = [\n    \"Customer John Smith (DOB: 15/06/1980) called about claim CL-12345.\",\n    \"Jane Doe (email: jane.doe@example.com) requested policy information.\",\n    \"Claims assessor reviewed case for Robert Johnson (ID: 987654321).\"\n]\n\nfor i, text in enumerate(texts):\n    ally.anonymize(\n        text=text,\n        operators={\n            \"PERSON\": \"replace\",\n            \"EMAIL_ADDRESS\": \"mask\",\n            \"DATE_OF_BIRTH\": \"age_bracket\"\n        },\n        document_id=f\"doc_{i+1}\"\n    )\n\n# Get report summary\nreport = ally.get_report()\nsummary = report.get_summary()\n\nprint(f\"Total documents processed: {summary['total_documents']}\")\nprint(f\"Total entities detected: {summary['total_entities']}\")\nprint(f\"Entities per document: {summary['entities_per_document']:.2f}\")\nprint(f\"Anonymization rate: {summary['anonymization_rate']*100:.2f}%\")\nprint(f\"Average processing time: {summary['avg_processing_time']*1000:.2f} ms\")\n\n# 
Export report to different formats\nreport.export_report(\"report.html\", \"html\")  # Rich HTML visualization\nreport.export_report(\"report.json\", \"json\")  # Detailed JSON data\nreport.export_report(\"report.csv\", \"csv\")    # CSV statistics\n```\n\n### In Jupyter Notebooks\n\n```python\nfrom allyanonimiser import create_allyanonimiser\nimport pandas as pd\nimport matplotlib.pyplot as plt\n\n# Create an Allyanonimiser instance\nally = create_allyanonimiser()\n\n# Start a report session and process some texts\n# ... processing code ...\n\n# Display rich interactive report\nally.display_report_in_notebook()\n\n# Access report data programmatically\nreport = ally.get_report()\nsummary = report.get_summary()\n\n# Create custom visualizations\nentity_counts = summary['entity_counts']\nplt.figure(figsize=(10, 6))\nplt.bar(entity_counts.keys(), entity_counts.values())\nplt.title('Entity Type Distribution')\nplt.xlabel('Entity Type')\nplt.ylabel('Count')\nplt.xticks(rotation=45)\nplt.tight_layout()\nplt.show()\n```\n\n## What's New in Version 2.1.0\n\n- Added support for NSW legacy driver's license pattern\n- Improved pattern recognition for Australian TFNs\n- Enhanced handling of date formats\n- Fixed issues with BSB recognition\n- Added more comprehensive test suite for Australian patterns\n- Performance improvements for large file processing\n\nFor older versions and detailed change history, see the [CHANGELOG.md](CHANGELOG.md) file.\n\n## Documentation\n\nFor complete documentation, examples, and advanced usage, visit our [documentation site](https://srepho.github.io/Allyanonimiser/):\n\n- [Installation Guide](https://srepho.github.io/Allyanonimiser/getting-started/installation/)\n- [Quick Start Tutorial](https://srepho.github.io/Allyanonimiser/getting-started/quick-start/)\n- [Pattern Reference](https://srepho.github.io/Allyanonimiser/patterns/overview/)\n- [Anonymization Operators](https://srepho.github.io/Allyanonimiser/advanced/anonymization-operators/)\n- 
[API Reference](https://srepho.github.io/Allyanonimiser/api/main/)\n\n## License\n\nThis project is licensed under the MIT License - see the LICENSE file for details.\n",
    "bugtrack_url": null,
    "license": "MIT License\n        \n        Copyright (c) 2025 Stephen Oates\n        \n        Permission is hereby granted, free of charge, to any person obtaining a copy\n        of this software and associated documentation files (the \"Software\"), to deal\n        in the Software without restriction, including without limitation the rights\n        to use, copy, modify, merge, publish, distribute, sublicense, and/or sell\n        copies of the Software, and to permit persons to whom the Software is\n        furnished to do so, subject to the following conditions:\n        \n        The above copyright notice and this permission notice shall be included in all\n        copies or substantial portions of the Software.\n        \n        THE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR\n        IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,\n        FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE\n        AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER\n        LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,\n        OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE\n        SOFTWARE.\n        ",
    "summary": "Australian-focused PII detection and anonymization for the insurance industry",
    "version": "2.4.0",
    "project_urls": {
        "Bug Tracker": "https://github.com/srepho/Allyanonimiser/issues",
        "Documentation": "https://github.com/srepho/Allyanonimiser#readme",
        "Homepage": "https://github.com/srepho/Allyanonimiser"
    },
    "split_keywords": [
        "pii",
        " anonymization",
        " privacy",
        " insurance",
        " australia"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "8ed07e387fafb7dbf6e27b7b7a4a7d8ab91e631b36adc9579447562938f15bf0",
                "md5": "3fa78e3ba597feacef1fa2b3cca31acb",
                "sha256": "09ce7feb7b6004c126ae2f76f9adc4cc3c135751c7f54bc7e84ac50d2b109f29"
            },
            "downloads": -1,
            "filename": "allyanonimiser-2.4.0-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "3fa78e3ba597feacef1fa2b3cca31acb",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.10",
            "size": 170784,
            "upload_time": "2025-07-25T06:53:24",
            "upload_time_iso_8601": "2025-07-25T06:53:24.338326Z",
            "url": "https://files.pythonhosted.org/packages/8e/d0/7e387fafb7dbf6e27b7b7a4a7d8ab91e631b36adc9579447562938f15bf0/allyanonimiser-2.4.0-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "d9102803712bce5dbec508923a272b27cffe9760aa7e9571779b1ab47b9f5569",
                "md5": "89c30c8a5b25eaf5624cb2f79caae788",
                "sha256": "993bb5388690ec3bb923054ebe460562a6b1c2fc4f33543c7916967bd07d5955"
            },
            "downloads": -1,
            "filename": "allyanonimiser-2.4.0.tar.gz",
            "has_sig": false,
            "md5_digest": "89c30c8a5b25eaf5624cb2f79caae788",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.10",
            "size": 153664,
            "upload_time": "2025-07-25T06:53:26",
            "upload_time_iso_8601": "2025-07-25T06:53:26.711376Z",
            "url": "https://files.pythonhosted.org/packages/d9/10/2803712bce5dbec508923a272b27cffe9760aa7e9571779b1ab47b9f5569/allyanonimiser-2.4.0.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-07-25 06:53:26",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "srepho",
    "github_project": "Allyanonimiser",
    "travis_ci": false,
    "coveralls": true,
    "github_actions": true,
    "requirements": [
        {
            "name": "spacy",
            "specs": [
                [
                    ">=",
                    "3.5.0"
                ]
            ]
        },
        {
            "name": "presidio-analyzer",
            "specs": [
                [
                    ">=",
                    "2.2.0"
                ]
            ]
        },
        {
            "name": "presidio-anonymizer",
            "specs": [
                [
                    ">=",
                    "2.2.0"
                ]
            ]
        },
        {
            "name": "pandas",
            "specs": [
                [
                    ">=",
                    "1.0.0"
                ]
            ]
        },
        {
            "name": "tqdm",
            "specs": [
                [
                    ">=",
                    "4.64.0"
                ]
            ]
        },
        {
            "name": "pytest",
            "specs": [
                [
                    ">=",
                    "7.0.0"
                ]
            ]
        },
        {
            "name": "pytest-cov",
            "specs": [
                [
                    ">=",
                    "4.0.0"
                ]
            ]
        },
        {
            "name": "black",
            "specs": [
                [
                    ">=",
                    "23.0.0"
                ]
            ]
        },
        {
            "name": "isort",
            "specs": [
                [
                    ">=",
                    "5.0.0"
                ]
            ]
        },
        {
            "name": "flake8",
            "specs": [
                [
                    ">=",
                    "6.0.0"
                ]
            ]
        },
        {
            "name": "mypy",
            "specs": [
                [
                    ">=",
                    "1.0.0"
                ]
            ]
        }
    ],
    "lcname": "allyanonimiser"
}

Stephen Oates