nopii


- **Name**: nopii
- **Version**: 0.1.3
- **Summary**: A batteries-included Python toolkit for detecting, transforming, masking, pseudonymizing, and auditing PII across data engineering workflows
- **Upload time**: 2025-10-22 04:51:42
- **Author**: ay-mich
- **Requires Python**: >=3.12
- **Keywords**: pii, privacy, data-protection, transformation, compliance, gdpr, ccpa
- **Requirements**: No requirements were recorded.

# NoPII

A Python package for detecting, transforming, and auditing Personally Identifiable Information (PII) in your data. Supports multiple data sources including CSV, JSON, Parquet, and pandas DataFrames with policy-driven configuration.

## Features

### 🔍 **PII Detection**

- **Built-in Detectors**: Identifies email addresses, phone numbers, credit cards, SSNs, IP addresses, names, addresses, and dates of birth
- **Confidence Scoring**: Each detection includes a confidence score (0-100%) with configurable thresholds to balance precision and recall
- **Custom Pattern Support**: Create your own detectors using regex patterns or implement the BaseDetector interface for complex logic
- **Multi-language Support**: Localized detection patterns for different regions and formats (US phone numbers, international emails, etc.)
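
To make the idea of regex-based detection with confidence thresholds concrete, here is a minimal standalone sketch. It does not use NoPII's own detectors or its `BaseDetector` interface; the `Finding` dataclass, the `PATTERNS` table, and the confidence values are assumptions chosen only for illustration.

```python
import re
from dataclasses import dataclass


@dataclass
class Finding:
    pii_type: str
    value: str
    start: int
    end: int
    confidence: float  # 0.0 to 1.0


# Hypothetical patterns and confidence values, chosen only for illustration.
PATTERNS = {
    "email": (re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"), 0.95),
    "phone": (re.compile(r"\b\d{3}-\d{3}-\d{4}\b"), 0.80),
    "ipv4": (re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"), 0.60),
}


def detect(text: str, min_confidence: float = 0.7) -> list[Finding]:
    """Return regex matches whose confidence meets the threshold, in text order."""
    findings = []
    for pii_type, (pattern, confidence) in PATTERNS.items():
        if confidence < min_confidence:
            continue  # detector is too noisy for the requested threshold
        for m in pattern.finditer(text):
            findings.append(Finding(pii_type, m.group(), m.start(), m.end(), confidence))
    return sorted(findings, key=lambda f: f.start)


sample = "Contact john@example.com or 555-123-4567 from 10.0.0.1"
for f in detect(sample):
    print(f.pii_type, f.value, f.confidence)
```

Raising `min_confidence` trades recall for precision: with the default of 0.7 the low-confidence `ipv4` pattern in this sketch is skipped, while `detect(sample, min_confidence=0.5)` would include it.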

### 🛡️ **Transformation Strategies**

- **Masking**: Replace characters with asterisks or custom symbols while preserving format (e.g., `john@example.com` → `****@example.com`)
- **Redacting**: Replace entire PII values with placeholder text (e.g., `john@example.com` → `[REDACTED]`)
- **Hashing**: One-way cryptographic transformation using SHA-256 or other algorithms, with optional salt for security
- **Tokenization**: Replace with reversible tokens for data analysis while maintaining referential integrity across datasets
- **Nullification**: Replace with null/empty values for complete data removal
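
The strategies listed above can be sketched with the standard library alone. This is an illustration of the concepts, not NoPII's implementation: every function name and the `TokenVault` class are hypothetical, and prefixing the salt to the value before hashing is one simple choice among several.

```python
import hashlib


def mask_email(value: str, mask_char: str = "*") -> str:
    """Mask the local part of an email and keep the domain readable."""
    local, sep, domain = value.partition("@")
    return mask_char * len(local) + sep + domain


def redact(value: str, placeholder: str = "[REDACTED]") -> str:
    """Replace the whole value with a fixed placeholder."""
    return placeholder


def hash_value(value: str, salt: str = "", algorithm: str = "sha256") -> str:
    """One-way hash of salt + value, returned as a hex string."""
    digest = hashlib.new(algorithm)
    digest.update((salt + value).encode("utf-8"))
    return digest.hexdigest()


def nullify(value: str) -> None:
    """Drop the value entirely."""
    return None


class TokenVault:
    """In-memory reversible tokenization: the same input always gets the same token."""

    def __init__(self) -> None:
        self._forward: dict[str, str] = {}
        self._reverse: dict[str, str] = {}

    def tokenize(self, value: str) -> str:
        if value not in self._forward:
            token = f"TOK_{len(self._forward) + 1:06d}"
            self._forward[value] = token
            self._reverse[token] = value
        return self._forward[value]

    def detokenize(self, token: str) -> str:
        return self._reverse[token]


print(mask_email("john@example.com"))  # ****@example.com
print(redact("john@example.com"))      # [REDACTED]
```

Because the vault returns the same token for the same input, joins across datasets still line up after tokenization, which is what "referential integrity" refers to in the list above.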

### 📊 **Data Processing**

- **Pandas DataFrames**: Process tabular data with vectorized operations for performance, supporting column-wise scanning and transformation
- **File Formats**: Direct support for CSV, JSON, Parquet, and Excel files with streaming for large datasets
- **Text & Dictionaries**: Scan and transform plain text strings and Python dictionaries for flexible data handling
- **Memory Efficient**: Streaming processing for large files to avoid loading entire datasets into memory

### 📋 **Policy Management**

- **YAML Configuration**: Human-readable policy files defining detection rules, transformation actions, and confidence thresholds
- **Rule-based System**: Match PII types (email, phone, ssn) to specific actions (mask, redact, hash) with customizable options
- **Exception Handling**: Define patterns to skip (e.g., company email domains, test data) with regex-based exclusions
- **Policy Validation**: Built-in validation ensures policy syntax is correct and transformation options are compatible
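
As a rough illustration of regex-based exclusions, the sketch below filters candidate values against a list of skip patterns before any transformation would run. The `EXCLUSIONS` list and `is_excluded` helper are hypothetical; the schema NoPII uses for policy exceptions is not shown in this README.

```python
import re

# Hypothetical exclusion patterns: an internal domain and obvious test data.
EXCLUSIONS = [
    re.compile(r"@mycompany\.example$", re.IGNORECASE),
    re.compile(r"^test[._-]", re.IGNORECASE),
]


def is_excluded(value: str) -> bool:
    """True if any exclusion pattern matches the detected value."""
    return any(p.search(value) for p in EXCLUSIONS)


candidates = ["jane@mycompany.example", "test.user@example.com", "john@example.com"]
to_transform = [v for v in candidates if not is_excluded(v)]
print(to_transform)  # ['john@example.com']
```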

### 🔧 **CLI & SDK**

- **Command Line Interface**: Five main commands (scan, transform, report, diff, policy) for file processing and policy management
- **Python SDK**: High-level NoPIIClient for quick operations and low-level Scanner/Transform classes for fine-grained control
- **Audit Reporting**: JSON audit trails with HTML/Markdown report generation including coverage metrics and PII type breakdowns
- **Coverage Scoring**: Quantitative metrics showing percentage of data scanned and residual risk assessment

## Installation

```bash
pip install nopii
```

The base installation includes core PII detection and transformation capabilities for text files, JSON, and basic CSV processing.

### Optional Dependencies

Install optional extras for extended functionality:

```bash
# Pandas support for DataFrame operations and advanced tabular file formats
# Enables: Excel files, Parquet, advanced CSV operations, column-wise processing
pip install "nopii[pandas]"

# HTML reporting with styled templates and interactive elements
# Enables: Rich HTML reports, charts, detailed PII breakdowns, export options
pip install "nopii[report-html]"

# Install all optional dependencies
pip install "nopii[pandas,report-html]"
```

## Quick Start

### CLI Usage

The CLI provides five main commands for different PII processing workflows:

```bash
# Scan: Detect PII without modifying data
# Outputs findings with confidence scores and locations
nopii scan data.csv --format json --output scan_results.json

# Transform: Remove or mask PII from files
# Creates cleaned data + audit trail of what was changed
nopii transform data.csv transformed_data.csv --audit-report audit.json

# Report: Generate human-readable reports from audit data
# Convert JSON audit logs into HTML/Markdown with charts and summaries
nopii report audit.json --format html --output report.html

# Diff: Compare original vs transformed files
# Shows exactly what PII was detected and how it was changed
nopii diff original.csv transformed.csv

# Policy: Manage detection and transformation rules
# Validate YAML policies or create new ones
nopii policy validate my_policy.yaml

# Create a new policy file
nopii policy create new_policy.yaml --default-action redact
```

Note: the CLI is also available under the alias `no-pii`, so commands such as `nopii scan data.csv --format json` can be invoked with either name.

Exit codes:

- `0` when no PII is detected
- `1` when PII is found
- Non‑zero on errors

### Python SDK / Core

The SDK provides two levels of access: low-level core classes for fine-grained control and a high-level client for quick operations.

#### Core Classes (Low-level API)

Use Scanner and Transform classes directly when you need precise control over detection and transformation:

```python
from nopii.core.scanner import Scanner
from nopii.core.transform import Transform
from nopii.policy.loader import create_default_policy, load_policy

# Load policy (default or custom YAML)
policy = create_default_policy()  # or load_policy("policy.yaml")

# Scanner: Detect PII without modifying data
# Returns list of Finding objects with location, confidence, and PII type
scanner = Scanner(policy)
findings = scanner.scan_text("Contact john@example.com or 555-123-4567")
print(f"Found {len(findings)} findings")

# Transform: Apply policy actions (mask, redact, hash) to PII
# Returns tuple of (cleaned_text, findings_list)
transformer = Transform(policy)
transformed_text, findings = transformer.transform_text("Contact john@example.com or 555-123-4567")
print(f"Transformed: {transformed_text}")

# DataFrame operations (requires pandas extra)
import pandas as pd
df = pd.DataFrame({"email": ["john@example.com"], "phone": ["555-123-4567"]})

# Scan entire DataFrame, get detailed results per column
scan_result = scanner.scan_dataframe(df, dataset_name="contacts")

# Transform DataFrame, get cleaned data + comprehensive audit report
df_transformed, audit_report = transformer.transform_dataframe(df, dataset_name="contacts")
print(f"Coverage: {audit_report.coverage_score:.1%}, Risk: {audit_report.residual_risk}")
````

#### High-Level Client (Quick Operations)

Use NoPIIClient for simple, one-line operations with sensible defaults:

```python
from nopii.sdk import NoPIIClient

client = NoPIIClient()

# Scan text
findings = client.scanner.scan_text("Contact john@example.com")
print(f"Found {len(findings)} PII items")

# Transform text
result = client.transform_text("Contact john@example.com")
print(result)  # "Contact ****@example.com"
```

### DataFrame Processing

```python
import pandas as pd
from nopii.core.scanner import Scanner
from nopii.core.transform import Transform
from nopii.policy.loader import create_default_policy

policy = create_default_policy()

scanner = Scanner(policy)
transformer = Transform(policy)

# Load and process data
df = pd.read_csv("customer_data.csv")
scan_result = scanner.scan_dataframe(df, dataset_name="customers")
transformed_df, audit = transformer.transform_dataframe(df, dataset_name="customers")

# Review results
print(f"Processed {len(df)} rows, coverage: {audit.coverage_score:.1%}")
print(f"PII types found: {[f.pii_type for f in scan_result.findings]}")
print(f"Columns affected: {len(audit.column_reports)}")
```

### Performance & Streaming

NoPII is designed for efficient processing of large datasets:

**Memory-Efficient Streaming:**

- CLI and SDK automatically stream `.csv`, `.txt`, and `.md` files to avoid loading entire files into memory
- Processes files line-by-line or in configurable chunks (default: 1000 rows)
- Suitable for multi-GB files on standard hardware

**In-Memory Operations:**

- JSON/Parquet files and DataFrame operations require pandas and load data into memory
- Recommended for files under 1GB or when you need full DataFrame functionality
- For very large JSON, consider line-delimited JSON (JSONL) and chunked processing.
- Coverage metrics for streaming scans are computed without a full DataFrame, using policy rules and detected items.
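
The chunked-streaming idea can be sketched with the standard `csv` module. This is not NoPII's internal reader; `iter_chunks` is a hypothetical helper, and the chunk size of 1000 simply mirrors the default mentioned above.

```python
import csv
import io
from itertools import islice
from typing import Iterator, TextIO


def iter_chunks(handle: TextIO, chunk_size: int = 1000) -> Iterator[list[dict]]:
    """Yield lists of at most chunk_size rows without reading the whole file."""
    reader = csv.DictReader(handle)
    while True:
        chunk = list(islice(reader, chunk_size))
        if not chunk:
            return
        yield chunk


# Small in-memory stand-in for a large CSV file.
data = io.StringIO("email,phone\n" + "a@example.com,555-123-4567\n" * 2500)
sizes = [len(chunk) for chunk in iter_chunks(data)]
print(sizes)  # [1000, 1000, 500]
```

Only one chunk is held in memory at a time, so peak memory depends on `chunk_size` rather than on the size of the file.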

## Policy Configuration (YAML)

```yaml
name: my_policy
default_action: mask
thresholds:
  min_confidence: 0.7
rules:
  - match: email
    action: mask
    options:
      mask_char: "*"
  - match: phone
    action: redact
  - match: ssn
    action: hash
    options:
      algorithm: sha256
exceptions: []
```
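
One plausible reading of how such a policy maps a finding to an action is sketched below. The policy is written as the dict a YAML parser would produce. The resolution order (threshold check first, then the first matching rule, then `default_action`) is an assumption for illustration, and `resolve_action` is a hypothetical helper rather than part of NoPII.

```python
# The YAML policy above, written as the dict a YAML parser would produce.
policy = {
    "name": "my_policy",
    "default_action": "mask",
    "thresholds": {"min_confidence": 0.7},
    "rules": [
        {"match": "email", "action": "mask", "options": {"mask_char": "*"}},
        {"match": "phone", "action": "redact"},
        {"match": "ssn", "action": "hash", "options": {"algorithm": "sha256"}},
    ],
    "exceptions": [],
}


def resolve_action(policy: dict, pii_type: str, confidence: float):
    """Return (action, options) for a finding, or None if below the threshold."""
    if confidence < policy["thresholds"]["min_confidence"]:
        return None  # assumed: low-confidence findings are ignored
    for rule in policy["rules"]:
        if rule["match"] == pii_type:
            return rule["action"], rule.get("options", {})
    return policy["default_action"], {}  # assumed: fallback for unmatched types


print(resolve_action(policy, "ssn", 0.9))         # ('hash', {'algorithm': 'sha256'})
print(resolve_action(policy, "ip_address", 0.9))  # ('mask', {})
```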

### Rule Options Validation

Policy rule `options` are validated based on the rule `action`:

- mask
  - `mask_char`: string
  - `preserve_first`: integer
  - `preserve_last`: integer
- hash
  - `algorithm`: one of `md5`, `sha1`, `sha256`, `sha512`
  - `max_length`: integer
- tokenize
  - `deterministic`: boolean
  - `token_length`: integer

Invalid or mismatched types will be reported by `PolicyValidator` as errors when loading/validating a policy.
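
The type rules above can be approximated with a small standalone check. This is a sketch, not the `PolicyValidator` class itself: `validate_options`, the error message wording, and the choice to reject unknown option names are all assumptions.

```python
ALLOWED_HASHES = {"md5", "sha1", "sha256", "sha512"}

# Expected Python type for each option, keyed by action.
OPTION_TYPES = {
    "mask": {"mask_char": str, "preserve_first": int, "preserve_last": int},
    "hash": {"algorithm": str, "max_length": int},
    "tokenize": {"deterministic": bool, "token_length": int},
}


def validate_options(action: str, options: dict) -> list[str]:
    """Return a list of error messages; an empty list means the options are valid."""
    errors = []
    schema = OPTION_TYPES.get(action, {})
    for key, value in options.items():
        expected = schema.get(key)
        if expected is None:
            errors.append(f"{action}: unknown option '{key}'")
        # bool is a subclass of int in Python, so reject it for integer options.
        elif not isinstance(value, expected) or (expected is int and isinstance(value, bool)):
            errors.append(f"{action}.{key}: expected {expected.__name__}")
        elif key == "algorithm" and value not in ALLOWED_HASHES:
            errors.append(f"hash.algorithm: unsupported value '{value}'")
    return errors


print(validate_options("mask", {"mask_char": "*", "preserve_last": 4}))  # []
print(validate_options("hash", {"algorithm": "crc32"}))
```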

## Performance

- Streams large CSV/text files to avoid memory issues
- Processes multi-GB files efficiently
- DataFrame operations require pandas (in-memory)

## License

This project is licensed under the Apache License, Version 2.0 - see the [LICENSE](LICENSE) file for details.

            

Raw data

            {
    "_id": null,
    "home_page": null,
    "name": "nopii",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.12",
    "maintainer_email": null,
    "keywords": "pii, privacy, data-protection, transformation, compliance, gdpr, ccpa",
    "author": "ay-mich",
    "author_email": null,
    "download_url": "https://files.pythonhosted.org/packages/af/50/730e7cc19826e3171195769b375c1e71b6214fd05a350f5620cd539297f3/nopii-0.1.3.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": null,
    "summary": "A batteries-included Python toolkit for detecting, transforming, masking, pseudonymizing, and auditing PII across data engineering workflows",
    "version": "0.1.3",
    "project_urls": {
        "Documentation": "https://ay-mich.github.io/nopii/nopii.html",
        "Homepage": "https://github.com/ay-mich/nopii",
        "Issues": "https://github.com/ay-mich/nopii/issues",
        "Repository": "https://github.com/ay-mich/nopii"
    },
    "split_keywords": [
        "pii",
        " privacy",
        " data-protection",
        " transformation",
        " compliance",
        " gdpr",
        " ccpa"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "1c382ee96bc2a5282ac04afc44ed54d43ef7744be8209259625ea755319c66af",
                "md5": "de95a5fb755aa69956b8ce738f70c1a2",
                "sha256": "18a0d5e489f294490339eb70b45dff2f6ea994a99a56b563b54751795946e859"
            },
            "downloads": -1,
            "filename": "nopii-0.1.3-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "de95a5fb755aa69956b8ce738f70c1a2",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.12",
            "size": 78508,
            "upload_time": "2025-10-22T04:51:40",
            "upload_time_iso_8601": "2025-10-22T04:51:40.393109Z",
            "url": "https://files.pythonhosted.org/packages/1c/38/2ee96bc2a5282ac04afc44ed54d43ef7744be8209259625ea755319c66af/nopii-0.1.3-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "af50730e7cc19826e3171195769b375c1e71b6214fd05a350f5620cd539297f3",
                "md5": "a1081e774f6e7bea3e4ae612422efb41",
                "sha256": "e4b3038426364368a9bfc2e377c93b94537db1a4aaf1fa713fd065d5f9005269"
            },
            "downloads": -1,
            "filename": "nopii-0.1.3.tar.gz",
            "has_sig": false,
            "md5_digest": "a1081e774f6e7bea3e4ae612422efb41",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.12",
            "size": 77286,
            "upload_time": "2025-10-22T04:51:42",
            "upload_time_iso_8601": "2025-10-22T04:51:42.358489Z",
            "url": "https://files.pythonhosted.org/packages/af/50/730e7cc19826e3171195769b375c1e71b6214fd05a350f5620cd539297f3/nopii-0.1.3.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-10-22 04:51:42",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "ay-mich",
    "github_project": "nopii",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "requirements": [],
    "lcname": "nopii"
}
        