| Field | Value |
| --- | --- |
| Name | nopii |
| Version | 0.1.3 |
| home_page | None |
| Summary | A batteries-included Python toolkit for detecting, transforming, masking, pseudonymizing, and auditing PII across data engineering workflows |
| upload_time | 2025-10-22 04:51:42 |
| maintainer | None |
| docs_url | None |
| author | ay-mich |
| requires_python | >=3.12 |
| license | None |
| keywords | pii, privacy, data-protection, transformation, compliance, gdpr, ccpa |
| VCS | |
| bugtrack_url | |
| requirements | No requirements were recorded. |
| Travis-CI | No Travis. |
| coveralls test coverage | No coveralls. |

# NoPII
A Python package for detecting, transforming, and auditing Personally Identifiable Information (PII) in your data. Supports multiple data sources including CSV, JSON, Parquet, and pandas DataFrames with policy-driven configuration.
## Features
### 🔍 **PII Detection**
- **Built-in Detectors**: Identifies email addresses, phone numbers, credit cards, SSNs, IP addresses, names, addresses, and dates of birth
- **Confidence Scoring**: Each detection includes a confidence score (0-100%) with configurable thresholds to balance precision and recall
- **Custom Pattern Support**: Create your own detectors using regex patterns or implement the BaseDetector interface for complex logic
- **Multi-language Support**: Localized detection patterns for different regions and formats (US phone numbers, international emails, etc.)
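
As a rough illustration of regex-based detection with confidence thresholds, here is a minimal standalone sketch. It does not use nopii's `BaseDetector` interface, whose signature is not shown in this README; the `Finding` dataclass, the `RegexDetector` class, the `scan` helper, the patterns, and the fixed confidence values are all hypothetical.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    """Hypothetical finding record; nopii's own Finding type may differ."""
    pii_type: str
    value: str
    start: int
    end: int
    confidence: float  # 0.0 to 1.0


class RegexDetector:
    """Illustrative detector: every regex match gets one fixed confidence."""

    def __init__(self, pii_type: str, pattern: str, confidence: float) -> None:
        self.pii_type = pii_type
        self.regex = re.compile(pattern)
        self.confidence = confidence

    def detect(self, text: str) -> list[Finding]:
        return [
            Finding(self.pii_type, m.group(), m.start(), m.end(), self.confidence)
            for m in self.regex.finditer(text)
        ]


def scan(text: str, detectors: list[RegexDetector], min_confidence: float = 0.7) -> list[Finding]:
    """Run all detectors and keep findings at or above the threshold."""
    findings = [f for d in detectors for f in d.detect(text)]
    kept = [f for f in findings if f.confidence >= min_confidence]
    return sorted(kept, key=lambda f: f.start)


detectors = [
    RegexDetector("email", r"[\w.+-]+@[\w-]+\.[\w.-]+", 0.95),
    RegexDetector("phone", r"\b\d{3}-\d{3}-\d{4}\b", 0.80),
    # A deliberately loose pattern with low confidence; the default
    # threshold filters its matches out.
    RegexDetector("employee_id", r"\b\d{4}\b", 0.40),
]

results = scan("Contact john@example.com or 555-123-4567", detectors)
for finding in results:
    print(finding.pii_type, finding.value, finding.confidence)
```

Raising `min_confidence` favors precision and lowering it favors recall, which is the trade-off the confidence threshold is meant to control.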
### 🛡️ **Transformation Strategies**
- **Masking**: Replace characters with asterisks or custom symbols while preserving format (e.g., `john@example.com` → `****@example.com`)
- **Redacting**: Replace entire PII values with placeholder text (e.g., `john@example.com` → `[REDACTED]`)
- **Hashing**: One-way cryptographic transformation using SHA-256 or other algorithms, with optional salt for security
- **Tokenization**: Replace with reversible tokens for data analysis while maintaining referential integrity across datasets
- **Nullification**: Replace with null/empty values for complete data removal
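
The strategies above can be sketched with the standard library alone. This is an illustration of the ideas, not nopii's implementation: the function names, the way the salt is combined with the value, and the in-memory `Tokenizer` class are assumptions made for the example.

```python
import hashlib
import secrets


def mask_email(value: str, mask_char: str = "*") -> str:
    """Mask only the local part so the address keeps its shape."""
    local, _, domain = value.partition("@")
    return f"{mask_char * len(local)}@{domain}"


def mask(value: str, mask_char: str = "*", preserve_last: int = 0) -> str:
    """Mask every character except the last `preserve_last` ones."""
    keep = min(max(preserve_last, 0), len(value))
    cut = len(value) - keep
    return mask_char * cut + value[cut:]


def redact(_value: str, placeholder: str = "[REDACTED]") -> str:
    """Replace the whole value with a fixed placeholder."""
    return placeholder


def hash_value(value: str, salt: str = "", algorithm: str = "sha256") -> str:
    """One-way hash; here the salt is simply prepended (an assumption)."""
    return hashlib.new(algorithm, (salt + value).encode("utf-8")).hexdigest()


class Tokenizer:
    """Reversible tokens backed by an in-memory lookup table (hypothetical)."""

    def __init__(self) -> None:
        self._forward: dict[str, str] = {}
        self._reverse: dict[str, str] = {}

    def tokenize(self, value: str) -> str:
        # Reusing the token for a repeated value keeps joins across
        # datasets possible (referential integrity).
        if value not in self._forward:
            token = "tok_" + secrets.token_hex(8)
            self._forward[value] = token
            self._reverse[token] = value
        return self._forward[value]

    def detokenize(self, token: str) -> str:
        return self._reverse[token]


print(mask_email("john@example.com"))
print(mask("555-123-4567", preserve_last=4))
print(redact("john@example.com"))
```

A real tokenization service would persist the mapping in a protected store; the dictionary here only demonstrates why tokens are reversible while hashes are not.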
### 📊 **Data Processing**
- **Pandas DataFrames**: Process tabular data with vectorized operations for performance, supporting column-wise scanning and transformation
- **File Formats**: Direct support for CSV, JSON, Parquet, and Excel files, with streaming for large CSV and text files
- **Text & Dictionaries**: Scan and transform plain text strings and Python dictionaries for flexible data handling
- **Memory Efficient**: Streaming processing for large files to avoid loading entire datasets into memory
### 📋 **Policy Management**
- **YAML Configuration**: Human-readable policy files defining detection rules, transformation actions, and confidence thresholds
- **Rule-based System**: Match PII types (email, phone, ssn) to specific actions (mask, redact, hash) with customizable options
- **Exception Handling**: Define patterns to skip (e.g., company email domains, test data) with regex-based exclusions
- **Policy Validation**: Built-in validation ensures policy syntax is correct and transformation options are compatible
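
To make the rule-based model concrete, the sketch below resolves which action applies to a single detection. It is a standalone illustration rather than nopii's engine: the `resolve_action` helper is hypothetical, and the format of the `exceptions` entries (regexes matched against the detected value) is an assumption.

```python
import re

# A policy expressed as a plain dict, mirroring the keys used in the
# YAML policy files (name, default_action, thresholds, rules, exceptions).
policy = {
    "name": "my_policy",
    "default_action": "mask",
    "thresholds": {"min_confidence": 0.7},
    "rules": [
        {"match": "email", "action": "mask", "options": {"mask_char": "*"}},
        {"match": "phone", "action": "redact"},
        {"match": "ssn", "action": "hash", "options": {"algorithm": "sha256"}},
    ],
    # Assumed exception format: regexes for values to leave untouched.
    "exceptions": [r"@mycompany\.example$"],
}


def resolve_action(policy: dict, pii_type: str, value: str, confidence: float):
    """Return (action, options), or None when the value is left alone."""
    if confidence < policy["thresholds"]["min_confidence"]:
        return None  # Below the threshold: not treated as PII.
    if any(re.search(pattern, value) for pattern in policy["exceptions"]):
        return None  # Explicitly excluded, e.g. a company email domain.
    for rule in policy["rules"]:
        if rule["match"] == pii_type:
            return rule["action"], rule.get("options", {})
    return policy["default_action"], {}  # No rule matched this PII type.


print(resolve_action(policy, "phone", "555-123-4567", 0.9))
print(resolve_action(policy, "email", "ops@mycompany.example", 0.99))
print(resolve_action(policy, "ip_address", "10.0.0.1", 0.8))
```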
### 🔧 **CLI & SDK**
- **Command Line Interface**: Five main commands (scan, transform, report, diff, policy) for file processing and policy management
- **Python SDK**: High-level NoPIIClient for quick operations and low-level Scanner/Transform classes for fine-grained control
- **Audit Reporting**: JSON audit trails with HTML/Markdown report generation including coverage metrics and PII type breakdowns
- **Coverage Scoring**: Quantitative metrics showing percentage of data scanned and residual risk assessment
## Installation
```bash
pip install nopii
```
The base installation includes core PII detection and transformation capabilities for text files, JSON, and basic CSV processing.
### Optional Dependencies
Install optional extras for extended functionality:
```bash
# Pandas support for DataFrame operations and advanced tabular file formats
# Enables: Excel files, Parquet, advanced CSV operations, column-wise processing
pip install "nopii[pandas]"
# HTML reporting with styled templates and interactive elements
# Enables: Rich HTML reports, charts, detailed PII breakdowns, export options
pip install "nopii[report-html]"
# Install all optional dependencies
pip install "nopii[pandas,report-html]"
```
## Quick Start
### CLI Usage
The CLI provides five main commands for different PII processing workflows:
```bash
# Scan: Detect PII without modifying data
# Outputs findings with confidence scores and locations
nopii scan data.csv --format json --output scan_results.json
# Transform: Remove or mask PII from files
# Creates cleaned data + audit trail of what was changed
nopii transform data.csv transformed_data.csv --audit-report audit.json
# Report: Generate human-readable reports from audit data
# Convert JSON audit logs into HTML/Markdown with charts and summaries
nopii report audit.json --format html --output report.html
# Diff: Compare original vs transformed files
# Shows exactly what PII was detected and how it was changed
nopii diff original.csv transformed.csv
# Policy: Manage detection and transformation rules
# Validate YAML policies or create new ones
nopii policy validate my_policy.yaml
# Create a new policy file
nopii policy create new_policy.yaml --default-action redact
# Note: the CLI is also available as 'no-pii' (alias)
# no-pii scan data.csv --format json
```
Exit codes:
- `0` when no PII is detected
- `1` when PII is found
- Non-zero on errors
### Python SDK / Core
The SDK provides two levels of access: low-level core classes for fine-grained control and a high-level client for quick operations.
#### Core Classes (Low-level API)
Use Scanner and Transform classes directly when you need precise control over detection and transformation:
```python
from nopii.core.scanner import Scanner
from nopii.core.transform import Transform
from nopii.policy.loader import create_default_policy, load_policy
# Load policy (default or custom YAML)
policy = create_default_policy() # or load_policy("policy.yaml")
# Scanner: Detect PII without modifying data
# Returns list of Finding objects with location, confidence, and PII type
scanner = Scanner(policy)
findings = scanner.scan_text("Contact john@example.com or 555-123-4567")
print(f"Found {len(findings)} findings")
# Transform: Apply policy actions (mask, redact, hash) to PII
# Returns tuple of (cleaned_text, findings_list)
transformer = Transform(policy)
transformed_text, findings = transformer.transform_text("Contact john@example.com or 555-123-4567")
print(f"Transformed: {transformed_text}")
# DataFrame operations (requires pandas extra)
import pandas as pd
df = pd.DataFrame({"email": ["john@example.com"], "phone": ["555-123-4567"]})
# Scan entire DataFrame, get detailed results per column
scan_result = scanner.scan_dataframe(df, dataset_name="contacts")
# Transform DataFrame, get cleaned data + comprehensive audit report
df_transformed, audit_report = transformer.transform_dataframe(df, dataset_name="contacts")
print(f"Coverage: {audit_report.coverage_score:.1%}, Risk: {audit_report.residual_risk}")
````
#### High-Level Client (Quick Operations)
Use NoPIIClient for simple, one-line operations with sensible defaults:
```python
from nopii.sdk import NoPIIClient
client = NoPIIClient()
# Scan text
findings = client.scanner.scan_text("Contact john@example.com")
print(f"Found {len(findings)} PII items")
# Transform text
result = client.transform_text("Contact john@example.com")
print(result) # "Contact ****@example.com"
```
### DataFrame Processing
```python
import pandas as pd
from nopii.core.scanner import Scanner
from nopii.core.transform import Transform
from nopii.policy.loader import create_default_policy
policy = create_default_policy()
scanner = Scanner(policy)
transformer = Transform(policy)
# Load and process data
df = pd.read_csv("customer_data.csv")
scan_result = scanner.scan_dataframe(df, dataset_name="customers")
transformed_df, audit = transformer.transform_dataframe(df, dataset_name="customers")
# Review results
print(f"Processed {len(df)} rows, coverage: {audit.coverage_score:.1%}")
print(f"PII types found: {[f.pii_type for f in scan_result.findings]}")
print(f"Columns affected: {len(audit.column_reports)}")
```
### Performance & Streaming
NoPII is designed for efficient processing of large datasets:
**Memory-Efficient Streaming:**
- CLI and SDK automatically stream `.csv` and `.txt/.md` files to avoid loading entire files into memory
- Processes files line-by-line or in configurable chunks (default: 1000 rows)
- Suitable for multi-GB files on standard hardware
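
The chunked approach described above can be sketched with the standard `csv` module. This is not nopii's internal streaming code: `iter_chunks`, `transform_csv_stream`, and the placeholder `transform_cell` callback are hypothetical names, and only the default chunk size of 1000 rows comes from the text.

```python
import csv
import io
from itertools import islice
from typing import Callable, Iterator, TextIO


def iter_chunks(reader, chunk_size: int = 1000) -> Iterator[list]:
    """Yield lists of at most chunk_size rows until the reader is exhausted."""
    while True:
        chunk = list(islice(reader, chunk_size))
        if not chunk:
            return
        yield chunk


def transform_csv_stream(
    src: TextIO,
    dst: TextIO,
    transform_cell: Callable[[str], str],
    chunk_size: int = 1000,
) -> int:
    """Copy src to dst chunk by chunk, transforming each cell.

    Only one chunk of rows is held in memory at a time. Returns the
    number of data rows written (the header is copied unchanged).
    """
    reader = csv.reader(src)
    writer = csv.writer(dst)
    header = next(reader, None)
    if header is None:
        return 0
    writer.writerow(header)
    rows = 0
    for chunk in iter_chunks(reader, chunk_size):
        writer.writerows([transform_cell(cell) for cell in row] for row in chunk)
        rows += len(chunk)
    return rows


# Demo on in-memory text; real use would pass open file handles.
sample = "name,email\nAnn,ann@example.com\nBob,bob@example.com\nCy,cy@example.com\n"
out = io.StringIO()
count = transform_csv_stream(
    io.StringIO(sample),
    out,
    # Placeholder transformation: redact anything that looks like an email.
    lambda cell: "[REDACTED]" if "@" in cell else cell,
    chunk_size=2,
)
print(count)
print(out.getvalue())
```

Peak memory depends on the chunk size rather than the file size, which is what makes this pattern workable for multi-GB inputs.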
**In-Memory Operations:**
- JSON/Parquet files and DataFrame operations require pandas and load data into memory
- Recommended for files under 1GB or when you need full DataFrame functionality
- For very large JSON, consider line-delimited JSON (JSONL) and chunked processing.
- Coverage metrics for streaming scans are computed without a full DataFrame, using policy rules and detected items.
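
For the JSONL suggestion, a minimal sketch of chunked reading is shown below. The `iter_jsonl_chunks` helper is hypothetical and is not part of nopii.

```python
import io
import json
from itertools import islice
from typing import Iterator, TextIO


def iter_jsonl_chunks(stream: TextIO, chunk_size: int = 1000) -> Iterator[list[dict]]:
    """Yield lists of parsed records from a line-delimited JSON stream."""
    # Skip blank lines so trailing newlines do not raise decode errors.
    lines = (line for line in stream if line.strip())
    while True:
        chunk = [json.loads(line) for line in islice(lines, chunk_size)]
        if not chunk:
            return
        yield chunk


# Demo on in-memory text; real use would pass an open .jsonl file.
sample = io.StringIO(
    '{"email": "ann@example.com"}\n'
    '{"email": "bob@example.com"}\n'
    "\n"
    '{"email": "cy@example.com"}\n'
)
chunk_sizes = [len(chunk) for chunk in iter_jsonl_chunks(sample, chunk_size=2)]
print(chunk_sizes)
```

With the pandas extra installed, each chunk could be wrapped in `pd.DataFrame(chunk)` and passed to `transformer.transform_dataframe`, so only one chunk is in memory at a time.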
## Policy Configuration (YAML)
```yaml
name: my_policy
default_action: mask
thresholds:
min_confidence: 0.7
rules:
- match: email
action: mask
options:
mask_char: "*"
- match: phone
action: redact
- match: ssn
action: hash
options:
algorithm: sha256
exceptions: []
```
### Rule Options Validation
Policy rule `options` are validated based on the rule `action`:
- mask
- `mask_char`: string
- `preserve_first`: integer
- `preserve_last`: integer
- hash
- `algorithm`: one of `md5`, `sha1`, `sha256`, `sha512`
- `max_length`: integer
- tokenize
- `deterministic`: boolean
- `token_length`: integer
Invalid or mismatched types will be reported by `PolicyValidator` as errors when loading/validating a policy.
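
The type checks listed above can be pictured with a small standalone validator. This is a sketch and not the `PolicyValidator` implementation: the `validate_rule` function and its error messages are hypothetical, and treating unknown option names as errors is an assumption.

```python
# Option names and types per action, taken from the list above.
OPTION_TYPES = {
    "mask": {"mask_char": str, "preserve_first": int, "preserve_last": int},
    "hash": {"algorithm": str, "max_length": int},
    "tokenize": {"deterministic": bool, "token_length": int},
}
HASH_ALGORITHMS = {"md5", "sha1", "sha256", "sha512"}


def validate_rule(rule: dict) -> list[str]:
    """Return a list of error messages for one policy rule (empty if valid)."""
    errors = []
    action = rule.get("action")
    schema = OPTION_TYPES.get(action, {})
    for key, value in rule.get("options", {}).items():
        expected = schema.get(key)
        if expected is None:
            # Assumption: options not listed for the action are rejected.
            errors.append(f"{action}: unknown option '{key}'")
            continue
        # bool is a subclass of int, so reject True/False for integer options.
        wrong_type = not isinstance(value, expected) or (
            expected is int and isinstance(value, bool)
        )
        if wrong_type:
            errors.append(
                f"{action}.{key}: expected {expected.__name__}, "
                f"got {type(value).__name__}"
            )
        elif key == "algorithm" and value not in HASH_ALGORITHMS:
            errors.append(f"hash.algorithm: unsupported value '{value}'")
    return errors


print(validate_rule({"match": "email", "action": "mask", "options": {"mask_char": "*"}}))
print(validate_rule({"match": "ssn", "action": "hash", "options": {"algorithm": "crc32"}}))
print(validate_rule({"match": "phone", "action": "mask", "options": {"preserve_last": "4"}}))
```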
## Performance
- Streams large CSV/text files to avoid memory issues
- Processes multi-GB files efficiently
- DataFrame operations require pandas (in-memory)
## License
This project is licensed under the Apache License, Version 2.0 - see the [LICENSE](LICENSE) file for details.
## Raw data
{
"_id": null,
"home_page": null,
"name": "nopii",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.12",
"maintainer_email": null,
"keywords": "pii, privacy, data-protection, transformation, compliance, gdpr, ccpa",
"author": "ay-mich",
"author_email": null,
"download_url": "https://files.pythonhosted.org/packages/af/50/730e7cc19826e3171195769b375c1e71b6214fd05a350f5620cd539297f3/nopii-0.1.3.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": null,
"summary": "A batteries-included Python toolkit for detecting, transforming, masking, pseudonymizing, and auditing PII across data engineering workflows",
"version": "0.1.3",
"project_urls": {
"Documentation": "https://ay-mich.github.io/nopii/nopii.html",
"Homepage": "https://github.com/ay-mich/nopii",
"Issues": "https://github.com/ay-mich/nopii/issues",
"Repository": "https://github.com/ay-mich/nopii"
},
"split_keywords": [
"pii",
" privacy",
" data-protection",
" transformation",
" compliance",
" gdpr",
" ccpa"
],
"urls": [
{
"comment_text": null,
"digests": {
"blake2b_256": "1c382ee96bc2a5282ac04afc44ed54d43ef7744be8209259625ea755319c66af",
"md5": "de95a5fb755aa69956b8ce738f70c1a2",
"sha256": "18a0d5e489f294490339eb70b45dff2f6ea994a99a56b563b54751795946e859"
},
"downloads": -1,
"filename": "nopii-0.1.3-py3-none-any.whl",
"has_sig": false,
"md5_digest": "de95a5fb755aa69956b8ce738f70c1a2",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.12",
"size": 78508,
"upload_time": "2025-10-22T04:51:40",
"upload_time_iso_8601": "2025-10-22T04:51:40.393109Z",
"url": "https://files.pythonhosted.org/packages/1c/38/2ee96bc2a5282ac04afc44ed54d43ef7744be8209259625ea755319c66af/nopii-0.1.3-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "af50730e7cc19826e3171195769b375c1e71b6214fd05a350f5620cd539297f3",
"md5": "a1081e774f6e7bea3e4ae612422efb41",
"sha256": "e4b3038426364368a9bfc2e377c93b94537db1a4aaf1fa713fd065d5f9005269"
},
"downloads": -1,
"filename": "nopii-0.1.3.tar.gz",
"has_sig": false,
"md5_digest": "a1081e774f6e7bea3e4ae612422efb41",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.12",
"size": 77286,
"upload_time": "2025-10-22T04:51:42",
"upload_time_iso_8601": "2025-10-22T04:51:42.358489Z",
"url": "https://files.pythonhosted.org/packages/af/50/730e7cc19826e3171195769b375c1e71b6214fd05a350f5620cd539297f3/nopii-0.1.3.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2025-10-22 04:51:42",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "ay-mich",
"github_project": "nopii",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"requirements": [],
"lcname": "nopii"
}