# Data Detector
**data-detector** is a general-purpose engine that detects and masks personal information (mobile phone numbers, social security numbers, email addresses, etc.) by **country and information type**. It is driven by pattern files and can be used as a library or run as a daemon (server).
## Features
- 🌍 **Global Support**: Patterns organized by country (ISO2) and information type
- 🔍 **Detection**: Find PII in text using multiple patterns
- ✅ **Validation**: Validate text against specific patterns with optional verification functions
- 🔒 **Redaction**: Mask, hash, or tokenize sensitive information
- 🚀 **Multiple Interfaces**: Library API, CLI, and HTTP/gRPC server
- ⚡ **High Performance**: p95 < 10ms for 1KB text (single namespace)
- 🔄 **Hot Reload**: Non-disruptive pattern reloading
- 📊 **Observability**: Prometheus metrics and health checks
## Quick Start
### Clone repository
```bash
git clone https://github.com/zafrem/data-detector.git
```
### Installation
```bash
pip install data-detector
```
See [Installation Guide](docs/installation.md) for more options.
### Library Usage
```python
from datadetector import Engine, load_registry
# Load patterns from directory
registry = load_registry(paths=["patterns/"])
engine = Engine(registry)
# Validate
is_valid = engine.validate("010-1234-5678", "kr/mobile_01")
# Find PII (searches all loaded patterns)
results = engine.find("My phone: 01012345678, email: test@example.com")
# Redact
redacted = engine.redact("SSN: 900101-1234567", namespaces=["kr"])
print(redacted.redacted_text)
```
### YAML Pattern Management
Create and manage pattern files programmatically:
```python
from datadetector import PatternFileHandler
# Create a new pattern file
PatternFileHandler.create_pattern_file(
    file_path="custom_patterns.yml",
    namespace="custom",
    description="My custom patterns",
    patterns=[{
        "id": "api_key_01",
        "location": "custom",
        "category": "token",
        "pattern": r"API-[A-Z0-9]{32}",
        "mask": "API-" + "*" * 32,
        "policy": {
            "store_raw": False,
            "action_on_match": "redact",
            "severity": "critical"
        }
    }]
)
# Add, update, or remove patterns
# (new_pattern is assumed to be a dict shaped like the entry above)
PatternFileHandler.add_pattern_to_file("custom_patterns.yml", new_pattern)
PatternFileHandler.update_pattern_in_file("custom_patterns.yml", "api_key_01", {...})
PatternFileHandler.remove_pattern_from_file("custom_patterns.yml", "api_key_01")
```
See [YAML Utilities Documentation](docs/yaml_utilities.md) for the complete guide.
### Pattern Restoration Utility
The `tokens.yml` pattern file may use modified patterns (with `rk_` prefix) during development to avoid triggering GitHub's push protection. Use the restoration utility to convert these back to real-world Stripe API key patterns:
```bash
# After installing via pip
data-detector-restore-tokens
# Or run directly
python restore_tokens.py
# Or as a module
python -m datadetector.restore_tokens
```
**What it does:**
- Converts fake `rk_(live|test)_` patterns to real `[sp]k_(live|test)_` Stripe patterns
- Updates examples to use proper `sk_test_`, `sk_live_`, `pk_test_` prefixes
- Uses obviously fake example keys to avoid secret scanner detection
**Security Note:** All examples use FAKE keys like "EXAMPLEKEY" for security scanner compatibility. This is a defensive security tool - the patterns help detect real leaked credentials.
### CLI Usage
```bash
# Find PII
data-detector find --text "010-1234-5678" --ns kr
# Redact PII
data-detector redact --in input.log --out redacted.log --ns kr us
# Start server
data-detector serve --port 8080
```
### REST API
```bash
# Start server
data-detector serve --port 8080
# Find PII
curl -X POST http://localhost:8080/find \
  -H "Content-Type: application/json" \
  -d '{"text": "010-1234-5678", "namespaces": ["kr"]}'
```
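The same request can be issued from Python with only the standard library. This is a minimal sketch that mirrors the curl call above; the helper names `build_find_request` and `find_pii` are hypothetical, and the response is returned as decoded JSON without assuming anything about its schema.

```python
import json
import urllib.request


def build_find_request(text, namespaces, base_url="http://localhost:8080"):
    """Build a POST request for the /find endpoint used in the curl example."""
    payload = json.dumps({"text": text, "namespaces": namespaces}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/find",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def find_pii(text, namespaces, base_url="http://localhost:8080", timeout=5.0):
    """Send the request and return the decoded JSON body (schema not assumed)."""
    request = build_find_request(text, namespaces, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the server running, `find_pii("010-1234-5678", ["kr"])` sends the same body as the curl command.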
## Documentation
- [Architecture](docs/ARCHITECTURE.md) - System architecture and design overview
- [Quick Start Guide](docs/quickstart.md) - Get started quickly
- [Pattern Structure](docs/patterns.md) - Learn about pattern definitions
- [Custom Patterns](docs/custom-patterns.md) - Create your own patterns
- [YAML Utilities](docs/yaml_utilities.md) - **NEW!** Programmatically create and manage pattern files
- [Verification Functions](docs/verification.md) - Add custom validation logic
- [Configuration](docs/configuration.md) - Server and registry configuration
- [API Reference](docs/api-reference.md) - Complete API documentation
- [Supported Patterns](docs/supported-patterns.md) - Built-in pattern catalog
- [Testing](TESTING.md) - Test suite documentation and coverage
## Supported Pattern Types
- 📱 Phone numbers (KR, US, TW, JP, CN, IN)
- 🆔 National IDs (SSN, RRN, Aadhaar, etc.)
- 📧 Email addresses
- 🏦 Bank accounts & IBANs (with Mod-97 verification)
- 💳 Credit cards (Visa, MasterCard, Amex, etc.)
- 🛂 Passport numbers
- 📍 Physical addresses
- 🌐 IP addresses & URLs
**Total**: 60+ patterns across 7 locations (Common, KR, US, TW, JP, CN, IN)
See [Supported Patterns](docs/supported-patterns.md) for the complete list.
## Verification Functions
Patterns can include verification functions for additional validation beyond regex:
```yaml
- id: iban_01
  category: iban
  pattern: '[A-Z]{2}\d{2}[A-Z0-9]{11,30}'
  verification: iban_mod97  # Validates IBAN checksum
```
Built-in verification functions:
- `iban_mod97` - IBAN Mod-97 checksum validation
- `luhn` - Luhn algorithm for credit cards
You can also register custom verification functions. See [Verification Functions](docs/verification.md) for details.
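To show what the two built-in checks compute, here is a standalone sketch of the Luhn and IBAN Mod-97 algorithms. It is not data-detector's own code: the function names simply reuse the verification identifiers above, and the signatures (a string in, a boolean out) are an assumption.

```python
def luhn(number: str) -> bool:
    """Luhn checksum: double every second digit from the right, sum, test mod 10."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    if len(digits) < 2:
        return False
    total = 0
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9  # same as adding the two digits of the product
        total += digit
    return total % 10 == 0


def iban_mod97(iban: str) -> bool:
    """IBAN check: move the first four characters to the end, map letters
    to numbers (A=10 ... Z=35), and require the result mod 97 to equal 1."""
    compact = iban.replace(" ", "").upper()
    if len(compact) < 5 or not (compact.isascii() and compact.isalnum()):
        return False
    rearranged = compact[4:] + compact[:4]
    # int(ch, 36) maps '0'-'9' to 0-9 and 'A'-'Z' to 10-35.
    numeric = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(numeric) % 97 == 1
```

A regex match that fails its checksum can then be discarded, which is how a verification step cuts false positives on digit strings that merely look like card numbers or IBANs.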
## Performance
- **Latency**: p95 < 10ms for 1KB text with single namespace
- **Throughput**: 500+ RPS on 1 vCPU, 512MB RAM
- **Scalability**: Handles 1k+ patterns and 1k+ concurrent requests
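A latency figure like the p95 above can be checked locally with a small timing harness. The sketch below is only an illustration of the measurement: `p95_latency_ms` is a hypothetical helper, and the workload is a stand-in email regex over roughly 1KB of text rather than the data-detector engine. To measure the engine itself, pass a callable that wraps `engine.find`.

```python
import re
import statistics
import time


def p95_latency_ms(func, payload, runs=200):
    """Return the 95th-percentile wall-clock latency of func(payload) in ms."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        func(payload)
        samples.append((time.perf_counter() - start) * 1000.0)
    # quantiles(n=100) yields 99 cut points; index 94 is the 95th percentile.
    return statistics.quantiles(samples, n=100)[94]


# Stand-in workload (assumption): one email regex over 1024 characters.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
sample_text = ("user@example.com lorem ipsum " * 40)[:1024]
latency = p95_latency_ms(EMAIL.findall, sample_text)
print(f"p95 latency: {latency:.3f} ms")
```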
## Security & Privacy
- No raw PII is logged (only hashes/metadata)
- TLS support for server
- Configurable rate limiting
- Designed to support GDPR/CCPA-compliant operations
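The "hashes/metadata only" logging policy can be sketched as follows. This is an illustration under assumptions, not the project's logging code: the helper names and record fields are hypothetical, and a keyed HMAC is used instead of a plain hash because low-entropy values such as phone numbers are easy to brute-force from an unkeyed digest.

```python
import hashlib
import hmac
import logging

logger = logging.getLogger("pii_audit")


def fingerprint(value: str, key: bytes) -> str:
    """Return a keyed SHA-256 digest so the raw value never reaches the log."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def log_match(pattern_id: str, value: str, start: int, end: int, key: bytes) -> dict:
    """Log a match as metadata plus digest; the raw value is not stored."""
    record = {
        "pattern": pattern_id,
        "span": (start, end),
        "digest": fingerprint(value, key),
    }
    logger.info("pii_match %s", record)
    return record
```

The digest still lets operators correlate repeated occurrences of the same value across log lines without being able to read it.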
## Development
```bash
# Install with dev dependencies
pip install -e ".[dev]"
# Run tests
pytest
# Format code
black src/ tests/
ruff check src/ tests/
# Validate patterns
python -c "from datadetector import load_registry; load_registry(validate_examples=True)"
```
## Docker
```bash
# Build
docker build -t data-detector:latest .
# Run
docker run -p 8080:8080 -v "$(pwd)/patterns:/app/patterns" data-detector:latest
```
## License
MIT License - see [LICENSE](LICENSE) file for details.
## Contributing
Contributions are welcome! Please read our contributing guidelines and submit pull requests.
## Support
- 📖 [Documentation](docs/)
- 🐛 [Issue Tracker](https://github.com/zafrem/data-detector/issues)
- 💬 [Discussions](https://github.com/zafrem/data-detector/discussions)
## Next Steps
- **Pattern Expansion**: Adding patterns for more regions (the EU, the UK, Canada, Australia) and more PII types (social security numbers for those regions, vehicle numbers, driver's license numbers) would broaden coverage. Clearer contribution guidelines would make it easier for the community to add patterns.
- **Web UI/Test Tool**: Text must currently be submitted through the library, the CLI, or the HTTP/gRPC server. A UI where users can enter text and patterns and see the results, such as a web-based demo or a VS Code extension, would improve the user experience.
- **Asynchronous/Streaming API**: An asyncio-based API for high-speed log processing and data pipeline integration, or Kafka/Flink connectors, would make the engine easier to apply to large-scale systems.
- **Automated Pattern Management**: Keeping the pattern catalog in a versioned remote repository and deploying pattern updates automatically would simplify operations. Defining the pattern format strictly as a JSON schema would help prevent errors.
- **Other Language Bindings**: gRPC already allows calls from many languages, but wrapper libraries for Node.js and Java would make adoption easier for developers.
- **Monitoring and Deployment**: Beyond the performance metrics in this README, benchmarks for memory usage and parallel processing in real environments, together with Kubernetes/Helm deployment examples and CI processes, would help operations teams adopt the tool.
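As one illustration of the asynchronous API idea, a blocking scan function can be offloaded to an executor so that an asyncio event loop stays responsive. This is a sketch of the general technique only: `find_sync` is a stand-in regex scan, and none of these names are part of data-detector.

```python
import asyncio
import re
from functools import partial

# Stand-in for a blocking detection call (assumption): a single email regex.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def find_sync(text):
    """Blocking scan; in practice this would wrap a synchronous engine call."""
    return EMAIL.findall(text)


async def find_async(text, executor=None):
    """Run the blocking scan in an executor (default thread pool if None)."""
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(executor, partial(find_sync, text))


async def find_many(texts):
    """Scan several texts concurrently; results keep the input order."""
    return await asyncio.gather(*(find_async(text) for text in texts))


results = asyncio.run(
    find_many(["a@example.com", "no pii here", "b@example.org c@example.net"])
)
```

For CPU-bound regex work, passing a `concurrent.futures.ProcessPoolExecutor` as `executor` may scale better than the default thread pool.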
Raw data
{
"_id": null,
"home_page": null,
"name": "data-detector",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.8",
"maintainer_email": null,
"keywords": "pii, detection, redaction, regex, privacy, gdpr",
"author": null,
"author_email": "zafrem <zafrem@gmail.com>",
"download_url": "https://files.pythonhosted.org/packages/2e/f9/45432abe97a0a9bdfc743ce841f4fcb3d068d8b6ba698593dec5e103e720/data_detector-0.0.1.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "MIT",
"summary": "A general-purpose engine that detects and masks personal information by country and information type",
"version": "0.0.1",
"project_urls": {
"Homepage": "https://github.com/zafrem/data-detector",
"Issues": "https://github.com/zafrem/data-detector/issues",
"Repository": "https://github.com/zafrem/data-detector"
},
"split_keywords": [
"pii",
" detection",
" redaction",
" regex",
" privacy",
" gdpr"
],
"urls": [
{
"comment_text": null,
"digests": {
"blake2b_256": "a8a77368cef73fb9d767503ea7466cdd697051419faf078379034089cef02c6d",
"md5": "c700f50bcd7e57b98f1961f62ecccfcf",
"sha256": "e8c5a1337b2008a2f1b1adc44f1ba0a60ea320047204f832cac6abefac01eff6"
},
"downloads": -1,
"filename": "data_detector-0.0.1-py3-none-any.whl",
"has_sig": false,
"md5_digest": "c700f50bcd7e57b98f1961f62ecccfcf",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.8",
"size": 27284,
"upload_time": "2025-10-11T09:14:44",
"upload_time_iso_8601": "2025-10-11T09:14:44.069448Z",
"url": "https://files.pythonhosted.org/packages/a8/a7/7368cef73fb9d767503ea7466cdd697051419faf078379034089cef02c6d/data_detector-0.0.1-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "2ef945432abe97a0a9bdfc743ce841f4fcb3d068d8b6ba698593dec5e103e720",
"md5": "549906509628c5bfb232768a128b2831",
"sha256": "1cf237bf73bac43c7ccdca74abac04bb48b31785fa40a57458b7394f06979d02"
},
"downloads": -1,
"filename": "data_detector-0.0.1.tar.gz",
"has_sig": false,
"md5_digest": "549906509628c5bfb232768a128b2831",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.8",
"size": 102422,
"upload_time": "2025-10-11T09:14:45",
"upload_time_iso_8601": "2025-10-11T09:14:45.447878Z",
"url": "https://files.pythonhosted.org/packages/2e/f9/45432abe97a0a9bdfc743ce841f4fcb3d068d8b6ba698593dec5e103e720/data_detector-0.0.1.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2025-10-11 09:14:45",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "zafrem",
"github_project": "data-detector",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"requirements": [
{
"name": "pyyaml",
"specs": [
[
">=",
"6.0"
]
]
},
{
"name": "jsonschema",
"specs": [
[
">=",
"4.17.0"
]
]
},
{
"name": "click",
"specs": [
[
">=",
"8.1.0"
]
]
},
{
"name": "fastapi",
"specs": [
[
">=",
"0.104.0"
]
]
},
{
"name": "uvicorn",
"specs": [
[
">=",
"0.24.0"
]
]
},
{
"name": "grpcio",
"specs": [
[
">=",
"1.59.0"
]
]
},
{
"name": "grpcio-tools",
"specs": [
[
">=",
"1.59.0"
]
]
},
{
"name": "prometheus-client",
"specs": [
[
">=",
"0.19.0"
]
]
},
{
"name": "pydantic",
"specs": [
[
">=",
"2.0.0"
]
]
},
{
"name": "python-multipart",
"specs": [
[
">=",
"0.0.6"
]
]
}
],
"lcname": "data-detector"
}