secretsentry


| Field | Value |
|-------|-------|
| Name | secretsentry |
| Version | 2.0.0 |
| Summary | Advanced sensitive data scanner with Jupyter notebook support and intelligent false positive filtering |
| Upload time | 2025-08-08 09:33:09 |
| Requires Python | >=3.7 |
| License | MIT |
| Keywords | security, secrets, scanner, pii, jupyter, notebook, api-keys, credentials, sanitization, privacy, devops, ci-cd, machine-learning, ml, transformers, false-positives |
| Requirements | tqdm, numpy |

# SecretSentry 🛡️

> **The first AI-powered sensitive data scanner built for modern data science and web development workflows**

[![PyPI version](https://badge.fury.io/py/secretsentry.svg)](https://badge.fury.io/py/secretsentry)
[![Python Support](https://img.shields.io/pypi/pyversions/secretsentry.svg)](https://pypi.org/project/secretsentry/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

SecretSentry is an advanced sensitive data scanner that goes beyond traditional secret detection. Built specifically for **Jupyter notebooks**, **web development**, and **data science workflows**, it combines **machine learning** with regex patterns to intelligently filter false positives while detecting API keys, PII, credentials, and other sensitive information.

## 🎯 **Why SecretSentry?**

### **Built for Modern Workflows**
- 🔬 **Jupyter Notebook Specialist**: First scanner designed for `.ipynb` files
- 🤖 **AI-Powered Detection**: Machine learning models reduce false positives by up to 80%
- 🧠 **Smart Context Awareness**: Understands code context, not just pattern matching
- 🌐 **Multi-Environment**: CLI, Jupyter notebooks, and Python scripts
- 🎛️ **Interactive Analysis**: Built-in widgets for exploring findings

### **Comprehensive Detection**
- 🔑 **50+ Built-in Patterns**: API keys, tokens, secrets, credentials
- 👤 **PII Detection**: SSNs, credit cards, phone numbers, emails
- 💰 **Financial Data**: Salary information, bank accounts, routing numbers
- 🌍 **Geographic Data**: Coordinates, IP addresses, postal codes
- 🏥 **Sensitive Categories**: Ethnic data, religious information, medical records

### **Advanced Features**
- 🛡️ **Smart Sanitization**: Context-aware gibberish replacement
- 🤖 **Ensemble Detection**: Combines regex + ML for maximum accuracy
- 📊 **Rich Visualizations**: Charts and statistics (with matplotlib/seaborn)
- 📈 **Pandas Integration**: Export to DataFrames for analysis
- 🎯 **Confidence Scoring**: ML predictions with 0.0-1.0 confidence scores
- 🔄 **CI/CD Ready**: Perfect for automation and pipelines
- 🖥️ **Cross-Platform**: Works on macOS, Windows, and Linux

## 🚀 **Quick Start**

### **Installation**

```bash
# Basic installation (regex-only detection)
pip install secretsentry

# With machine learning capabilities
# (quote the extras so shells such as zsh do not expand the brackets)
pip install "secretsentry[ml]"

# Advanced ML with transformers (best accuracy)
pip install "secretsentry[ml-advanced]"

# Full installation with all features
pip install "secretsentry[full]"

# For Jupyter notebooks only
pip install "secretsentry[jupyter]"
```

### **Basic Usage**

```python
from secretsentry import SecretSentry, quick_scan, quick_ml_scan

# Quick scan with regex detection
scanner = quick_scan("./my_project")

# Quick scan with AI/ML enhancement (recommended)
scanner = quick_ml_scan("./my_project", confidence_threshold=0.7)

# Manual scanning with ML capabilities
scanner = SecretSentry(
    use_ml_detection=True,
    ml_confidence_threshold=0.7,
    ml_ensemble_mode=True  # Combines regex + ML
)
findings = scanner.scan_directory("./my_project")
scanner.display_findings()

# Access ML-specific results
ml_findings = scanner.get_ml_findings()
high_confidence = scanner.get_high_confidence_findings(0.8)

# Sanitize files (creates backups automatically)
stats = scanner.sanitize_files(dry_run=True)  # Preview changes
stats = scanner.sanitize_files()  # Actually sanitize
```

### **Command Line**

```bash
# Basic regex scanning
secretsentry scan ./my_project --display

# AI-enhanced scanning (recommended)
secretsentry scan ./my_project --ml --display

# Quick ML scan with optimal settings
secretsentry scan ./my_project --ml-quick

# ML-only detection with custom confidence
secretsentry scan ./my_project --ml-only --ml-confidence 0.8

# Check ML requirements
secretsentry scan --check-ml

# Export findings with ML metadata
secretsentry scan ./my_project --ml --export findings.json

# Sanitize files (with backup)
secretsentry scan ./my_project --sanitize --dry-run
secretsentry scan ./my_project --sanitize

# List all detection patterns
secretsentry list-patterns
```

## 🤖 **AI-Powered Detection**

SecretSentry's machine learning capabilities provide **context-aware detection** that dramatically reduces false positives:

### **ML Detection Modes**

```python
# Ensemble Mode (recommended): Combines regex + ML
scanner = SecretSentry(
    use_ml_detection=True,
    ml_ensemble_mode=True,
    ml_confidence_threshold=0.7
)

# ML-Only Mode: Pure machine learning detection
scanner = SecretSentry(
    use_ml_detection=True,
    ml_ensemble_mode=False,
    ml_confidence_threshold=0.8
)

# Quick ML scan with optimal settings
scanner = quick_ml_scan("./my_project")
```
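
As a rough illustration of what ensemble mode means conceptually, the sketch below gates regex candidates with a scoring function and keeps only those at or above a threshold. It is not SecretSentry's implementation: `ensemble_detect`, `toy_score`, `PATTERNS`, and the single Stripe-style pattern are hypothetical names made up for this example, and the toy scorer is only a stand-in for a trained model.

```python
import re
from typing import Callable, Dict, List, Tuple

Finding = Tuple[str, str, float]  # (pattern name, matched text, confidence)


def toy_score(candidate: str, line: str) -> float:
    """Hypothetical stand-in for a model: distrust comments and placeholder text."""
    lowered = line.lower()
    if lowered.lstrip().startswith("#") or "placeholder" in lowered:
        return 0.1
    return 0.9


def ensemble_detect(
    line: str,
    patterns: Dict[str, str],
    score_fn: Callable[[str, str], float],
    threshold: float = 0.7,
) -> List[Finding]:
    """Regex proposes candidates; the scorer decides which ones to keep."""
    findings: List[Finding] = []
    for name, regex in patterns.items():
        for match in re.finditer(regex, line):
            confidence = score_fn(match.group(0), line)
            if confidence >= threshold:
                findings.append((name, match.group(0), confidence))
    return findings


# Assumed pattern for illustration only, not taken from the library.
PATTERNS = {"stripe_key": r"sk_live_[0-9a-zA-Z]{16,}"}

print(ensemble_detect('API_KEY = "sk_live_1234567890abcdef"', PATTERNS, toy_score))
print(ensemble_detect("# sk_live_1234567890abcdef", PATTERNS, toy_score))
```

Raising `threshold` trades recall for precision, which is the same trade-off the `ml_confidence_threshold` parameter exposes.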

### **ML Features**

- 🧠 **Context Understanding**: Analyzes surrounding code context, not just patterns
- 📊 **Confidence Scoring**: Every ML detection includes a 0.0-1.0 confidence score
- 🔬 **Feature Extraction**: Text entropy, keyword analysis, pattern recognition
- 🏋️ **Multiple Models**: Logistic Regression, Isolation Forest, optional Transformers
- 💾 **Model Caching**: Trained models cached for faster subsequent scans
- 🖥️ **Local Processing**: All ML inference happens on your machine (no data sent externally)
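
To make "text entropy" and "keyword analysis" concrete, here is a minimal sketch of two such features using only the standard library. The function names and the keyword list are assumptions for illustration; the features SecretSentry actually extracts are not documented here.

```python
import math
from collections import Counter


def shannon_entropy(text: str) -> float:
    """Shannon entropy of `text` in bits per character (0.0 for empty input)."""
    if not text:
        return 0.0
    total = len(text)
    counts = Counter(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def count_keyword_hits(line: str, keywords=("key", "secret", "token", "password")) -> int:
    """Count how many of the (assumed) indicator keywords appear in the line."""
    lowered = line.lower()
    return sum(1 for keyword in keywords if keyword in lowered)


# Random-looking strings tend to score higher than readable placeholders.
print(shannon_entropy("sk_live_xK8mP9nQ4vL7wR2Z"))
print(shannon_entropy("placeholder_for_testing"))
print(count_keyword_hits('API_KEY = "..."'))
```

Entropy alone is a weak signal (base64 image data is also high-entropy), which is why it is usually combined with context features such as nearby keywords.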

### **ML Requirements**

```bash
# Check what's available on your system
secretsentry scan --check-ml

# Install ML dependencies
pip install "secretsentry[ml]"           # Basic ML (scikit-learn)
pip install "secretsentry[ml-advanced]"  # Advanced ML (transformers)
```

## πŸŽ“ **Jupyter Notebook Integration**

SecretSentry shines in Jupyter environments with **zero false positives** from notebook metadata:

```python
# In Jupyter notebook
from secretsentry import quick_scan, quick_ml_scan

# Quick ML scan with visualizations
scanner = quick_ml_scan("./test_data", show_plots=True)

# Interactive exploration with ML metadata
scanner.create_interactive_viewer()

# Data analysis with ML findings
df = scanner.to_dataframe(include_ml_findings=True)
summary = df.groupby(['pattern_type', 'detection_method']).size()

# Analyze confidence scores
ml_df = df[df['detection_method'] == 'ml']
confidence_analysis = ml_df['confidence_score'].describe()
```
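
One plausible way to avoid false positives from notebook metadata is to scan only the source of code cells rather than the raw `.ipynb` JSON. The sketch below shows that idea with the standard `json` module; `iter_code_cell_sources` is a hypothetical helper, not part of SecretSentry's API, and whether the library works this way internally is an assumption.

```python
import json
from typing import Iterator, Tuple


def iter_code_cell_sources(notebook_json: str) -> Iterator[Tuple[int, str]]:
    """Yield (cell index, source text) for each code cell of a v4 notebook."""
    notebook = json.loads(notebook_json)
    for index, cell in enumerate(notebook.get("cells", [])):
        if cell.get("cell_type") != "code":
            continue
        source = cell.get("source", "")
        # The notebook format allows source as a string or a list of lines.
        if isinstance(source, list):
            source = "".join(source)
        yield index, source


SAMPLE = json.dumps({
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Title"]},
        {"cell_type": "code", "metadata": {}, "outputs": [],
         "execution_count": None, "source": ["API_KEY = 'abc'\n", "print(API_KEY)"]},
    ],
})

for index, source in iter_code_cell_sources(SAMPLE):
    print(index, repr(source))
```

Cell outputs can also leak secrets (for example a printed token), so a real scanner would likely inspect `outputs` as well; this sketch leaves that out.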

## 📊 **What Makes It Special**

### **AI-Enhanced Accuracy**

**Traditional regex scanners** flag these as secrets:
```
❌ aws_secret_key: iVBORw0KGgoAAAANSUhEUgAABKYAAAMW...  # Just a PNG image!
❌ api_key: "cell_type": "code"  # Notebook metadata!
❌ secret: #3498db  # CSS color!
❌ token: "placeholder_for_testing"  # Test data!
```

**SecretSentry with ML** understands context and only reports **real secrets**:
```
✅ aws_access_key_id: AKIAIOSFODNN7EXAMPLE (confidence: 0.95)
✅ stripe_key: sk_live_1234567890abcdef123456789 (confidence: 0.89)
✅ database_url: postgresql://user:password@localhost/db (confidence: 0.92)
```

**ML Advantages:**
- 🎯 **Context Awareness**: Understands surrounding code patterns
- 📊 **Confidence Scoring**: Know how certain each detection is
- 🧠 **Learning**: Improves over time with usage patterns
- 🛡️ **Adaptive**: Handles new secret formats without regex updates
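
The false positives listed above can be reasoned about with a few explicit checks. The sketch below is a rule-based stand-in, not the ML model described in this README: it recognizes base64-encoded PNG data by its fixed header, CSS hex colors, and obvious placeholder words. The function name and word list are assumptions for illustration.

```python
import re

# 3-, 6-, or 8-digit CSS hex colors such as #fff or #3498db.
HEX_COLOR = re.compile(r"#(?:[0-9a-fA-F]{3}|[0-9a-fA-F]{6}|[0-9a-fA-F]{8})")

# Assumed placeholder vocabulary; a real project would tune this list.
PLACEHOLDER_WORDS = ("placeholder", "dummy", "changeme")


def looks_like_false_positive(value: str) -> bool:
    """Return True when the value matches a known harmless shape."""
    candidate = value.strip().strip("\"'")
    # Base64 of the 8-byte PNG signature always starts with this prefix.
    if candidate.startswith("iVBORw0KGgo"):
        return True
    if HEX_COLOR.fullmatch(candidate):
        return True
    lowered = candidate.lower()
    return any(word in lowered for word in PLACEHOLDER_WORDS)


for value in ("iVBORw0KGgoAAAANSUhEUgAABKYAAAMW", "#3498db",
              '"placeholder_for_testing"', "sk_live_1234567890abcdef123456789"):
    print(value, looks_like_false_positive(value))
```

Hand-written rules like these are brittle, which is the motivation the README gives for scoring candidates with a model instead.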

### **Smart Sanitization**

SecretSentry doesn't just find secrets; it **fixes them safely**:

```python
# Before sanitization
API_KEY = "sk_live_1234567890abcdef"
employee_ssn = "123-45-6789"
coordinates = "40.7128, -74.0060"

# After sanitization (context-aware gibberish)
API_KEY = "sk_live_xK8mP9nQ4vL7wR2Z"
employee_ssn = "456-78-9123"  
coordinates = "38.8951, -77.0364"
```
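
The before/after pairs above keep the shape of each value (separators, length, the `sk_live_` prefix) while replacing the characters. A minimal sketch of that kind of format-preserving replacement follows; `scramble_preserving_format` and its `keep_prefix` parameter are hypothetical and may differ from how SecretSentry generates its replacements.

```python
import random
import string
from typing import Optional


def scramble_preserving_format(
    value: str, keep_prefix: int = 0, rng: Optional[random.Random] = None
) -> str:
    """Replace digits with digits and letters with letters; keep everything else."""
    rng = rng or random.Random()
    out = []
    for position, ch in enumerate(value):
        if position < keep_prefix:
            out.append(ch)  # keep a recognizable prefix such as "sk_live_"
        elif ch in string.digits:
            out.append(rng.choice(string.digits))
        elif ch in string.ascii_letters:
            out.append(rng.choice(string.ascii_letters))
        else:
            out.append(ch)  # separators like "-", ",", "." stay in place
    return "".join(out)


print(scramble_preserving_format("123-45-6789"))
print(scramble_preserving_format("sk_live_1234567890abcdef", keep_prefix=8))
```

Keeping the format means downstream parsers and tests still accept the sanitized file, at the cost that the result still looks like a credential to other scanners.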

## 🔧 **Advanced Usage**

### **Custom Patterns**

```python
# Add organization-specific patterns
custom_patterns = {
    'employee_id': r'EMP-\d{6}',
    'project_code': r'PROJ-[A-Z]{3}-\d{4}',
    'internal_api': r'internal_key_[a-zA-Z0-9]{32}'
}

scanner = SecretSentry(custom_patterns=custom_patterns)
```
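
Before handing custom patterns to a scanner, it can help to check them against sample text with the standard `re` module. The sketch below does only that and does not call SecretSentry; `match_custom_patterns` and the sample strings are made up for illustration.

```python
import re

custom_patterns = {
    "employee_id": r"EMP-\d{6}",
    "project_code": r"PROJ-[A-Z]{3}-\d{4}",
    "internal_api": r"internal_key_[a-zA-Z0-9]{32}",
}

# Compiling up front surfaces syntax errors (re.error) before any scan starts.
compiled = {name: re.compile(pattern) for name, pattern in custom_patterns.items()}


def match_custom_patterns(text: str):
    """Return (pattern name, matched text) pairs in pattern order."""
    return [
        (name, match.group(0))
        for name, regex in compiled.items()
        for match in regex.finditer(text)
    ]


print(match_custom_patterns("owner EMP-123456 on PROJ-ABC-2024"))
print(match_custom_patterns("EMP-12345 is too short"))
```

Note that `EMP-\d{6}` also matches the first six digits of a longer number such as `EMP-1234567`; adding `\b` at the end of the pattern is one way to require an exact length.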

### **CI/CD Integration**

```python
#!/usr/bin/env python3
# security_check.py
import sys
from secretsentry import SecretSentry

def security_gate():
    # Use ML-enhanced detection for better accuracy in CI/CD
    scanner = SecretSentry(
        use_ml_detection=True,
        ml_ensemble_mode=True,
        ml_confidence_threshold=0.8  # Higher threshold for CI/CD
    )
    findings = scanner.scan_directory(".", show_progress=False)
    
    if findings:
        print(f"❌ SECURITY CHECK FAILED: {len(findings)} secrets found")

        # Show high-confidence ML findings first
        if scanner.use_ml_detection:
            ml_findings = scanner.get_ml_findings()
            high_conf = scanner.get_high_confidence_findings(0.9)
            print(f"🤖 ML Analysis: {len(ml_findings)} ML findings, {len(high_conf)} high confidence")

        scanner.display_findings(max_display=10)
        return 1
    else:
        print("✅ SECURITY CHECK PASSED: No secrets detected")
        return 0

if __name__ == "__main__":
    sys.exit(security_gate())
```

**CI/CD CLI Usage:**
```bash
# Basic CI/CD check
secretsentry scan . --ml --quiet || exit 1

# High-confidence only for sensitive deployments  
secretsentry scan . --ml-only --ml-confidence 0.9 --quiet || exit 1
```

### **Batch Processing**

```python
# Scan multiple projects with ML
from secretsentry import SecretSentry
import os

projects = ["./frontend", "./backend", "./data-science"]
all_results = {}

for project in projects:
    if os.path.exists(project):
        # Use ML for better accuracy across different project types
        scanner = SecretSentry(
            use_ml_detection=True,
            ml_ensemble_mode=True,
            ml_confidence_threshold=0.7
        )
        findings = scanner.scan_directory(project)
        
        # Collect ML statistics
        ml_findings = scanner.get_ml_findings()
        all_results[project] = {
            'total_findings': len(findings),
            'ml_findings': len(ml_findings),
            'high_confidence': len(scanner.get_high_confidence_findings(0.8))
        }
        
        # Export detailed reports with ML metadata
        scanner.export_findings(f"{project.replace('./', '')}_security_report.json")

print("Security Summary:", all_results)
```

## 📈 **Detection Categories**

<details>
<summary><b>🔑 API Keys & Secrets (20+ patterns)</b></summary>

- AWS Access/Secret Keys
- GitHub Tokens (classic & fine-grained)  
- Google API Keys
- Stripe Keys (live & test)
- Slack Tokens & Webhooks
- SendGrid API Keys
- Twilio Keys
- Mailgun Keys
- Azure Storage Keys
- Heroku API Keys
- Generic API patterns

</details>

<details>
<summary><b>💳 Financial Data (8+ patterns)</b></summary>

- Credit Cards (Visa, MasterCard, AmEx, Discover, JCB, Diners)
- Bank Account Numbers
- Routing Numbers  
- IBAN & SWIFT Codes
- Salary Information

</details>

<details>
<summary><b>👤 Personal Information (10+ patterns)</b></summary>

- Social Security Numbers
- Phone Numbers (US & International)
- Email Addresses
- Passport Numbers
- Driver's License Numbers
- Medical Record Numbers

</details>

<details>
<summary><b>🌍 Geographic Data (5+ patterns)</b></summary>

- GPS Coordinates
- IP Addresses (IPv4 & IPv6)
- MAC Addresses  
- ZIP/Postal Codes

</details>

<details>
<summary><b>🏥 Sensitive Personal Data (5+ patterns)</b></summary>

- Ethnic/Racial Categories
- Religious Affiliations  
- Medical Information
- Disability Status

</details>

<details>
<summary><b>🔐 Cryptographic Material (5+ patterns)</b></summary>

- Private Keys (RSA, SSH)
- Public Keys & Certificates
- JWT Tokens
- OAuth Tokens  

</details>
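
For the credit card category, a common way to reduce false positives from arbitrary digit runs is the Luhn checksum, which valid card numbers satisfy. The sketch below implements it; whether SecretSentry applies this check is not stated in this README, and the minimum length of 12 digits is an assumption chosen for the example.

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(ch) for ch in number if ch in "0123456789"]
    if len(digits) < 12:  # assumed lower bound to skip short digit runs
        return False
    total = 0
    # Walk from the rightmost digit; double every second digit.
    for position, digit in enumerate(reversed(digits)):
        if position % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


print(luhn_valid("4111 1111 1111 1111"))  # widely used Visa test number
print(luhn_valid("4111 1111 1111 1112"))
```

A regex finds 16-digit candidates; the checksum then rejects roughly nine out of ten random digit strings of the same shape.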

## 🎛️ **Configuration**

### **Environment Variables**
```bash
# Disable progress bars
export SECRETSENTRY_NO_PROGRESS=1

# Custom config file
export SECRETSENTRY_CONFIG=/path/to/config.json

# ML model cache directory (optional)
export SECRETSENTRY_MODEL_CACHE=/path/to/ml/models

# Force ML detection on/off and set the confidence threshold
export SECRETSENTRY_USE_ML=true
export SECRETSENTRY_ML_CONFIDENCE=0.7
```
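
If you wrap the scanner in your own script, the ML-related variables above can be read and validated before constructing it. This is a sketch under assumptions: the accepted truthy spellings, the defaults, and the `ml_settings_from_env` helper are hypothetical, and the library's own parsing of these variables may differ.

```python
import os
from typing import Mapping, Optional


def ml_settings_from_env(environ: Optional[Mapping[str, str]] = None) -> dict:
    """Translate the environment variables into keyword arguments."""
    env = os.environ if environ is None else environ
    raw_flag = env.get("SECRETSENTRY_USE_ML", "false").strip().lower()
    use_ml = raw_flag in {"1", "true", "yes", "on"}  # assumed truthy spellings
    confidence = float(env.get("SECRETSENTRY_ML_CONFIDENCE", "0.7"))
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("SECRETSENTRY_ML_CONFIDENCE must be between 0.0 and 1.0")
    return {"use_ml_detection": use_ml, "ml_confidence_threshold": confidence}


settings = ml_settings_from_env(
    {"SECRETSENTRY_USE_ML": "true", "SECRETSENTRY_ML_CONFIDENCE": "0.8"}
)
print(settings)
```

The returned keys mirror the `SecretSentry(...)` constructor arguments shown earlier, so the dictionary could be passed as `SecretSentry(**settings)`.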

### **Configuration File**
```json
{
    "excluded_patterns": ["test_", "example_", "demo_"],
    "excluded_files": ["*.test.js", "test_*.py"],
    "excluded_dirs": ["tests", "examples", "docs"],
    "custom_patterns": {
        "company_id": "COMP-\\d{8}"
    },
    "sanitization": {
        "create_backups": true,
        "backup_suffix": ".backup"
    },
    "ml_detection": {
        "enabled": true,
        "confidence_threshold": 0.7,
        "ensemble_mode": true,
        "use_transformers": false,
        "model_cache_dir": "~/.cache/secretsentry/models"
    }
}
```
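
To see how exclusion rules like these might be applied, the sketch below loads a trimmed copy of the config and tests paths against `excluded_dirs` and `excluded_files` using glob matching. The matching semantics (directory names compared per path component, file globs compared against the file name only) are an assumption, and `is_excluded` is a hypothetical helper rather than library code.

```python
import json
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

# Raw string so the JSON escape "\\d" reaches the JSON parser unchanged.
CONFIG_TEXT = r"""
{
    "excluded_files": ["*.test.js", "test_*.py"],
    "excluded_dirs": ["tests", "examples", "docs"],
    "custom_patterns": {"company_id": "COMP-\\d{8}"}
}
"""
config = json.loads(CONFIG_TEXT)


def is_excluded(path: str, config: dict) -> bool:
    """Return True if the path falls under an excluded directory or file glob."""
    parts = PurePosixPath(path).parts
    if not parts:
        return False
    if any(part in config.get("excluded_dirs", []) for part in parts[:-1]):
        return True
    return any(
        fnmatchcase(parts[-1], pattern) for pattern in config.get("excluded_files", [])
    )


for path in ("src/app.py", "src/app.test.js", "tests/helpers.py", "src/test_utils.py"):
    print(path, is_excluded(path, config))

# After JSON decoding, the doubled backslash becomes the regex \d.
print(config["custom_patterns"]["company_id"])
```

The last line shows why regexes in the JSON file need doubled backslashes: JSON consumes one level of escaping before the pattern reaches the regex engine.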

## ⚡ **Performance & Requirements**

### **ML Performance**

| Detection Mode | Speed | Accuracy | Memory Usage | Dependencies |
|---------------|-------|----------|--------------|--------------|
| **Regex Only** | ⚡⚡⚡⚡⚡ | ✅✅✅ | 🟢 Low | Minimal |
| **ML Basic** | ⚡⚡⚡⚡ | ✅✅✅✅ | 🟡 Medium | scikit-learn |
| **ML Advanced** | ⚡⚡⚡ | ✅✅✅✅✅ | 🔴 High | transformers |

### **System Requirements**

**Minimum (Regex-only):**
- Python 3.7+
- 50MB RAM
- Any CPU

**Recommended (ML Basic):**
- Python 3.8+
- 512MB RAM
- 2+ CPU cores
- 200MB disk space

**Optimal (ML Advanced):**  
- Python 3.9+
- 2GB+ RAM
- 4+ CPU cores
- 1GB disk space

### **Installation Time**

```bash
pip install secretsentry              # ~30 seconds
pip install "secretsentry[ml]"          # ~2 minutes
pip install "secretsentry[ml-advanced]" # ~5 minutes (downloads models)
```

### **First Run Performance**

- **Regex detection**: Instant
- **ML Basic**: ~30 seconds (model training on first run)
- **ML Advanced**: ~2 minutes (model download + training)
- **Subsequent runs**: Fast (models cached)

## 🤝 **Contributing**

We welcome contributions! Here's how to get started:

```bash
# Clone the repository
git clone https://github.com/y2ee201/secretsentry.git
cd secretsentry

# Install development dependencies (includes ML dependencies)
pip install -e ".[full]"
pip install pytest black flake8

# Run tests (includes ML tests)
pytest tests/

# Test ML functionality specifically
python test_ml_detection.py

# Format code
black secretsentry/
flake8 secretsentry/
```

## 📝 **License**

MIT License - see [LICENSE](LICENSE) file for details.

## 🙏 **Acknowledgments**

- Inspired by [detect-secrets](https://github.com/Yelp/detect-secrets) and [truffleHog](https://github.com/dxa4481/truffleHog)
- ML capabilities powered by [scikit-learn](https://scikit-learn.org/) and [Transformers](https://huggingface.co/transformers/)
- Built for the data science and security communities
- Special thanks to all contributors and the open source community
- Grateful to the broader AI/ML community for advancing secret detection research

## 📞 **Support**

- 📖 **Documentation**: [Full docs](https://github.com/y2ee201/secretsentry#readme)
- 🐛 **Issues**: [Report bugs](https://github.com/y2ee201/secretsentry/issues)
- 💬 **Discussions**: [Community forum](https://github.com/y2ee201/secretsentry/discussions)
- 📧 **Contact**: abdul.jilani@evolveailabs.com

---

**SecretSentry** - *Standing guard over your sensitive data* 🛡️

            

Raw data

            {
    "_id": null,
    "home_page": null,
    "name": "secretsentry",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.7",
    "maintainer_email": null,
    "keywords": "security, secrets, scanner, pii, jupyter, notebook, api-keys, credentials, sanitization, privacy, devops, ci-cd, machine-learning, ml, transformers, false-positives",
    "author": null,
    "author_email": "Abdul Jilani <abdul.jilani@evolveailabs.com>",
    "download_url": "https://files.pythonhosted.org/packages/e3/5f/169dfd0f2098cc4bb455aad8a3e53fa13689c2aa48ce95a87734d7623465/secretsentry-2.0.0.tar.gz",
    "platform": null,
    "description": "# SecretSentry \ud83d\udee1\ufe0f\n\n> **The first AI-powered sensitive data scanner built for modern data science and web development workflows**\n\n[![PyPI version](https://badge.fury.io/py/secretsentry.svg)](https://badge.fury.io/py/secretsentry)\n[![Python Support](https://img.shields.io/pypi/pyversions/secretsentry.svg)](https://pypi.org/project/secretsentry/)\n[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)\n\nSecretSentry is an advanced sensitive data scanner that goes beyond traditional secret detection. Built specifically for **Jupyter notebooks**, **web development**, and **data science workflows**, it combines **machine learning** with regex patterns to intelligently filter false positives while detecting API keys, PII, credentials, and other sensitive information.\n\n## \ud83c\udfaf **Why SecretSentry?**\n\n### **Built for Modern Workflows**\n- \ud83d\udd2c **Jupyter Notebook Specialist**: First scanner designed for `.ipynb` files\n- \ud83e\udd16 **AI-Powered Detection**: Machine learning models reduce false positives by up to 80%\n- \ud83e\udde0 **Smart Context Awareness**: Understands code context, not just pattern matching\n- \ud83c\udf10 **Multi-Environment**: CLI, Jupyter notebooks, and Python scripts\n- \ud83c\udf9b\ufe0f **Interactive Analysis**: Built-in widgets for exploring findings\n\n### **Comprehensive Detection**\n- \ud83d\udd11 **50+ Built-in Patterns**: API keys, tokens, secrets, credentials\n- \ud83d\udc64 **PII Detection**: SSNs, credit cards, phone numbers, emails\n- \ud83d\udcb0 **Financial Data**: Salary information, bank accounts, routing numbers\n- \ud83c\udf0d **Geographic Data**: Coordinates, IP addresses, postal codes\n- \ud83c\udfe5 **Sensitive Categories**: Ethnic data, religious information, medical records\n\n### **Advanced Features**\n- \ud83d\udee1\ufe0f **Smart Sanitization**: Context-aware gibberish replacement\n- \ud83e\udd16 **Ensemble 
Detection**: Combines regex + ML for maximum accuracy\n- \ud83d\udcca **Rich Visualizations**: Charts and statistics (with matplotlib/seaborn)\n- \ud83d\udcc8 **Pandas Integration**: Export to DataFrames for analysis\n- \ud83c\udfaf **Confidence Scoring**: ML predictions with 0.0-1.0 confidence scores\n- \ud83d\udd04 **CI/CD Ready**: Perfect for automation and pipelines\n- \ud83d\udda5\ufe0f **Cross-Platform**: Works on macOS, Windows, and Linux\n\n## \ud83d\ude80 **Quick Start**\n\n### **Installation**\n\n```bash\n# Basic installation (regex-only detection)\npip install secretsentry\n\n# With machine learning capabilities\npip install secretsentry[ml]\n\n# Advanced ML with transformers (best accuracy)\npip install secretsentry[ml-advanced]\n\n# Full installation with all features\npip install secretsentry[full]\n\n# For Jupyter notebooks only\npip install secretsentry[jupyter]\n```\n\n### **Basic Usage**\n\n```python\nfrom secretsentry import SecretSentry, quick_scan, quick_ml_scan\n\n# Quick scan with regex detection\nscanner = quick_scan(\"./my_project\")\n\n# Quick scan with AI/ML enhancement (recommended)\nscanner = quick_ml_scan(\"./my_project\", confidence_threshold=0.7)\n\n# Manual scanning with ML capabilities\nscanner = SecretSentry(\n    use_ml_detection=True,\n    ml_confidence_threshold=0.7,\n    ml_ensemble_mode=True  # Combines regex + ML\n)\nfindings = scanner.scan_directory(\"./my_project\")\nscanner.display_findings()\n\n# Access ML-specific results\nml_findings = scanner.get_ml_findings()\nhigh_confidence = scanner.get_high_confidence_findings(0.8)\n\n# Sanitize files (creates backups automatically)\nstats = scanner.sanitize_files(dry_run=True)  # Preview changes\nstats = scanner.sanitize_files()  # Actually sanitize\n```\n\n### **Command Line**\n\n```bash\n# Basic regex scanning\nsecretsentry scan ./my_project --display\n\n# AI-enhanced scanning (recommended)\nsecretsentry scan ./my_project --ml --display\n\n# Quick ML scan with optimal 
settings\nsecretsentry scan ./my_project --ml-quick\n\n# ML-only detection with custom confidence\nsecretsentry scan ./my_project --ml-only --ml-confidence 0.8\n\n# Check ML requirements\nsecretsentry scan --check-ml\n\n# Export findings with ML metadata\nsecretsentry scan ./my_project --ml --export findings.json\n\n# Sanitize files (with backup)\nsecretsentry scan ./my_project --sanitize --dry-run\nsecretsentry scan ./my_project --sanitize\n\n# List all detection patterns\nsecretsentry list-patterns\n```\n\n## \ud83e\udd16 **AI-Powered Detection**\n\nSecretSentry's machine learning capabilities provide **context-aware detection** that dramatically reduces false positives:\n\n### **ML Detection Modes**\n\n```python\n# Ensemble Mode (recommended): Combines regex + ML\nscanner = SecretSentry(\n    use_ml_detection=True,\n    ml_ensemble_mode=True,\n    ml_confidence_threshold=0.7\n)\n\n# ML-Only Mode: Pure machine learning detection\nscanner = SecretSentry(\n    use_ml_detection=True,\n    ml_ensemble_mode=False,\n    ml_confidence_threshold=0.8\n)\n\n# Quick ML scan with optimal settings\nscanner = quick_ml_scan(\"./my_project\")\n```\n\n### **ML Features**\n\n- \ud83e\udde0 **Context Understanding**: Analyzes surrounding code context, not just patterns\n- \ud83d\udcca **Confidence Scoring**: Every ML detection includes a 0.0-1.0 confidence score  \n- \ud83d\udd2c **Feature Extraction**: Text entropy, keyword analysis, pattern recognition\n- \ud83c\udfcb\ufe0f **Multiple Models**: Logistic Regression, Isolation Forest, optional Transformers\n- \ud83d\udcbe **Model Caching**: Trained models cached for faster subsequent scans\n- \ud83d\udda5\ufe0f **Local Processing**: All ML inference happens on your machine (no data sent externally)\n\n### **ML Requirements**\n\n```bash\n# Check what's available on your system\nsecretsentry scan --check-ml\n\n# Install ML dependencies\npip install secretsentry[ml]           # Basic ML (scikit-learn)\npip install 
secretsentry[ml-advanced]  # Advanced ML (transformers)\n```\n\n## \ud83c\udf93 **Jupyter Notebook Integration**\n\nSecretSentry shines in Jupyter environments with **zero false positives** from notebook metadata:\n\n```python\n# In Jupyter notebook\nfrom secretsentry import quick_scan, quick_ml_scan\n\n# Quick ML scan with visualizations\nscanner = quick_ml_scan(\"./test_data\", show_plots=True)\n\n# Interactive exploration with ML metadata\nscanner.create_interactive_viewer()\n\n# Data analysis with ML findings\ndf = scanner.to_dataframe(include_ml_findings=True)\nsummary = df.groupby(['pattern_type', 'detection_method']).size()\n\n# Analyze confidence scores\nml_df = df[df['detection_method'] == 'ml']\nconfidence_analysis = ml_df['confidence_score'].describe()\n```\n\n## \ud83d\udcca **What Makes It Special**\n\n### **AI-Enhanced Accuracy**\n\n**Traditional regex scanners** flag these as secrets:\n```\n\u274c aws_secret_key: iVBORw0KGgoAAAANSUhEUgAABKYAAAMW...  # Just a PNG image!\n\u274c api_key: \"cell_type\": \"code\"  # Notebook metadata!  
\n\u274c secret: #3498db  # CSS color!\n\u274c token: \"placeholder_for_testing\"  # Test data!\n```\n\n**SecretSentry with ML** understands context and only reports **real secrets**:\n```\n\u2705 aws_secret_key: AKIAIOSFODNN7EXAMPLE (confidence: 0.95)\n\u2705 stripe_key: sk_live_1234567890abcdef123456789 (confidence: 0.89)  \n\u2705 database_url: postgresql://user:password@localhost/db (confidence: 0.92)\n```\n\n**ML Advantages:**\n- \ud83c\udfaf **Context Awareness**: Understands surrounding code patterns\n- \ud83d\udcca **Confidence Scoring**: Know how certain each detection is\n- \ud83e\udde0 **Learning**: Improves over time with usage patterns\n- \ud83d\udee1\ufe0f **Adaptive**: Handles new secret formats without regex updates\n\n### **Smart Sanitization**\n\nSecretSentry doesn't just find secrets\u2014it **fixes them safely**:\n\n```python\n# Before sanitization\nAPI_KEY = \"sk_live_1234567890abcdef\"\nemployee_ssn = \"123-45-6789\"\ncoordinates = \"40.7128, -74.0060\"\n\n# After sanitization (context-aware gibberish)\nAPI_KEY = \"sk_live_xK8mP9nQ4vL7wR2Z\"\nemployee_ssn = \"456-78-9123\"  \ncoordinates = \"38.8951, -77.0364\"\n```\n\n## \ud83d\udd27 **Advanced Usage**\n\n### **Custom Patterns**\n\n```python\n# Add organization-specific patterns\ncustom_patterns = {\n    'employee_id': r'EMP-\\d{6}',\n    'project_code': r'PROJ-[A-Z]{3}-\\d{4}',\n    'internal_api': r'internal_key_[a-zA-Z0-9]{32}'\n}\n\nscanner = SecretSentry(custom_patterns=custom_patterns)\n```\n\n### **CI/CD Integration**\n\n```python\n#!/usr/bin/env python3\n# security_check.py\nimport sys\nfrom secretsentry import SecretSentry\n\ndef security_gate():\n    # Use ML-enhanced detection for better accuracy in CI/CD\n    scanner = SecretSentry(\n        use_ml_detection=True,\n        ml_ensemble_mode=True,\n        ml_confidence_threshold=0.8  # Higher threshold for CI/CD\n    )\n    findings = scanner.scan_directory(\".\", show_progress=False)\n    \n    if findings:\n        
print(f\"\u274c SECURITY CHECK FAILED: {len(findings)} secrets found\")\n        \n        # Show high-confidence ML findings first\n        if scanner.use_ml_detection:\n            ml_findings = scanner.get_ml_findings()\n            high_conf = scanner.get_high_confidence_findings(0.9)\n            print(f\"\ud83e\udd16 ML Analysis: {len(ml_findings)} ML findings, {len(high_conf)} high confidence\")\n        \n        scanner.display_findings(max_display=10)\n        return 1\n    else:\n        print(\"\u2705 SECURITY CHECK PASSED: No secrets detected\")\n        return 0\n\nif __name__ == \"__main__\":\n    sys.exit(security_gate())\n```\n\n**CI/CD CLI Usage:**\n```bash\n# Basic CI/CD check\nsecretsentry scan . --ml --quiet || exit 1\n\n# High-confidence only for sensitive deployments  \nsecretsentry scan . --ml-only --ml-confidence 0.9 --quiet || exit 1\n```\n\n### **Batch Processing**\n\n```python\n# Scan multiple projects with ML\nfrom secretsentry import SecretSentry\nimport os\n\nprojects = [\"./frontend\", \"./backend\", \"./data-science\"]\nall_results = {}\n\nfor project in projects:\n    if os.path.exists(project):\n        # Use ML for better accuracy across different project types\n        scanner = SecretSentry(\n            use_ml_detection=True,\n            ml_ensemble_mode=True,\n            ml_confidence_threshold=0.7\n        )\n        findings = scanner.scan_directory(project)\n        \n        # Collect ML statistics\n        ml_findings = scanner.get_ml_findings()\n        all_results[project] = {\n            'total_findings': len(findings),\n            'ml_findings': len(ml_findings),\n            'high_confidence': len(scanner.get_high_confidence_findings(0.8))\n        }\n        \n        # Export detailed reports with ML metadata\n        scanner.export_findings(f\"{project.replace('./', '')}_security_report.json\")\n\nprint(\"Security Summary:\", all_results)\n```\n\n## \ud83d\udcc8 **Detection 
Categories**\n\n<details>\n<summary><b>\ud83d\udd11 API Keys & Secrets (20+ patterns)</b></summary>\n\n- AWS Access/Secret Keys\n- GitHub Tokens (classic & fine-grained)  \n- Google API Keys\n- Stripe Keys (live & test)\n- Slack Tokens & Webhooks\n- SendGrid API Keys\n- Twilio Keys\n- Mailgun Keys\n- Azure Storage Keys\n- Heroku API Keys\n- Generic API patterns\n\n</details>\n\n<details>\n<summary><b>\ud83d\udcb3 Financial Data (8+ patterns)</b></summary>\n\n- Credit Cards (Visa, MasterCard, AmEx, Discover, JCB, Diners)\n- Bank Account Numbers\n- Routing Numbers  \n- IBAN & SWIFT Codes\n- Salary Information\n\n</details>\n\n<details>\n<summary><b>\ud83d\udc64 Personal Information (10+ patterns)</b></summary>\n\n- Social Security Numbers\n- Phone Numbers (US & International)\n- Email Addresses\n- Passport Numbers\n- Driver's License Numbers\n- Medical Record Numbers\n\n</details>\n\n<details>\n<summary><b>\ud83c\udf0d Geographic Data (5+ patterns)</b></summary>\n\n- GPS Coordinates\n- IP Addresses (IPv4 & IPv6)\n- MAC Addresses  \n- ZIP/Postal Codes\n\n</details>\n\n<details>\n<summary><b>\ud83c\udfe5 Sensitive Personal Data (5+ patterns)</b></summary>\n\n- Ethnic/Racial Categories\n- Religious Affiliations  \n- Medical Information\n- Disability Status\n\n</details>\n\n<details>\n<summary><b>\ud83d\udd10 Cryptographic Material (5+ patterns)</b></summary>\n\n- Private Keys (RSA, SSH)\n- Public Keys & Certificates\n- JWT Tokens\n- OAuth Tokens  \n\n</details>\n\n## \ud83c\udf9b\ufe0f **Configuration**\n\n### **Environment Variables**\n```bash\n# Disable progress bars\nexport SECRETSENTRY_NO_PROGRESS=1\n\n# Custom config file\nexport SECRETSENTRY_CONFIG=/path/to/config.json\n\n# ML model cache directory (optional)\nexport SECRETSENTRY_MODEL_CACHE=/path/to/ml/models\n\n# Force ML detection on/off\nexport SECRETSENTRY_USE_ML=true\nexport SECRETSENTRY_ML_CONFIDENCE=0.7\n```\n\n### **Configuration File**\n```json\n{\n    \"excluded_patterns\": [\"test_\", \"example_\", 
\"demo_\"],\n    \"excluded_files\": [\"*.test.js\", \"test_*.py\"],\n    \"excluded_dirs\": [\"tests\", \"examples\", \"docs\"],\n    \"custom_patterns\": {\n        \"company_id\": \"COMP-\\\\d{8}\"\n    },\n    \"sanitization\": {\n        \"create_backups\": true,\n        \"backup_suffix\": \".backup\"\n    },\n    \"ml_detection\": {\n        \"enabled\": true,\n        \"confidence_threshold\": 0.7,\n        \"ensemble_mode\": true,\n        \"use_transformers\": false,\n        \"model_cache_dir\": \"~/.cache/secretsentry/models\"\n    }\n}\n```\n\n## \u26a1 **Performance & Requirements**\n\n### **ML Performance**\n\n| Detection Mode | Speed | Accuracy | Memory Usage | Dependencies |\n|---------------|-------|----------|--------------|--------------|\n| **Regex Only** | \u26a1\u26a1\u26a1\u26a1\u26a1 | \u2705\u2705\u2705 | \ud83d\udfe2 Low | Minimal |\n| **ML Basic** | \u26a1\u26a1\u26a1\u26a1 | \u2705\u2705\u2705\u2705 | \ud83d\udfe1 Medium | scikit-learn |\n| **ML Advanced** | \u26a1\u26a1\u26a1 | \u2705\u2705\u2705\u2705\u2705 | \ud83d\udd34 High | transformers |\n\n### **System Requirements**\n\n**Minimum (Regex-only):**\n- Python 3.7+\n- 50MB RAM\n- Any CPU\n\n**Recommended (ML Basic):**\n- Python 3.8+\n- 512MB RAM\n- 2+ CPU cores\n- 200MB disk space\n\n**Optimal (ML Advanced):**  \n- Python 3.9+\n- 2GB+ RAM\n- 4+ CPU cores\n- 1GB disk space\n\n### **Installation Time**\n\n```bash\npip install secretsentry              # ~30 seconds\npip install secretsentry[ml]          # ~2 minutes  \npip install secretsentry[ml-advanced] # ~5 minutes (downloads models)\n```\n\n### **First Run Performance**\n\n- **Regex detection**: Instant\n- **ML Basic**: ~30 seconds (model training on first run)\n- **ML Advanced**: ~2 minutes (model download + training)\n- **Subsequent runs**: Fast (models cached)\n\n## \ud83e\udd1d **Contributing**\n\nWe welcome contributions! 
Here's how to get started:\n\n```bash\n# Clone the repository\ngit clone https://github.com/yourusername/secretsentry.git\ncd secretsentry\n\n# Install development dependencies (includes ML dependencies)\npip install -e \".[full]\"\npip install pytest black flake8\n\n# Run tests (includes ML tests)\npytest tests/\n\n# Test ML functionality specifically\npython test_ml_detection.py\n\n# Format code\nblack secretsentry/\nflake8 secretsentry/\n```\n\n## \ud83d\udcdd **License**\n\nMIT License - see [LICENSE](LICENSE) file for details.\n\n## \ud83d\ude4f **Acknowledgments**\n\n- Inspired by [detect-secrets](https://github.com/Yelp/detect-secrets) and [truffleHog](https://github.com/dxa4481/truffleHog)\n- ML capabilities powered by [scikit-learn](https://scikit-learn.org/) and [Transformers](https://huggingface.co/transformers/)\n- Built for the data science and security communities\n- Special thanks to all contributors and the open source community\n- Grateful to the broader AI/ML community for advancing secret detection research\n\n## \ud83d\udcde **Support**\n\n- \ud83d\udcd6 **Documentation**: [Full docs](https://github.com/yourusername/secretsentry#readme)\n- \ud83d\udc1b **Issues**: [Report bugs](https://github.com/yourusername/secretsentry/issues)\n- \ud83d\udcac **Discussions**: [Community forum](https://github.com/yourusername/secretsentry/discussions)\n- \ud83d\udce7 **Contact**: your.email@example.com\n\n---\n\n**SecretSentry** - *Standing guard over your sensitive data* \ud83d\udee1\ufe0f\n",
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "Advanced sensitive data scanner with Jupyter notebook support and intelligent false positive filtering",
    "version": "2.0.0",
    "project_urls": {
        "Bug Tracker": "https://github.com/y2ee201/secretsentry/issues",
        "Documentation": "https://github.com/y2ee201/secretsentry#readme",
        "Homepage": "https://github.com/y2ee201/secretsentry",
        "Repository": "https://github.com/y2ee201/secretsentry.git"
    },
    "split_keywords": [
        "security",
        " secrets",
        " scanner",
        " pii",
        " jupyter",
        " notebook",
        " api-keys",
        " credentials",
        " sanitization",
        " privacy",
        " devops",
        " ci-cd",
        " machine-learning",
        " ml",
        " transformers",
        " false-positives"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "3ee51c5d8ac3855e28950b82926a58afd934f698ecf1ef07068f17d1058d02cb",
                "md5": "c78b67e5d08b3d0f31dbe3ad6b5a6bf6",
                "sha256": "694f2b82f26d10096b4e2cb0c1322e075c10a696bd65abac8b804cc2b5cdc708"
            },
            "downloads": -1,
            "filename": "secretsentry-2.0.0-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "c78b67e5d08b3d0f31dbe3ad6b5a6bf6",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.7",
            "size": 32338,
            "upload_time": "2025-08-08T09:33:08",
            "upload_time_iso_8601": "2025-08-08T09:33:08.309878Z",
            "url": "https://files.pythonhosted.org/packages/3e/e5/1c5d8ac3855e28950b82926a58afd934f698ecf1ef07068f17d1058d02cb/secretsentry-2.0.0-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "e35f169dfd0f2098cc4bb455aad8a3e53fa13689c2aa48ce95a87734d7623465",
                "md5": "ed28722cc75ad03a61bec12fc689e384",
                "sha256": "556070798c435564ee69e210dc27482b71aa776114f24df027b334ba174b0f69"
            },
            "downloads": -1,
            "filename": "secretsentry-2.0.0.tar.gz",
            "has_sig": false,
            "md5_digest": "ed28722cc75ad03a61bec12fc689e384",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.7",
            "size": 40041,
            "upload_time": "2025-08-08T09:33:09",
            "upload_time_iso_8601": "2025-08-08T09:33:09.676078Z",
            "url": "https://files.pythonhosted.org/packages/e3/5f/169dfd0f2098cc4bb455aad8a3e53fa13689c2aa48ce95a87734d7623465/secretsentry-2.0.0.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-08-08 09:33:09",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "y2ee201",
    "github_project": "secretsentry",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "requirements": [
        {
            "name": "tqdm",
            "specs": [
                [
                    ">=",
                    "4.62.0"
                ]
            ]
        },
        {
            "name": "numpy",
            "specs": [
                [
                    ">=",
                    "1.19.0"
                ]
            ]
        }
    ],
    "lcname": "secretsentry"
}
        