synthetic-data-mcp


| Field | Value |
|-------|-------|
| Name | synthetic-data-mcp |
| Version | 0.1.1 |
| Summary | Domain-specific synthetic data generation MCP server for healthcare and finance compliance |
| Upload time | 2025-09-18 22:39:32 |
| Requires Python | >=3.10 |
| License | MIT |
| Keywords | synthetic-data, mcp, healthcare, finance, compliance, privacy |
| Requirements | fastmcp, dspy-ai, pydantic, numpy, pandas, scipy, scikit-learn, cryptography, python-dateutil, faker, opendp, sqlalchemy, alembic, pyjwt, httpx, requests, loguru, typer, rich, pytest, pytest-asyncio, gunicorn, uvicorn, fastapi, prometheus-client, structlog, asyncpg, aiomysql, motor, aioredis, psycopg2-binary, google-cloud-bigquery, google-cloud-bigquery-storage, google-oauth2-tool, snowflake-connector-python, boto3, aiofiles, pymongo |

# Synthetic Data MCP Server

Enterprise-grade Model Context Protocol (MCP) server for generating privacy-compliant synthetic datasets. Built for regulated industries that require HIPAA, PCI DSS, SOX, and GDPR compliance, with support for multiple LLM providers.

## 🚀 Features
<img align="right" width="300" height="300" alt="synthetic-data-mcp" src="https://github.com/user-attachments/assets/46620579-0933-4d55-82e9-3700c75fe566" />

### Core Capabilities
- **Privacy-First Local Inference**: Ollama integration for 100% local data generation
- **Domain-Specific Generation**: Specialized synthetic data for healthcare and finance
- **Privacy Protection**: Differential privacy, k-anonymity, l-diversity
- **PII Safety Guarantee**: Never retains or outputs original personal data
- **Compliance Validation**: HIPAA, PCI DSS, SOX, GDPR compliance checking
- **Statistical Fidelity**: Advanced validation to ensure data utility
- **Audit Trail**: Comprehensive logging for regulatory compliance
- **Multi-Provider Support**: Ollama (default), OpenAI, Anthropic, Google, OpenRouter

### LLM Provider Support (2025 Models)
- **OpenAI**: GPT-5, GPT-5 Mini/Nano, GPT-4o
- **Anthropic**: Claude Opus 4.1, Claude Sonnet 4 (1M context), Claude 3.5 series
- **Google**: Gemini 2.5 Pro/Flash/Flash-Lite (1M+ context, multimodal)
- **Local Models**: Dynamic Ollama integration (Llama 3.3, Qwen 2.5/3, DeepSeek-R1, Mistral Small 3)
- **Smart Routing**: Automatic provider selection with cost optimization
- **Fallback**: Multi-tier fallback with local model support

### Technology Stack (2025 Latest)
- **FastAPI 0.116+**: High-performance async web framework
- **FastMCP**: High-performance MCP server implementation
- **Pydantic 2.11+**: Type-safe data validation with enhanced performance
- **SQLAlchemy 2.0+**: Modern async ORM with type safety
- **DSPy**: Language model programming framework for intelligent data generation
- **NumPy 2.3+ & Pandas 2.3+**: Advanced data processing capabilities
- **Redis & DiskCache**: Multi-tier caching for cost optimization
- **Rich**: Beautiful terminal interfaces and progress indicators

## 🎯 Enterprise Benefits

- **Privacy-First**: Generate synthetic data without exposing sensitive information
- **Compliance-Ready**: Built-in validation for HIPAA, PCI DSS, SOX, and GDPR
- **Multi-Provider**: Support for cloud APIs and local inference
- **Production-Scale**: High-performance generation for enterprise data volumes
- **Zero Vendor Lock-in**: Switch between providers seamlessly
- **Cost Control**: Use local models for unlimited generation

## 🏥 Healthcare Use Cases

- Patient record synthesis with HIPAA Safe Harbor compliance
- Clinical trial data generation for FDA submissions
- Medical research datasets without PHI exposure
- Drug discovery data augmentation
- Healthcare analytics and ML model training
- EHR system testing and validation

## 💰 Finance Use Cases

- Transaction pattern modeling for fraud detection
- Credit risk assessment dataset generation
- Regulatory stress testing data (Basel III, Dodd-Frank)
- PCI DSS compliant payment data synthesis
- Trading algorithm development and backtesting
- Financial reporting system validation

## 🛠️ Installation

### Production Installation
```bash
pip install synthetic-data-mcp
```

### Development Installation
```bash
git clone https://github.com/marc-shade/synthetic-data-mcp
cd synthetic-data-mcp
pip install -e ".[dev,healthcare,finance]"
```

## 🎯 Quick Start

### 1. Configure LLM Provider

Choose your preferred provider:

#### OpenAI (Recommended for Production)
```bash
export OPENAI_API_KEY="sk-your-key-here"
```

#### Anthropic Claude
```bash
export ANTHROPIC_API_KEY="sk-ant-your-key-here"
```

#### Google Gemini
```bash
export GOOGLE_API_KEY="your-key-here"
```

#### OpenRouter (Access to 100+ Models)
```bash
export OPENROUTER_API_KEY="sk-or-your-key-here"
export OPENROUTER_MODEL="meta-llama/llama-3.1-8b-instruct"
```

#### Local Models (Ollama) - Privacy-First (DEFAULT)
```bash
# Install Ollama first: https://ollama.ai
ollama pull mistral-small:latest  # Or any preferred model
export OLLAMA_BASE_URL="http://localhost:11434"
export OLLAMA_MODEL="mistral-small:latest"

# The system automatically detects and uses Ollama if available
# No API keys required for local inference!
```

### 2. Start the MCP Server

```bash
synthetic-data-mcp serve --port 3000
```

### 3. Add to Claude Desktop Configuration

```json
{
  "mcpServers": {
    "synthetic-data": {
      "command": "python",
      "args": ["-m", "synthetic_data_mcp.server"],
      "env": {
        "OPENAI_API_KEY": "your-api-key"
      }
    }
  }
}
```

### 4. Generate Synthetic Data

```python
# Using an already-connected MCP client session (`client`); run inside an async function
result = await client.call_tool(
    "generate_synthetic_dataset",
    {
        "domain": "healthcare",
        "dataset_type": "patient_records",
        "record_count": 10000,
        "privacy_level": "high",
        "compliance_frameworks": ["hipaa"],
        "output_format": "json"
    }
)
```

## 🏗️ Provider Configuration

### Priority-Based Provider Selection

The system automatically selects the first available provider in this priority order:

1. **Local Models (Ollama)** - Highest privacy, no API costs
2. **OpenAI** - Best performance and reliability
3. **Anthropic Claude** - Excellent reasoning capabilities
4. **Google Gemini** - Fast and cost-effective
5. **OpenRouter** - Access to open source models
6. **Fallback Mock** - Testing and development

### Provider-Specific Configuration

#### OpenAI Configuration
```bash
# Environment variables
export OPENAI_API_KEY="sk-your-key-here"
export OPENAI_MODEL="gpt-4"  # or gpt-4-turbo, gpt-3.5-turbo
export OPENAI_TEMPERATURE="0.7"
export OPENAI_MAX_TOKENS="2000"
```

#### Anthropic Configuration
```bash
# Environment variables
export ANTHROPIC_API_KEY="sk-ant-your-key-here"
export ANTHROPIC_MODEL="claude-3-opus-20240229"  # or claude-3-sonnet, claude-3-haiku
export ANTHROPIC_MAX_TOKENS="2000"
```

#### Local Ollama Configuration
```bash
# Environment variables
export OLLAMA_BASE_URL="http://localhost:11434"
export OLLAMA_MODEL="llama3.1:8b"  # or any installed model

# Supported local models:
# - llama3.1:8b, llama3.1:70b
# - mistral:7b, mixtral:8x7b
# - qwen2:7b, deepseek-coder:6.7b
# - and 20+ more models
```

## 🔧 Available MCP Tools

### `generate_synthetic_dataset`
Generate domain-specific synthetic datasets with compliance validation.

**Parameters:**
- `domain`: Healthcare, finance, or custom
- `dataset_type`: Patient records, transactions, clinical trials, etc.
- `record_count`: Number of synthetic records to generate
- `privacy_level`: Privacy protection level (low/medium/high/maximum)
- `compliance_frameworks`: Required compliance validations
- `output_format`: JSON, CSV, Parquet, or database export
- `provider`: Override automatic provider selection

### `validate_dataset_compliance`
Validate existing datasets against regulatory requirements.

### `analyze_privacy_risk`
Comprehensive privacy risk assessment for datasets.
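
One common ingredient of such an assessment is k-anonymity: the size of the smallest group of records that share the same quasi-identifier values. The sketch below is a generic illustration of that measure, not the tool's implementation. The function names and the sample columns (`age_band`, `zip3`) are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Sequence


def group_sizes(records: List[Dict[str, object]], quasi_identifiers: Sequence[str]) -> Counter:
    """Count how many records share each combination of quasi-identifier values."""
    return Counter(tuple(record[column] for column in quasi_identifiers) for record in records)


def k_anonymity(records: List[Dict[str, object]], quasi_identifiers: Sequence[str]) -> int:
    """Return k, the size of the smallest quasi-identifier group (0 for no records)."""
    if not records:
        return 0
    return min(group_sizes(records, quasi_identifiers).values())


def unique_record_share(records: List[Dict[str, object]], quasi_identifiers: Sequence[str]) -> float:
    """Fraction of records that are alone in their group, a rough re-identification signal."""
    if not records:
        return 0.0
    sizes = group_sizes(records, quasi_identifiers)
    return sum(1 for size in sizes.values() if size == 1) / len(records)


records = [
    {"age_band": "30-39", "zip3": "941", "diagnosis": "A"},
    {"age_band": "30-39", "zip3": "941", "diagnosis": "B"},
    {"age_band": "40-49", "zip3": "941", "diagnosis": "A"},
]
print(k_anonymity(records, ["age_band", "zip3"]))
print(unique_record_share(records, ["age_band", "zip3"]))
```

A dataset with k = 1 contains at least one record that is unique on its quasi-identifiers, which is usually the first thing a risk report flags.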

### `generate_domain_schema`
Create Pydantic schemas for domain-specific data structures.

### `benchmark_synthetic_data`
Performance and utility benchmarking against real data.

## 📋 Compliance Frameworks

### Healthcare Compliance
- **HIPAA Safe Harbor**: Automatic validation of 18 identifiers
- **HIPAA Expert Determination**: Statistical disclosure control
- **FDA Guidance**: Synthetic clinical data for submissions
- **GDPR**: Healthcare data processing compliance
- **HITECH**: Security and breach notification

### Finance Compliance
- **PCI DSS**: Payment card industry data security
- **SOX**: Sarbanes-Oxley internal controls
- **Basel III**: Banking regulatory framework
- **MiFID II**: Markets in Financial Instruments Directive
- **Dodd-Frank**: Financial reform regulations

## 🔒 Privacy Protection

### Core Privacy Features
- **Differential Privacy**: Configurable ε values (0.1-1.0)
- **Statistical Disclosure Control**: k-anonymity, l-diversity, t-closeness
- **Synthetic Data Indistinguishability**: Provable privacy guarantees
- **Re-identification Risk Assessment**: Continuous monitoring
- **Privacy Budget Management**: Automatic composition tracking
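
To make the ε and privacy-budget items concrete, here is a minimal sketch of the Laplace mechanism for a counting query, with budget tracking under basic sequential composition (spent ε values add up). It uses only the standard library and is not the package's implementation. The names `PrivacyBudget`, `laplace_noise`, and `private_count` are hypothetical, and a naive floating-point sampler like this one is for illustration only; a vetted library such as OpenDP is the safer choice for real releases.

```python
import random


class PrivacyBudget:
    """Tracks cumulative epsilon under basic sequential composition."""

    def __init__(self, total_epsilon: float) -> None:
        self.total_epsilon = total_epsilon
        self.spent = 0.0

    def spend(self, epsilon: float) -> None:
        if epsilon <= 0:
            raise ValueError("epsilon must be positive")
        if self.spent + epsilon > self.total_epsilon + 1e-12:
            raise RuntimeError("privacy budget exhausted")
        self.spent += epsilon


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two exponentials with mean `scale`."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(true_count: int, epsilon: float, budget: PrivacyBudget, rng: random.Random) -> float:
    """Release a count with epsilon-differential privacy via the Laplace mechanism."""
    budget.spend(epsilon)
    sensitivity = 1.0  # adding or removing one record changes a count by at most 1
    return true_count + laplace_noise(sensitivity / epsilon, rng)


budget = PrivacyBudget(total_epsilon=1.0)
rng = random.Random(0)  # seeded only so the demo is repeatable
noisy = private_count(1000, epsilon=0.5, budget=budget, rng=rng)
print(f"noisy count: {noisy:.2f}, epsilon spent: {budget.spent}")
```

Smaller ε means a larger noise scale (`sensitivity / epsilon`), so stronger privacy costs accuracy. The budget object stops further queries once the total is used up.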

### PII Protection Guarantee
- **NO Data Retention**: Original personal data is NEVER stored
- **Automatic PII Detection**: Identifies names, emails, SSNs, phones, addresses, credit cards
- **Complete Anonymization**: All PII is anonymized before pattern learning
- **Statistical Learning Only**: Only learns distributions, means, and frequencies
- **100% Synthetic Output**: Generated data is completely fake
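
As a rough picture of the detection and anonymization steps, the sketch below finds and redacts three PII types with regular expressions. The patterns, the names `detect_pii` and `redact`, and the placeholder format are assumptions for illustration. They cover only simple US-style formats, and the package's own detector is likely more thorough.

```python
import re
from typing import Dict, List

# Deliberately simple patterns; real detectors handle many more formats.
PII_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}


def detect_pii(text: str) -> Dict[str, List[str]]:
    """Return every match found in `text`, grouped by PII type."""
    found: Dict[str, List[str]] = {}
    for label, pattern in PII_PATTERNS.items():
        matches = pattern.findall(text)
        if matches:
            found[label] = matches
    return found


def redact(text: str) -> str:
    """Replace each detected value with a placeholder such as [EMAIL]."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label.upper()}]", text)
    return text


sample = "Contact jane@example.com or 555-867-5309, SSN 123-45-6789."
print(detect_pii(sample))
print(redact(sample))
```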

### Credit Card Safety
- **Test Card Numbers Only**: Uses official test cards (4242-4242-4242-4242, etc.)
- **Provider Support**: Visa, Mastercard, AmEx, Discover, and more
- **Configurable Providers**: Specify provider or use weighted distribution
- **Never Real Cards**: Original credit card numbers are never retained or output

Example usage with credit card provider selection:
```python
# `pipeline` is an already-initialized ingestion pipeline (construction not shown here)
# Use specific provider test cards
result = await pipeline.ingest(
    source=data,
    credit_card_provider='visa'  # Uses Visa test cards
)

# Or let system use mixed providers (default)
result = await pipeline.ingest(
    source=data  # Automatically uses weighted distribution
)
```
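
Test card numbers such as 4242-4242-4242-4242 still pass the Luhn checksum, so they work in systems that validate card format. The sketch below shows the checksum and a weighted provider choice. The table of non-Visa test numbers (widely published by payment processors), the weights, and the function names are assumptions for illustration, not values taken from the package.

```python
import random
from typing import Optional

# The Visa number appears in this README; the others are commonly published
# test numbers and are assumed here for illustration.
TEST_CARDS = {
    "visa": "4242424242424242",
    "mastercard": "5555555555554444",
    "amex": "378282246310005",
    "discover": "6011111111111117",
}

# Hypothetical weights for the "weighted distribution" default.
PROVIDER_WEIGHTS = {"visa": 0.5, "mastercard": 0.3, "amex": 0.1, "discover": 0.1}


def luhn_valid(number: str) -> bool:
    """Check the Luhn checksum, ignoring separators such as dashes or spaces."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    total = 0
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:  # double every second digit, counting from the right
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return bool(digits) and total % 10 == 0


def pick_test_card(provider: Optional[str] = None, rng: Optional[random.Random] = None) -> str:
    """Return a test card for `provider`, or draw a provider by weight when none is given."""
    rng = rng or random.Random()
    if provider is None:
        names = list(PROVIDER_WEIGHTS)
        weights = [PROVIDER_WEIGHTS[name] for name in names]
        provider = rng.choices(names, weights=weights, k=1)[0]
    return TEST_CARDS[provider]


print(pick_test_card("visa"), luhn_valid("4242-4242-4242-4242"))
```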

## 📊 Performance & Quality

- **Statistical Fidelity**: 95%+ correlation preservation
- **Privacy Preservation**: <1% re-identification risk
- **Utility Preservation**: >90% ML model performance
- **Compliance Rate**: 100% regulatory framework adherence
- **Generation Speed**: 1,000-10,000 records/second (provider dependent)

### Provider Performance Comparison

| Provider | Speed (req/s) | Quality | Privacy | Cost |
|----------|---------------|---------|---------|------|
| Ollama Local | 10-50 | High | Maximum | Free |
| OpenAI GPT-4 | 20-100 | Excellent | Medium | $$$ |
| Claude 3 Opus | 15-80 | Excellent | Medium | $$$ |
| Gemini Pro | 50-200 | Good | Medium | $ |
| OpenRouter | 10-100 | Variable | Medium | $ |

## 🧪 Testing

```bash
# Run all tests
pytest

# Run compliance tests only
pytest -m compliance

# Run privacy tests
pytest -m privacy

# Run with coverage
pytest --cov=synthetic_data_mcp --cov-report=html

# Test specific provider
OPENAI_API_KEY=sk-test pytest -m integration
```

## 🚀 Deployment

### Docker Deployment
```bash
docker build -t synthetic-data-mcp .
docker run -p 3000:3000 \
  -e OPENAI_API_KEY=your-key \
  synthetic-data-mcp
```

### Kubernetes Deployment
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: synthetic-data-mcp
spec:
  replicas: 3
  selector:
    matchLabels:
      app: synthetic-data-mcp
  template:
    metadata:
      labels:
        app: synthetic-data-mcp
    spec:
      containers:
      - name: synthetic-data-mcp
        image: synthetic-data-mcp:latest
        ports:
        - containerPort: 3000
        env:
        - name: OPENAI_API_KEY
          valueFrom:
            secretKeyRef:
              name: llm-secrets
              key: openai-key
```

## 🔧 Development

### Code Quality
```bash
# Format code
black .
isort .

# Run linting
flake8 src tests

# Type checking
mypy src
```

### Adding New Providers

1. Create provider module in `src/synthetic_data_mcp/providers/`
2. Implement DSPy LM interface
3. Add configuration in `core/generator.py`
4. Add tests in `tests/test_providers.py`

## 📚 Examples

### Healthcare Example
```python
import asyncio
from synthetic_data_mcp import SyntheticDataGenerator

async def generate_patients():
    generator = SyntheticDataGenerator()
    
    result = await generator.generate_dataset(
        domain="healthcare",
        dataset_type="patient_records",
        record_count=1000,
        privacy_level="high",
        compliance_frameworks=["hipaa"]
    )
    
    print(f"Generated {len(result['dataset'])} patient records")
    return result

# Run the example
asyncio.run(generate_patients())
```

### Finance Example
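```python
import asyncio
from synthetic_data_mcp import SyntheticDataGenerator

async def generate_transactions():
    generator = SyntheticDataGenerator()
    
    # Request 50,000 synthetic transactions validated against PCI DSS
    result = await generator.generate_dataset(
        domain="finance",
        dataset_type="transactions",
        record_count=50000,
        privacy_level="high",
        compliance_frameworks=["pci_dss"]
    )
    
    print(f"Generated {len(result['dataset'])} transactions")
    return result

# Run the example
asyncio.run(generate_transactions())
```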

## 🤝 Contributing

We welcome contributions! Please see our [Contributing Guide](CONTRIBUTING.md) for details.

### Development Setup
```bash
git clone https://github.com/marc-shade/synthetic-data-mcp
cd synthetic-data-mcp
python -m venv .venv
source .venv/bin/activate
pip install -e ".[dev,healthcare,finance]"
pre-commit install
```

## 📄 License

MIT License - see [LICENSE](LICENSE) file for details.

## 🆘 Support

- [GitHub Issues](https://github.com/marc-shade/synthetic-data-mcp/issues)
- [GitHub Discussions](https://github.com/marc-shade/synthetic-data-mcp/discussions)
- Email: support@2acrestudios.com

## 🔗 Related Projects

- [Model Context Protocol (MCP)](https://modelcontextprotocol.io/)
- [DSPy Framework](https://dspy-docs.vercel.app/)
- [Ollama](https://ollama.ai/) - Local LLM inference
- [OpenRouter](https://openrouter.ai/) - Access to 100+ models

---

Built with ❤️ for enterprise developers who need compliant, privacy-preserving synthetic data generation.

            

Raw data

{
    "_id": null,
    "home_page": null,
    "name": "synthetic-data-mcp",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.10",
    "maintainer_email": null,
    "keywords": "synthetic-data, mcp, healthcare, finance, compliance, privacy",
    "author": null,
    "author_email": "Marc Shade <marc@2acrestudios.com>",
    "download_url": "https://files.pythonhosted.org/packages/ea/47/0b30752660c8caaeaf9530ae312adc8e5f3f1a64d6f756608b5e86ed53c8/synthetic_data_mcp-0.1.1.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "Domain-specific synthetic data generation MCP server for healthcare and finance compliance",
    "version": "0.1.1",
    "project_urls": {
        "Documentation": "https://synthetic-data-mcp.readthedocs.io",
        "Homepage": "https://github.com/marc-shade/synthetic-data-mcp",
        "Issues": "https://github.com/marc-shade/synthetic-data-mcp/issues",
        "Repository": "https://github.com/marc-shade/synthetic-data-mcp"
    },
    "split_keywords": [
        "synthetic-data",
        " mcp",
        " healthcare",
        " finance",
        " compliance",
        " privacy"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "4c76366c659b1af1938e4279c4a34d194e3dffa049ea32a17fef97ade2a677c4",
                "md5": "e38fbc72e67458ab3362c11779e0ac22",
                "sha256": "c09fb0523cac4aa921b530b6559ee9a1a52aff8f3abf41f2697141147c26dfe1"
            },
            "downloads": -1,
            "filename": "synthetic_data_mcp-0.1.1-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "e38fbc72e67458ab3362c11779e0ac22",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.10",
            "size": 222363,
            "upload_time": "2025-09-18T22:39:30",
            "upload_time_iso_8601": "2025-09-18T22:39:30.745147Z",
            "url": "https://files.pythonhosted.org/packages/4c/76/366c659b1af1938e4279c4a34d194e3dffa049ea32a17fef97ade2a677c4/synthetic_data_mcp-0.1.1-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "ea470b30752660c8caaeaf9530ae312adc8e5f3f1a64d6f756608b5e86ed53c8",
                "md5": "8a24e27c6d3f74053c2ef85a92efbd21",
                "sha256": "cf043f1f11b5870eceead26b8834c271ed811daa1662498a4a2fa7103d101789"
            },
            "downloads": -1,
            "filename": "synthetic_data_mcp-0.1.1.tar.gz",
            "has_sig": false,
            "md5_digest": "8a24e27c6d3f74053c2ef85a92efbd21",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.10",
            "size": 280732,
            "upload_time": "2025-09-18T22:39:32",
            "upload_time_iso_8601": "2025-09-18T22:39:32.196601Z",
            "url": "https://files.pythonhosted.org/packages/ea/47/0b30752660c8caaeaf9530ae312adc8e5f3f1a64d6f756608b5e86ed53c8/synthetic_data_mcp-0.1.1.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-09-18 22:39:32",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "marc-shade",
    "github_project": "synthetic-data-mcp",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "requirements": [
        {
            "name": "fastmcp",
            "specs": [
                [
                    ">=",
                    "0.1.0"
                ]
            ]
        },
        {
            "name": "dspy-ai",
            "specs": [
                [
                    ">=",
                    "2.4.0"
                ]
            ]
        },
        {
            "name": "pydantic",
            "specs": [
                [
                    ">=",
                    "2.5.0"
                ]
            ]
        },
        {
            "name": "numpy",
            "specs": [
                [
                    ">=",
                    "1.24.0"
                ]
            ]
        },
        {
            "name": "pandas",
            "specs": [
                [
                    ">=",
                    "2.0.0"
                ]
            ]
        },
        {
            "name": "scipy",
            "specs": [
                [
                    ">=",
                    "1.10.0"
                ]
            ]
        },
        {
            "name": "scikit-learn",
            "specs": [
                [
                    ">=",
                    "1.3.0"
                ]
            ]
        },
        {
            "name": "cryptography",
            "specs": [
                [
                    ">=",
                    "41.0.0"
                ]
            ]
        },
        {
            "name": "python-dateutil",
            "specs": [
                [
                    ">=",
                    "2.8.0"
                ]
            ]
        },
        {
            "name": "faker",
            "specs": [
                [
                    ">=",
                    "19.0.0"
                ]
            ]
        },
        {
            "name": "opendp",
            "specs": [
                [
                    ">=",
                    "0.8.0"
                ]
            ]
        },
        {
            "name": "sqlalchemy",
            "specs": [
                [
                    ">=",
                    "2.0.0"
                ]
            ]
        },
        {
            "name": "alembic",
            "specs": [
                [
                    ">=",
                    "1.12.0"
                ]
            ]
        },
        {
            "name": "pyjwt",
            "specs": [
                [
                    ">=",
                    "2.8.0"
                ]
            ]
        },
        {
            "name": "httpx",
            "specs": [
                [
                    ">=",
                    "0.25.0"
                ]
            ]
        },
        {
            "name": "requests",
            "specs": [
                [
                    ">=",
                    "2.31.0"
                ]
            ]
        },
        {
            "name": "loguru",
            "specs": [
                [
                    ">=",
                    "0.7.0"
                ]
            ]
        },
        {
            "name": "typer",
            "specs": [
                [
                    ">=",
                    "0.9.0"
                ]
            ]
        },
        {
            "name": "rich",
            "specs": [
                [
                    ">=",
                    "13.0.0"
                ]
            ]
        },
        {
            "name": "pytest",
            "specs": [
                [
                    ">=",
                    "7.4.0"
                ]
            ]
        },
        {
            "name": "pytest-asyncio",
            "specs": [
                [
                    ">=",
                    "0.21.0"
                ]
            ]
        },
        {
            "name": "gunicorn",
            "specs": [
                [
                    ">=",
                    "21.2.0"
                ]
            ]
        },
        {
            "name": "uvicorn",
            "specs": [
                [
                    ">=",
                    "0.24.0"
                ]
            ]
        },
        {
            "name": "fastapi",
            "specs": [
                [
                    ">=",
                    "0.104.0"
                ]
            ]
        },
        {
            "name": "prometheus-client",
            "specs": [
                [
                    ">=",
                    "0.19.0"
                ]
            ]
        },
        {
            "name": "structlog",
            "specs": [
                [
                    ">=",
                    "23.2.0"
                ]
            ]
        },
        {
            "name": "asyncpg",
            "specs": [
                [
                    ">=",
                    "0.29.0"
                ]
            ]
        },
        {
            "name": "aiomysql",
            "specs": [
                [
                    ">=",
                    "0.2.0"
                ]
            ]
        },
        {
            "name": "motor",
            "specs": [
                [
                    ">=",
                    "3.3.0"
                ]
            ]
        },
        {
            "name": "aioredis",
            "specs": [
                [
                    ">=",
                    "2.0.0"
                ]
            ]
        },
        {
            "name": "psycopg2-binary",
            "specs": [
                [
                    ">=",
                    "2.9.0"
                ]
            ]
        },
        {
            "name": "google-cloud-bigquery",
            "specs": [
                [
                    ">=",
                    "3.13.0"
                ]
            ]
        },
        {
            "name": "google-cloud-bigquery-storage",
            "specs": [
                [
                    ">=",
                    "2.22.0"
                ]
            ]
        },
        {
            "name": "google-oauth2-tool",
            "specs": [
                [
                    ">=",
                    "0.0.3"
                ]
            ]
        },
        {
            "name": "snowflake-connector-python",
            "specs": [
                [
                    ">=",
                    "3.5.0"
                ]
            ]
        },
        {
            "name": "boto3",
            "specs": [
                [
                    ">=",
                    "1.34.0"
                ]
            ]
        },
        {
            "name": "aiofiles",
            "specs": [
                [
                    ">=",
                    "23.2.0"
                ]
            ]
        },
        {
            "name": "pymongo",
            "specs": [
                [
                    ">=",
                    "4.6.0"
                ]
            ]
        }
    ],
    "lcname": "synthetic-data-mcp"
}
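
Each entry in the `requirements` array above stores a dependency as a `name` plus a list of `[operator, version]` pairs under `specs`. As a rough illustration, the sketch below turns such entries into pip-style specifier strings. The helper name `to_specifier` is hypothetical, and the sample list copies only three entries from the metadata; it assumes every entry has the same shape as those shown.

```python
# Three entries copied from the "requirements" array in the metadata above.
requirements = [
    {"name": "fastmcp", "specs": [[">=", "0.1.0"]]},
    {"name": "pydantic", "specs": [[">=", "2.5.0"]]},
    {"name": "opendp", "specs": [[">=", "0.8.0"]]},
]


def to_specifier(entry):
    """Hypothetical helper: join a package name with its version constraints.

    Each item in entry["specs"] is an [operator, version] pair; multiple
    pairs are joined with commas, as in a requirements.txt line.
    """
    constraints = ",".join(op + version for op, version in entry["specs"])
    return entry["name"] + constraints


specifiers = [to_specifier(entry) for entry in requirements]

for line in specifiers:
    print(line)
```

Since every constraint listed in the metadata is a lower bound (`>=`), strings produced this way give only minimum versions, not pinned ones.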
        