# Synthetic Data MCP Server
Enterprise-grade Model Context Protocol (MCP) server for generating privacy-compliant synthetic datasets. Built for regulated industries requiring HIPAA, PCI DSS, SOX, and GDPR compliance with multiple LLM provider support.
## ๐ Features
<img align="right" width="300" height="300" alt="synthetic-data-mcp" src="https://github.com/user-attachments/assets/46620579-0933-4d55-82e9-3700c75fe566" />
### Core Capabilities
- **Privacy-First Local Inference**: Ollama integration for 100% local data generation
- **Domain-Specific Generation**: Specialized synthetic data for healthcare and finance
- **Privacy Protection**: Differential privacy, k-anonymity, l-diversity
- **PII Safety Guarantee**: Never retains or outputs original personal data
- **Compliance Validation**: HIPAA, PCI DSS, SOX, GDPR compliance checking
- **Statistical Fidelity**: Advanced validation to ensure data utility
- **Audit Trail**: Comprehensive logging for regulatory compliance
- **Multi-Provider Support**: Ollama (default), OpenAI, Anthropic, Google, OpenRouter
### LLM Provider Support (2025 Models)
- **OpenAI**: GPT-5, GPT-5 Mini/Nano, GPT-4o
- **Anthropic**: Claude Opus 4.1, Claude Sonnet 4 (1M context), Claude 3.5 series
- **Google**: Gemini 2.5 Pro/Flash/Flash-Lite (1M+ context, multimodal)
- **Local Models**: Dynamic Ollama integration (Llama 3.3, Qwen 2.5/3, DeepSeek-R1, Mistral Small 3)
- **Smart Routing**: Automatic provider selection with cost optimization
- **Fallback**: Multi-tier fallback with local model support
### Technology Stack (2025 Latest)
- **FastAPI 0.116+**: High-performance async web framework
- **FastMCP**: High-performance MCP server implementation
- **Pydantic 2.11+**: Type-safe data validation with enhanced performance
- **SQLAlchemy 2.0+**: Modern async ORM with type safety
- **DSPy**: Language model programming framework for intelligent data generation
- **NumPy 2.3+ & Pandas 2.3+**: Advanced data processing capabilities
- **Redis & DiskCache**: Multi-tier caching for cost optimization
- **Rich**: Beautiful terminal interfaces and progress indicators
## ๐ฏ Enterprise Benefits
- **Privacy-First**: Generate synthetic data without exposing sensitive information
- **Compliance-Ready**: Built-in validation for HIPAA, PCI DSS, SOX, and GDPR
- **Multi-Provider**: Support for cloud APIs and local inference
- **Production-Scale**: High-performance generation for enterprise data volumes
- **Zero Vendor Lock-in**: Switch between providers seamlessly
- **Cost Control**: Use local models for unlimited generation
## ๐ฅ Healthcare Use Cases
- Patient record synthesis with HIPAA Safe Harbor compliance
- Clinical trial data generation for FDA submissions
- Medical research datasets without PHI exposure
- Drug discovery data augmentation
- Healthcare analytics and ML model training
- EHR system testing and validation
## ๐ฐ Finance Use Cases
- Transaction pattern modeling for fraud detection
- Credit risk assessment dataset generation
- Regulatory stress testing data (Basel III, Dodd-Frank)
- PCI DSS compliant payment data synthesis
- Trading algorithm development and backtesting
- Financial reporting system validation
## ๐ ๏ธ Installation
### Production Installation
```bash
pip install synthetic-data-mcp
```
### Development Installation
```bash
git clone https://github.com/marc-shade/synthetic-data-mcp
cd synthetic-data-mcp
pip install -e ".[dev,healthcare,finance]"
```
## ๐ฏ Quick Start
### 1. Configure LLM Provider
Choose your preferred provider:
#### OpenAI (Recommended for Production)
```bash
export OPENAI_API_KEY="sk-your-key-here"
```
#### Anthropic Claude
```bash
export ANTHROPIC_API_KEY="sk-ant-your-key-here"
```
#### Google Gemini
```bash
export GOOGLE_API_KEY="your-key-here"
```
#### OpenRouter (Access to 100+ Models)
```bash
export OPENROUTER_API_KEY="sk-or-your-key-here"
export OPENROUTER_MODEL="meta-llama/llama-3.1-8b-instruct"
```
#### Local Models (Ollama) - Privacy-First (DEFAULT)
```bash
# Install Ollama first: https://ollama.ai
ollama pull mistral-small:latest # Or any preferred model
export OLLAMA_BASE_URL="http://localhost:11434"
export OLLAMA_MODEL="mistral-small:latest"
# The system automatically detects and uses Ollama if available
# No API keys required for local inference!
```
### 2. Start the MCP Server
```bash
synthetic-data-mcp serve --port 3000
```
### 3. Add to Claude Desktop Configuration
```json
{
"mcpServers": {
"synthetic-data": {
"command": "python",
"args": ["-m", "synthetic_data_mcp.server"],
"env": {
"OPENAI_API_KEY": "your-api-key"
}
}
}
}
```
### 4. Generate Synthetic Data
```python
# Using the MCP client
result = await client.call_tool(
"generate_synthetic_dataset",
{
"domain": "healthcare",
"dataset_type": "patient_records",
"record_count": 10000,
"privacy_level": "high",
"compliance_frameworks": ["hipaa"],
"output_format": "json"
}
)
```
## ๐๏ธ Provider Configuration
### Priority-Based Provider Selection
The system automatically selects the best available provider:
1. **Local Models (Ollama)** - Highest privacy, no API costs
2. **OpenAI** - Best performance and reliability
3. **Anthropic Claude** - Excellent reasoning capabilities
4. **Google Gemini** - Fast and cost-effective
5. **OpenRouter** - Access to open source models
6. **Fallback Mock** - Testing and development
### Provider-Specific Configuration
#### OpenAI Configuration
```python
# Environment variables
OPENAI_API_KEY="sk-your-key-here"
OPENAI_MODEL="gpt-4" # or gpt-4-turbo, gpt-3.5-turbo
OPENAI_TEMPERATURE="0.7"
OPENAI_MAX_TOKENS="2000"
```
#### Anthropic Configuration
```python
# Environment variables
ANTHROPIC_API_KEY="sk-ant-your-key-here"
ANTHROPIC_MODEL="claude-3-opus-20240229" # or claude-3-sonnet, claude-3-haiku
ANTHROPIC_MAX_TOKENS="2000"
```
#### Local Ollama Configuration
```python
# Environment variables
OLLAMA_BASE_URL="http://localhost:11434"
OLLAMA_MODEL="llama3.1:8b" # or any installed model
# Supported local models:
# - llama3.1:8b, llama3.1:70b
# - mistral:7b, mixtral:8x7b
# - qwen2:7b, deepseek-coder:6.7b
# - and 20+ more models
```
## ๐ง Available MCP Tools
### `generate_synthetic_dataset`
Generate domain-specific synthetic datasets with compliance validation.
**Parameters:**
- `domain`: Healthcare, finance, or custom
- `dataset_type`: Patient records, transactions, clinical trials, etc.
- `record_count`: Number of synthetic records to generate
- `privacy_level`: Privacy protection level (low/medium/high/maximum)
- `compliance_frameworks`: Required compliance validations
- `output_format`: JSON, CSV, Parquet, or database export
- `provider`: Override automatic provider selection
### `validate_dataset_compliance`
Validate existing datasets against regulatory requirements.
### `analyze_privacy_risk`
Comprehensive privacy risk assessment for datasets.
### `generate_domain_schema`
Create Pydantic schemas for domain-specific data structures.
### `benchmark_synthetic_data`
Performance and utility benchmarking against real data.
## ๐ Compliance Frameworks
### Healthcare Compliance
- **HIPAA Safe Harbor**: Automatic validation of 18 identifiers
- **HIPAA Expert Determination**: Statistical disclosure control
- **FDA Guidance**: Synthetic clinical data for submissions
- **GDPR**: Healthcare data processing compliance
- **HITECH**: Security and breach notification
### Finance Compliance
- **PCI DSS**: Payment card industry data security
- **SOX**: Sarbanes-Oxley internal controls
- **Basel III**: Banking regulatory framework
- **MiFID II**: Markets in Financial Instruments Directive
- **Dodd-Frank**: Financial reform regulations
## ๐ Privacy Protection
### Core Privacy Features
- **Differential Privacy**: Configurable ฮต values (0.1-1.0)
- **Statistical Disclosure Control**: k-anonymity, l-diversity, t-closeness
- **Synthetic Data Indistinguishability**: Provable privacy guarantees
- **Re-identification Risk Assessment**: Continuous monitoring
- **Privacy Budget Management**: Automatic composition tracking
### PII Protection Guarantee
- **NO Data Retention**: Original personal data is NEVER stored
- **Automatic PII Detection**: Identifies names, emails, SSNs, phones, addresses, credit cards
- **Complete Anonymization**: All PII is anonymized before pattern learning
- **Statistical Learning Only**: Only learns distributions, means, and frequencies
- **100% Synthetic Output**: Generated data is completely fake
### Credit Card Safety
- **Test Card Numbers Only**: Uses official test cards (4242-4242-4242-4242, etc.)
- **Provider Support**: Visa, Mastercard, AmEx, Discover, and more
- **Configurable Providers**: Specify provider or use weighted distribution
- **Never Real Cards**: Original credit card numbers are never retained or output
Example usage with credit card provider selection:
```python
# Use specific provider test cards
result = await pipeline.ingest(
source=data,
credit_card_provider='visa' # Uses Visa test cards
)
# Or let system use mixed providers (default)
result = await pipeline.ingest(
source=data # Automatically uses weighted distribution
)
```
## ๐ Performance & Quality
- **Statistical Fidelity**: 95%+ correlation preservation
- **Privacy Preservation**: <1% re-identification risk
- **Utility Preservation**: >90% ML model performance
- **Compliance Rate**: 100% regulatory framework adherence
- **Generation Speed**: 1,000-10,000 records/second (provider dependent)
### Provider Performance Comparison
| Provider | Speed (req/s) | Quality | Privacy | Cost |
|----------|---------------|---------|---------|------|
| Ollama Local | 10-50 | High | Maximum | Free |
| OpenAI GPT-4 | 20-100 | Excellent | Medium | $$$ |
| Claude 3 Opus | 15-80 | Excellent | Medium | $$$ |
| Gemini Pro | 50-200 | Good | Medium | $ |
| OpenRouter | 10-100 | Variable | Medium | $ |
## ๐งช Testing
```bash
# Run all tests
pytest
# Run compliance tests only
pytest -m compliance
# Run privacy tests
pytest -m privacy
# Run with coverage
pytest --cov=synthetic_data_mcp --cov-report=html
# Test specific provider
OPENAI_API_KEY=sk-test pytest -m integration
```
## ๐ Deployment
### Docker Deployment
```bash
docker build -t synthetic-data-mcp .
docker run -p 3000:3000 \
-e OPENAI_API_KEY=your-key \
synthetic-data-mcp
```
### Kubernetes Deployment
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: synthetic-data-mcp
spec:
replicas: 3
selector:
matchLabels:
app: synthetic-data-mcp
template:
metadata:
labels:
app: synthetic-data-mcp
spec:
containers:
- name: synthetic-data-mcp
image: synthetic-data-mcp:latest
ports:
- containerPort: 3000
env:
- name: OPENAI_API_KEY
valueFrom:
secretKeyRef:
name: llm-secrets
key: openai-key
```
## ๐ง Development
### Code Quality
```bash
# Format code
black .
isort .
# Run linting
flake8 src tests
# Type checking
mypy src
```
### Adding New Providers
1. Create provider module in `src/synthetic_data_mcp/providers/`
2. Implement DSPy LM interface
3. Add configuration in `core/generator.py`
4. Add tests in `tests/test_providers.py`
## ๐ Examples
### Healthcare Example
```python
import asyncio
from synthetic_data_mcp import SyntheticDataGenerator
async def generate_patients():
generator = SyntheticDataGenerator()
result = await generator.generate_dataset(
domain="healthcare",
dataset_type="patient_records",
record_count=1000,
privacy_level="high",
compliance_frameworks=["hipaa"]
)
print(f"Generated {len(result['dataset'])} patient records")
return result
# Run the example
asyncio.run(generate_patients())
```
### Finance Example
```python
async def generate_transactions():
generator = SyntheticDataGenerator()
result = await generator.generate_dataset(
domain="finance",
dataset_type="transactions",
record_count=50000,
privacy_level="high",
compliance_frameworks=["pci_dss"]
)
print(f"Generated {len(result['dataset'])} transactions")
return result
```
## ๐ค Contributing
We welcome contributions! Please see our [Contributing Guide](CONTRIBUTING.md) for details.
### Development Setup
```bash
git clone https://github.com/marc-shade/synthetic-data-mcp
cd synthetic-data-mcp
python -m venv .venv
source .venv/bin/activate
pip install -e ".[dev,healthcare,finance]"
pre-commit install
```
## ๐ License
MIT License - see [LICENSE](LICENSE) file for details.
## ๐ Support
- [GitHub Issues](https://github.com/marc-shade/synthetic-data-mcp/issues)
- [GitHub Discussions](https://github.com/marc-shade/synthetic-data-mcp/discussions)
- Email: support@2acrestudios.com
## ๐ Related Projects
- [Model Context Protocol (MCP)](https://modelcontextprotocol.io/)
- [DSPy Framework](https://dspy-docs.vercel.app/)
- [Ollama](https://ollama.ai/) - Local LLM inference
- [OpenRouter](https://openrouter.ai/) - Access to 100+ models
---
Built with โค๏ธ for enterprise developers who need compliant, privacy-preserving synthetic data generation.
Raw data
{
"_id": null,
"home_page": null,
"name": "synthetic-data-mcp",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.10",
"maintainer_email": null,
"keywords": "synthetic-data, mcp, healthcare, finance, compliance, privacy",
"author": null,
"author_email": "Marc Shade <marc@2acrestudios.com>",
"download_url": "https://files.pythonhosted.org/packages/ea/47/0b30752660c8caaeaf9530ae312adc8e5f3f1a64d6f756608b5e86ed53c8/synthetic_data_mcp-0.1.1.tar.gz",
"platform": null,
"description": "# Synthetic Data MCP Server\n\nEnterprise-grade Model Context Protocol (MCP) server for generating privacy-compliant synthetic datasets. Built for regulated industries requiring HIPAA, PCI DSS, SOX, and GDPR compliance with multiple LLM provider support.\n\n## \ud83d\ude80 Features\n<img align=\"right\" width=\"300\" height=\"300\" alt=\"synthetic-data-mcp\" src=\"https://github.com/user-attachments/assets/46620579-0933-4d55-82e9-3700c75fe566\" />\n\n### Core Capabilities\n- **Privacy-First Local Inference**: Ollama integration for 100% local data generation\n- **Domain-Specific Generation**: Specialized synthetic data for healthcare and finance\n- **Privacy Protection**: Differential privacy, k-anonymity, l-diversity\n- **PII Safety Guarantee**: Never retains or outputs original personal data\n- **Compliance Validation**: HIPAA, PCI DSS, SOX, GDPR compliance checking\n- **Statistical Fidelity**: Advanced validation to ensure data utility\n- **Audit Trail**: Comprehensive logging for regulatory compliance\n- **Multi-Provider Support**: Ollama (default), OpenAI, Anthropic, Google, OpenRouter\n\n### LLM Provider Support (2025 Models)\n- **OpenAI**: GPT-5, GPT-5 Mini/Nano, GPT-4o\n- **Anthropic**: Claude Opus 4.1, Claude Sonnet 4 (1M context), Claude 3.5 series\n- **Google**: Gemini 2.5 Pro/Flash/Flash-Lite (1M+ context, multimodal)\n- **Local Models**: Dynamic Ollama integration (Llama 3.3, Qwen 2.5/3, DeepSeek-R1, Mistral Small 3)\n- **Smart Routing**: Automatic provider selection with cost optimization\n- **Fallback**: Multi-tier fallback with local model support\n\n### Technology Stack (2025 Latest)\n- **FastAPI 0.116+**: High-performance async web framework\n- **FastMCP**: High-performance MCP server implementation\n- **Pydantic 2.11+**: Type-safe data validation with enhanced performance\n- **SQLAlchemy 2.0+**: Modern async ORM with type safety\n- **DSPy**: Language model programming framework for intelligent data generation\n- **NumPy 2.3+ & Pandas 2.3+**: Advanced data processing capabilities\n- **Redis & DiskCache**: Multi-tier caching for cost optimization\n- **Rich**: Beautiful terminal interfaces and progress indicators\n\n## \ud83c\udfaf Enterprise Benefits\n\n- **Privacy-First**: Generate synthetic data without exposing sensitive information\n- **Compliance-Ready**: Built-in validation for HIPAA, PCI DSS, SOX, and GDPR\n- **Multi-Provider**: Support for cloud APIs and local inference\n- **Production-Scale**: High-performance generation for enterprise data volumes\n- **Zero Vendor Lock-in**: Switch between providers seamlessly\n- **Cost Control**: Use local models for unlimited generation\n\n## \ud83c\udfe5 Healthcare Use Cases\n\n- Patient record synthesis with HIPAA Safe Harbor compliance\n- Clinical trial data generation for FDA submissions\n- Medical research datasets without PHI exposure\n- Drug discovery data augmentation\n- Healthcare analytics and ML model training\n- EHR system testing and validation\n\n## \ud83d\udcb0 Finance Use Cases\n\n- Transaction pattern modeling for fraud detection\n- Credit risk assessment dataset generation\n- Regulatory stress testing data (Basel III, Dodd-Frank)\n- PCI DSS compliant payment data synthesis\n- Trading algorithm development and backtesting\n- Financial reporting system validation\n\n## \ud83d\udee0\ufe0f Installation\n\n### Production Installation\n```bash\npip install synthetic-data-mcp\n```\n\n### Development Installation\n```bash\ngit clone https://github.com/marc-shade/synthetic-data-mcp\ncd synthetic-data-mcp\npip install -e \".[dev,healthcare,finance]\"\n```\n\n## \ud83c\udfaf Quick Start\n\n### 1. Configure LLM Provider\n\nChoose your preferred provider:\n\n#### OpenAI (Recommended for Production)\n```bash\nexport OPENAI_API_KEY=\"sk-your-key-here\"\n```\n\n#### Anthropic Claude\n```bash\nexport ANTHROPIC_API_KEY=\"sk-ant-your-key-here\"\n```\n\n#### Google Gemini\n```bash\nexport GOOGLE_API_KEY=\"your-key-here\"\n```\n\n#### OpenRouter (Access to 100+ Models)\n```bash\nexport OPENROUTER_API_KEY=\"sk-or-your-key-here\"\nexport OPENROUTER_MODEL=\"meta-llama/llama-3.1-8b-instruct\"\n```\n\n#### Local Models (Ollama) - Privacy-First (DEFAULT)\n```bash\n# Install Ollama first: https://ollama.ai\nollama pull mistral-small:latest # Or any preferred model\nexport OLLAMA_BASE_URL=\"http://localhost:11434\"\nexport OLLAMA_MODEL=\"mistral-small:latest\"\n\n# The system automatically detects and uses Ollama if available\n# No API keys required for local inference!\n```\n\n### 2. Start the MCP Server\n\n```bash\nsynthetic-data-mcp serve --port 3000\n```\n\n### 3. Add to Claude Desktop Configuration\n\n```json\n{\n \"mcpServers\": {\n \"synthetic-data\": {\n \"command\": \"python\",\n \"args\": [\"-m\", \"synthetic_data_mcp.server\"],\n \"env\": {\n \"OPENAI_API_KEY\": \"your-api-key\"\n }\n }\n }\n}\n```\n\n### 4. Generate Synthetic Data\n\n```python\n# Using the MCP client\nresult = await client.call_tool(\n \"generate_synthetic_dataset\",\n {\n \"domain\": \"healthcare\",\n \"dataset_type\": \"patient_records\",\n \"record_count\": 10000,\n \"privacy_level\": \"high\",\n \"compliance_frameworks\": [\"hipaa\"],\n \"output_format\": \"json\"\n }\n)\n```\n\n## \ud83c\udfd7\ufe0f Provider Configuration\n\n### Priority-Based Provider Selection\n\nThe system automatically selects the best available provider:\n\n1. **Local Models (Ollama)** - Highest privacy, no API costs\n2. **OpenAI** - Best performance and reliability\n3. **Anthropic Claude** - Excellent reasoning capabilities\n4. **Google Gemini** - Fast and cost-effective\n5. **OpenRouter** - Access to open source models\n6. **Fallback Mock** - Testing and development\n\n### Provider-Specific Configuration\n\n#### OpenAI Configuration\n```python\n# Environment variables\nOPENAI_API_KEY=\"sk-your-key-here\"\nOPENAI_MODEL=\"gpt-4\" # or gpt-4-turbo, gpt-3.5-turbo\nOPENAI_TEMPERATURE=\"0.7\"\nOPENAI_MAX_TOKENS=\"2000\"\n```\n\n#### Anthropic Configuration\n```python\n# Environment variables\nANTHROPIC_API_KEY=\"sk-ant-your-key-here\"\nANTHROPIC_MODEL=\"claude-3-opus-20240229\" # or claude-3-sonnet, claude-3-haiku\nANTHROPIC_MAX_TOKENS=\"2000\"\n```\n\n#### Local Ollama Configuration\n```python\n# Environment variables\nOLLAMA_BASE_URL=\"http://localhost:11434\"\nOLLAMA_MODEL=\"llama3.1:8b\" # or any installed model\n\n# Supported local models:\n# - llama3.1:8b, llama3.1:70b\n# - mistral:7b, mixtral:8x7b\n# - qwen2:7b, deepseek-coder:6.7b\n# - and 20+ more models\n```\n\n## \ud83d\udd27 Available MCP Tools\n\n### `generate_synthetic_dataset`\nGenerate domain-specific synthetic datasets with compliance validation.\n\n**Parameters:**\n- `domain`: Healthcare, finance, or custom\n- `dataset_type`: Patient records, transactions, clinical trials, etc.\n- `record_count`: Number of synthetic records to generate\n- `privacy_level`: Privacy protection level (low/medium/high/maximum)\n- `compliance_frameworks`: Required compliance validations\n- `output_format`: JSON, CSV, Parquet, or database export\n- `provider`: Override automatic provider selection\n\n### `validate_dataset_compliance`\nValidate existing datasets against regulatory requirements.\n\n### `analyze_privacy_risk`\nComprehensive privacy risk assessment for datasets.\n\n### `generate_domain_schema`\nCreate Pydantic schemas for domain-specific data structures.\n\n### `benchmark_synthetic_data`\nPerformance and utility benchmarking against real data.\n\n## \ud83d\udccb Compliance Frameworks\n\n### Healthcare Compliance\n- **HIPAA Safe Harbor**: Automatic validation of 18 identifiers\n- **HIPAA Expert Determination**: Statistical disclosure control\n- **FDA Guidance**: Synthetic clinical data for submissions\n- **GDPR**: Healthcare data processing compliance\n- **HITECH**: Security and breach notification\n\n### Finance Compliance\n- **PCI DSS**: Payment card industry data security\n- **SOX**: Sarbanes-Oxley internal controls\n- **Basel III**: Banking regulatory framework\n- **MiFID II**: Markets in Financial Instruments Directive\n- **Dodd-Frank**: Financial reform regulations\n\n## \ud83d\udd12 Privacy Protection\n\n### Core Privacy Features\n- **Differential Privacy**: Configurable \u03b5 values (0.1-1.0)\n- **Statistical Disclosure Control**: k-anonymity, l-diversity, t-closeness\n- **Synthetic Data Indistinguishability**: Provable privacy guarantees\n- **Re-identification Risk Assessment**: Continuous monitoring\n- **Privacy Budget Management**: Automatic composition tracking\n\n### PII Protection Guarantee\n- **NO Data Retention**: Original personal data is NEVER stored\n- **Automatic PII Detection**: Identifies names, emails, SSNs, phones, addresses, credit cards\n- **Complete Anonymization**: All PII is anonymized before pattern learning\n- **Statistical Learning Only**: Only learns distributions, means, and frequencies\n- **100% Synthetic Output**: Generated data is completely fake\n\n### Credit Card Safety\n- **Test Card Numbers Only**: Uses official test cards (4242-4242-4242-4242, etc.)\n- **Provider Support**: Visa, Mastercard, AmEx, Discover, and more\n- **Configurable Providers**: Specify provider or use weighted distribution\n- **Never Real Cards**: Original credit card numbers are never retained or output\n\nExample usage with credit card provider selection:\n```python\n# Use specific provider test cards\nresult = await pipeline.ingest(\n source=data,\n credit_card_provider='visa' # Uses Visa test cards\n)\n\n# Or let system use mixed providers (default)\nresult = await pipeline.ingest(\n source=data # Automatically uses weighted distribution\n)\n```\n\n## \ud83d\udcca Performance & Quality\n\n- **Statistical Fidelity**: 95%+ correlation preservation\n- **Privacy Preservation**: <1% re-identification risk\n- **Utility Preservation**: >90% ML model performance\n- **Compliance Rate**: 100% regulatory framework adherence\n- **Generation Speed**: 1,000-10,000 records/second (provider dependent)\n\n### Provider Performance Comparison\n\n| Provider | Speed (req/s) | Quality | Privacy | Cost |\n|----------|---------------|---------|---------|------|\n| Ollama Local | 10-50 | High | Maximum | Free |\n| OpenAI GPT-4 | 20-100 | Excellent | Medium | $$$ |\n| Claude 3 Opus | 15-80 | Excellent | Medium | $$$ |\n| Gemini Pro | 50-200 | Good | Medium | $ |\n| OpenRouter | 10-100 | Variable | Medium | $ |\n\n## \ud83e\uddea Testing\n\n```bash\n# Run all tests\npytest\n\n# Run compliance tests only\npytest -m compliance\n\n# Run privacy tests\npytest -m privacy\n\n# Run with coverage\npytest --cov=synthetic_data_mcp --cov-report=html\n\n# Test specific provider\nOPENAI_API_KEY=sk-test pytest -m integration\n```\n\n## \ud83d\ude80 Deployment\n\n### Docker Deployment\n```bash\ndocker build -t synthetic-data-mcp .\ndocker run -p 3000:3000 \\\n -e OPENAI_API_KEY=your-key \\\n synthetic-data-mcp\n```\n\n### Kubernetes Deployment\n```yaml\napiVersion: apps/v1\nkind: Deployment\nmetadata:\n name: synthetic-data-mcp\nspec:\n replicas: 3\n selector:\n matchLabels:\n app: synthetic-data-mcp\n template:\n metadata:\n labels:\n app: synthetic-data-mcp\n spec:\n containers:\n - name: synthetic-data-mcp\n image: synthetic-data-mcp:latest\n ports:\n - containerPort: 3000\n env:\n - name: OPENAI_API_KEY\n valueFrom:\n secretKeyRef:\n name: llm-secrets\n key: openai-key\n```\n\n## \ud83d\udd27 Development\n\n### Code Quality\n```bash\n# Format code\nblack .\nisort .\n\n# Run linting\nflake8 src tests\n\n# Type checking\nmypy src\n```\n\n### Adding New Providers\n\n1. Create provider module in `src/synthetic_data_mcp/providers/`\n2. Implement DSPy LM interface\n3. Add configuration in `core/generator.py`\n4. Add tests in `tests/test_providers.py`\n\n## \ud83d\udcda Examples\n\n### Healthcare Example\n```python\nimport asyncio\nfrom synthetic_data_mcp import SyntheticDataGenerator\n\nasync def generate_patients():\n generator = SyntheticDataGenerator()\n \n result = await generator.generate_dataset(\n domain=\"healthcare\",\n dataset_type=\"patient_records\",\n record_count=1000,\n privacy_level=\"high\",\n compliance_frameworks=[\"hipaa\"]\n )\n \n print(f\"Generated {len(result['dataset'])} patient records\")\n return result\n\n# Run the example\nasyncio.run(generate_patients())\n```\n\n### Finance Example\n```python\nasync def generate_transactions():\n generator = SyntheticDataGenerator()\n \n result = await generator.generate_dataset(\n domain=\"finance\",\n dataset_type=\"transactions\",\n record_count=50000,\n privacy_level=\"high\",\n compliance_frameworks=[\"pci_dss\"]\n )\n \n print(f\"Generated {len(result['dataset'])} transactions\")\n return result\n```\n\n## \ud83e\udd1d Contributing\n\nWe welcome contributions! Please see our [Contributing Guide](CONTRIBUTING.md) for details.\n\n### Development Setup\n```bash\ngit clone https://github.com/marc-shade/synthetic-data-mcp\ncd synthetic-data-mcp\npython -m venv .venv\nsource .venv/bin/activate\npip install -e \".[dev,healthcare,finance]\"\npre-commit install\n```\n\n## \ud83d\udcc4 License\n\nMIT License - see [LICENSE](LICENSE) file for details.\n\n## \ud83c\udd98 Support\n\n- [GitHub Issues](https://github.com/marc-shade/synthetic-data-mcp/issues)\n- [GitHub Discussions](https://github.com/marc-shade/synthetic-data-mcp/discussions)\n- Email: support@2acrestudios.com\n\n## \ud83d\udd17 Related Projects\n\n- [Model Context Protocol (MCP)](https://modelcontextprotocol.io/)\n- [DSPy Framework](https://dspy-docs.vercel.app/)\n- [Ollama](https://ollama.ai/) - Local LLM inference\n- [OpenRouter](https://openrouter.ai/) - Access to 100+ models\n\n---\n\nBuilt with \u2764\ufe0f for enterprise developers who need compliant, privacy-preserving synthetic data generation.\n",
"bugtrack_url": null,
"license": "MIT",
"summary": "Domain-specific synthetic data generation MCP server for healthcare and finance compliance",
"version": "0.1.1",
"project_urls": {
"Documentation": "https://synthetic-data-mcp.readthedocs.io",
"Homepage": "https://github.com/marc-shade/synthetic-data-mcp",
"Issues": "https://github.com/marc-shade/synthetic-data-mcp/issues",
"Repository": "https://github.com/marc-shade/synthetic-data-mcp"
},
"split_keywords": [
"synthetic-data",
" mcp",
" healthcare",
" finance",
" compliance",
" privacy"
],
"urls": [
{
"comment_text": null,
"digests": {
"blake2b_256": "4c76366c659b1af1938e4279c4a34d194e3dffa049ea32a17fef97ade2a677c4",
"md5": "e38fbc72e67458ab3362c11779e0ac22",
"sha256": "c09fb0523cac4aa921b530b6559ee9a1a52aff8f3abf41f2697141147c26dfe1"
},
"downloads": -1,
"filename": "synthetic_data_mcp-0.1.1-py3-none-any.whl",
"has_sig": false,
"md5_digest": "e38fbc72e67458ab3362c11779e0ac22",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.10",
"size": 222363,
"upload_time": "2025-09-18T22:39:30",
"upload_time_iso_8601": "2025-09-18T22:39:30.745147Z",
"url": "https://files.pythonhosted.org/packages/4c/76/366c659b1af1938e4279c4a34d194e3dffa049ea32a17fef97ade2a677c4/synthetic_data_mcp-0.1.1-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "ea470b30752660c8caaeaf9530ae312adc8e5f3f1a64d6f756608b5e86ed53c8",
"md5": "8a24e27c6d3f74053c2ef85a92efbd21",
"sha256": "cf043f1f11b5870eceead26b8834c271ed811daa1662498a4a2fa7103d101789"
},
"downloads": -1,
"filename": "synthetic_data_mcp-0.1.1.tar.gz",
"has_sig": false,
"md5_digest": "8a24e27c6d3f74053c2ef85a92efbd21",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.10",
"size": 280732,
"upload_time": "2025-09-18T22:39:32",
"upload_time_iso_8601": "2025-09-18T22:39:32.196601Z",
"url": "https://files.pythonhosted.org/packages/ea/47/0b30752660c8caaeaf9530ae312adc8e5f3f1a64d6f756608b5e86ed53c8/synthetic_data_mcp-0.1.1.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2025-09-18 22:39:32",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "marc-shade",
"github_project": "synthetic-data-mcp",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"requirements": [
{
"name": "fastmcp",
"specs": [
[
">=",
"0.1.0"
]
]
},
{
"name": "dspy-ai",
"specs": [
[
">=",
"2.4.0"
]
]
},
{
"name": "pydantic",
"specs": [
[
">=",
"2.5.0"
]
]
},
{
"name": "numpy",
"specs": [
[
">=",
"1.24.0"
]
]
},
{
"name": "pandas",
"specs": [
[
">=",
"2.0.0"
]
]
},
{
"name": "scipy",
"specs": [
[
">=",
"1.10.0"
]
]
},
{
"name": "scikit-learn",
"specs": [
[
">=",
"1.3.0"
]
]
},
{
"name": "cryptography",
"specs": [
[
">=",
"41.0.0"
]
]
},
{
"name": "python-dateutil",
"specs": [
[
">=",
"2.8.0"
]
]
},
{
"name": "faker",
"specs": [
[
">=",
"19.0.0"
]
]
},
{
"name": "opendp",
"specs": [
[
">=",
"0.8.0"
]
]
},
{
"name": "sqlalchemy",
"specs": [
[
">=",
"2.0.0"
]
]
},
{
"name": "alembic",
"specs": [
[
">=",
"1.12.0"
]
]
},
{
"name": "pyjwt",
"specs": [
[
">=",
"2.8.0"
]
]
},
{
"name": "httpx",
"specs": [
[
">=",
"0.25.0"
]
]
},
{
"name": "requests",
"specs": [
[
">=",
"2.31.0"
]
]
},
{
"name": "loguru",
"specs": [
[
">=",
"0.7.0"
]
]
},
{
"name": "typer",
"specs": [
[
">=",
"0.9.0"
]
]
},
{
"name": "rich",
"specs": [
[
">=",
"13.0.0"
]
]
},
{
"name": "pytest",
"specs": [
[
">=",
"7.4.0"
]
]
},
{
"name": "pytest-asyncio",
"specs": [
[
">=",
"0.21.0"
]
]
},
{
"name": "gunicorn",
"specs": [
[
">=",
"21.2.0"
]
]
},
{
"name": "uvicorn",
"specs": [
[
">=",
"0.24.0"
]
]
},
{
"name": "fastapi",
"specs": [
[
">=",
"0.104.0"
]
]
},
{
"name": "prometheus-client",
"specs": [
[
">=",
"0.19.0"
]
]
},
{
"name": "structlog",
"specs": [
[
">=",
"23.2.0"
]
]
},
{
"name": "asyncpg",
"specs": [
[
">=",
"0.29.0"
]
]
},
{
"name": "aiomysql",
"specs": [
[
">=",
"0.2.0"
]
]
},
{
"name": "motor",
"specs": [
[
">=",
"3.3.0"
]
]
},
{
"name": "aioredis",
"specs": [
[
">=",
"2.0.0"
]
]
},
{
"name": "psycopg2-binary",
"specs": [
[
">=",
"2.9.0"
]
]
},
{
"name": "google-cloud-bigquery",
"specs": [
[
">=",
"3.13.0"
]
]
},
{
"name": "google-cloud-bigquery-storage",
"specs": [
[
">=",
"2.22.0"
]
]
},
{
"name": "google-oauth2-tool",
"specs": [
[
">=",
"0.0.3"
]
]
},
{
"name": "snowflake-connector-python",
"specs": [
[
">=",
"3.5.0"
]
]
},
{
"name": "boto3",
"specs": [
[
">=",
"1.34.0"
]
]
},
{
"name": "aiofiles",
"specs": [
[
">=",
"23.2.0"
]
]
},
{
"name": "pymongo",
"specs": [
[
">=",
"4.6.0"
]
]
}
],
"lcname": "synthetic-data-mcp"
}