# DDN Metadata Bootstrap
AI-powered metadata enhancement for Hasura DDN (Data Delivery Network) schema files. The tool generates descriptions and detects relationships in your YAML/HML schema definitions, and every processing step can be tuned through configuration.
## 🚀 Features
### 🤖 **Multi-Provider AI Support**
- **Anthropic Claude**: Default provider with claude-3-haiku, claude-3-sonnet, and claude-3-opus models
- **OpenAI GPT**: Support for gpt-3.5-turbo, gpt-4, gpt-4o-mini, and latest models
- **Google Gemini**: Support for gemini-pro, gemini-1.5-pro, and gemini-1.5-flash models
- **Automatic Fallback**: Graceful degradation between providers with configurable priorities
- **Provider-Specific Optimization**: Model-specific prompting and parameter tuning
### 🎯 **Granular Feature Control**
- **Individual Feature Flags**: Control each processing feature independently
- **Flexible Processing Modes**: Choose between all, forward-only, or none for relationships
- **Selective Enhancement**: Process only descriptions, only relationships, or both
- **Rebuild Capabilities**: Rebuild existing relationships from scratch when needed
### 🧠 **Advanced AI Generation**
- **Quality Assessment with Retry Logic**: Multi-attempt generation with configurable scoring thresholds
- **Context-Aware Business Descriptions**: Domain-specific system prompts with industry context
- **Smart Field Analysis**: Automatically detects and skips self-explanatory, generic, or cryptic fields
- **Configurable Length Controls**: Precise control over description length and token usage
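
The retry behaviour listed above can be pictured as a loop that scores each candidate and stops as soon as one meets the threshold. The following is a minimal sketch, not the package's implementation: `generate_with_retry`, `generate`, and `score` are hypothetical names, and only the two defaults mirror the `minimum_description_score` and `max_description_retry_attempts` keys from the configuration file. Discarding candidates that never reach the threshold is also an assumption.

```python
from typing import Callable, Optional, Tuple


def generate_with_retry(
    generate: Callable[[int], str],  # hypothetical: attempt number -> candidate text
    score: Callable[[str], int],     # hypothetical: candidate text -> score in 0..100
    minimum_score: int = 70,         # mirrors minimum_description_score
    max_attempts: int = 3,           # mirrors max_description_retry_attempts
) -> Tuple[Optional[str], int]:
    """Return the first candidate that meets the threshold.

    If no attempt qualifies, return (None, best score seen) so the caller
    can leave the field without a description.
    """
    best_score = -1
    for attempt in range(1, max_attempts + 1):
        candidate = generate(attempt)
        candidate_score = score(candidate)
        if candidate_score >= minimum_score:
            return candidate, candidate_score
        best_score = max(best_score, candidate_score)
    return None, best_score
```

In a real run `generate` would call the configured AI provider and `score` would apply the quality checks; both are left abstract here so the loop itself stays testable.
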
### 🧠 **Intelligent Caching System**
- **Similarity-Based Matching**: Reuses descriptions for similar fields across entities (85% similarity threshold)
- **Performance Optimization**: Reduces API calls by up to 70% on large schemas through intelligent caching
- **Cache Statistics**: Real-time performance monitoring with hit rates and API cost savings tracking
- **Type-Aware Matching**: Considers field types and entity context for better cache accuracy
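
To illustrate how similarity-based, type-aware reuse might work, here is a small sketch built on the standard library's `difflib.SequenceMatcher`. It is an assumption-laden illustration: the `SimilarityCache` class is hypothetical, and the similarity measure the tool actually uses is not stated in this README. Only the 85% threshold comes from the feature list above.

```python
from difflib import SequenceMatcher
from typing import Dict, Optional, Tuple


class SimilarityCache:
    """Toy description cache keyed by (field name, field type)."""

    def __init__(self, threshold: float = 0.85) -> None:
        self.threshold = threshold
        self._entries: Dict[Tuple[str, str], str] = {}
        self.hits = 0
        self.misses = 0

    def put(self, name: str, field_type: str, description: str) -> None:
        self._entries[(name.lower(), field_type)] = description

    def get(self, name: str, field_type: str) -> Optional[str]:
        """Return the description of the most similar cached field of the same type."""
        best_ratio, best_description = 0.0, None
        for (cached_name, cached_type), description in self._entries.items():
            if cached_type != field_type:
                continue  # type-aware: never reuse across field types
            ratio = SequenceMatcher(None, name.lower(), cached_name).ratio()
            if ratio > best_ratio:
                best_ratio, best_description = ratio, description
        if best_description is not None and best_ratio >= self.threshold:
            self.hits += 1
            return best_description
        self.misses += 1
        return None
```

The `hits` and `misses` counters are the kind of raw numbers a hit-rate statistic could be derived from.
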
### 🔍 **WordNet-Based Linguistic Analysis**
- **Generic Term Detection**: Uses NLTK and WordNet for sophisticated term analysis to skip meaningless fields
- **Semantic Density Analysis**: Evaluates conceptual richness and specificity of field names
- **Definition Quality Scoring**: Ensures meaningful, non-circular descriptions through linguistic validation
- **Abstraction Level Calculation**: Determines appropriate description depth based on semantic analysis
### 📝 **Enhanced Acronym Expansion**
- **Comprehensive Mappings**: 200+ pre-configured acronyms for technology, finance, and business domains
- **Context-Aware Expansion**: Industry-specific acronym interpretation based on domain context
- **Pre-Generation Enhancement**: Expands acronyms BEFORE AI generation for better context
- **Custom Domain Support**: Fully configurable acronym mappings via YAML configuration
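
As a rough picture of pre-generation expansion, the sketch below splits a field name into tokens and replaces known acronyms before any prompt would be built. The helper names `split_field_name` and `expand_acronyms` are hypothetical, and the three mappings are taken from the sample `acronym_mappings` in the configuration file; the tool's own tokenisation rules may differ.

```python
import re
from typing import Dict, List

# Subset of the sample acronym_mappings shown in config.yaml.
ACRONYMS: Dict[str, str] = {
    "mfa": "Multi-Factor Authentication",
    "sso": "Single Sign-On",
    "iam": "Identity and Access Management",
}


def split_field_name(name: str) -> List[str]:
    """Split snake_case and camelCase names into lowercase tokens."""
    spaced = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", name).replace("_", " ")
    return [token.lower() for token in spaced.split()]


def expand_acronyms(name: str, mappings: Dict[str, str] = ACRONYMS) -> str:
    """Replace known acronym tokens with their expansions; keep other tokens."""
    return " ".join(mappings.get(token, token) for token in split_field_name(name))
```

For example, `expand_acronyms("mfaEnabled")` yields a phrase that gives the model the full term rather than the bare acronym.
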
### 🔗 **Advanced Relationship Detection**
- **Template-Based FK Detection**: Sophisticated foreign key detection with confidence scoring and semantic validation
- **Shared Business Key Relationships**: Many-to-many relationships via shared field analysis with FK-aware filtering
- **Cross-Subgraph Intelligence**: Smart entity matching across different subgraphs
- **Configurable Templates**: Flexible FK template patterns with placeholders for complex naming conventions
- **Advanced Relationship Blocking**: Precision rule-based system to prevent inappropriate cross-connector relationships
### ⚙️ **Comprehensive Configuration System**
- **YAML-First Configuration**: Central `config.yaml` file for all settings with full documentation
- **Waterfall Precedence**: CLI args > Environment variables > config.yaml > defaults
- **Configuration Validation**: Comprehensive validation with helpful error messages and source tracking
- **Feature Toggles**: Granular control over processing features with clear flag names
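
The waterfall precedence above amounts to checking each source in order and taking the first one that defines the setting. This sketch assumes that environment variables follow the `METADATA_BOOTSTRAP_<KEY>` naming used elsewhere in this README; `resolve_setting` and `DEFAULTS` are hypothetical names, not part of the package API, and provider API keys such as `ANTHROPIC_API_KEY` do not follow this prefix.

```python
import os
from typing import Any, Dict, Mapping, Optional

# Illustrative defaults only; the real defaults live in the package.
DEFAULTS: Dict[str, Any] = {"ai_provider": "anthropic", "create_fk": "all"}
ENV_PREFIX = "METADATA_BOOTSTRAP_"


def resolve_setting(
    key: str,
    cli_args: Mapping[str, Any],
    yaml_config: Mapping[str, Any],
    env: Optional[Mapping[str, str]] = None,
) -> Any:
    """Return the value from the highest-priority source that defines `key`.

    Order: CLI args > environment variables > config.yaml > defaults.
    """
    env = os.environ if env is None else env
    if cli_args.get(key) is not None:
        return cli_args[key]
    env_name = ENV_PREFIX + key.upper()
    if env_name in env:
        return env[env_name]
    if key in yaml_config:
        return yaml_config[key]
    return DEFAULTS.get(key)
```

Treating a CLI value of `None` as "not provided" matches how `argparse` reports options the user did not pass.
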
### 🎯 **Advanced Quality Controls**
- **Buzzword Detection**: Avoids corporate jargon and meaningless generic terms
- **Pattern-Based Filtering**: Regex-based rejection of poor description formats
- **Technical Language Translation**: Converts technical terms to business-friendly language
- **Length Optimization**: Multiple validation layers with hard limits and target lengths
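
Buzzword detection and pattern-based filtering can be sketched as two passes over a candidate description. The word list and regular expressions below are copied from the sample `buzzwords` and `forbidden_patterns` settings in the configuration file; the `rejection_reasons` helper and the case-insensitive matching are assumptions made for illustration.

```python
import re
from typing import List

BUZZWORDS = [
    "synergy", "leverage", "paradigm", "ecosystem",
    "contains", "stores", "holds", "represents",
]
FORBIDDEN_PATTERNS = [
    r"this\s+field\s+represents",
    r"used\s+to\s+(track|manage|identify)",
    r"business.*information",
]


def rejection_reasons(description: str) -> List[str]:
    """List every buzzword or forbidden pattern found in a description."""
    reasons = []
    words = set(re.findall(r"[a-z]+", description.lower()))
    for word in BUZZWORDS:
        if word in words:
            reasons.append(f"buzzword: {word}")
    for pattern in FORBIDDEN_PATTERNS:
        if re.search(pattern, description, flags=re.IGNORECASE):
            reasons.append(f"pattern: {pattern}")
    return reasons
```

An empty list means the description passes both checks; a non-empty list could feed the retry loop as a reason to regenerate.
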
### **Intelligent Field Selection**
- **Generic Field Detection**: Skips overly common fields that don't benefit from descriptions
- **Cryptic Abbreviation Handling**: Configurable handling of unclear field names with vowel analysis
- **Self-Explanatory Pattern Recognition**: Automatically identifies fields that don't need descriptions
- **Value Assessment**: Only generates descriptions that add meaningful business value
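
One way to read "cryptic abbreviation handling with vowel analysis" is as a pair of cheap checks: very short names and names with almost no vowels are treated as abbreviations. The sketch below is a guess at that idea, not the tool's algorithm. `is_cryptic` and `min_vowel_ratio` are hypothetical; `max_cryptic_length` mirrors the `max_cryptic_field_length` setting.

```python
VOWELS = set("aeiou")


def is_cryptic(name: str, max_cryptic_length: int = 4, min_vowel_ratio: float = 0.2) -> bool:
    """Flag names that are very short or nearly vowel-free as likely abbreviations."""
    letters = [ch for ch in name.lower() if ch.isalpha()]
    if not letters:
        return True  # nothing readable to describe
    if len(letters) <= max_cryptic_length:
        return True  # at or below the configured length limit
    vowel_ratio = sum(ch in VOWELS for ch in letters) / len(letters)
    return vowel_ratio < min_vowel_ratio
```

Under these assumptions `cstmrnbr` is skipped while `customer_email` is sent on for description generation.
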
## 📦 Installation
### From PyPI (Recommended)
```bash
pip install ddn-metadata-bootstrap
```
### Provider-Specific Dependencies
The tool supports multiple AI providers. Install the dependencies for your chosen provider:
```bash
# For Anthropic Claude (default)
pip install ddn-metadata-bootstrap[anthropic]
# or separately:
pip install anthropic
# For OpenAI GPT
pip install ddn-metadata-bootstrap[openai]
# or separately:
pip install openai
# For Google Gemini
pip install ddn-metadata-bootstrap[gemini]
# or separately:
pip install google-generativeai
# Install all providers
pip install ddn-metadata-bootstrap[all]
```
### From Source
```bash
git clone https://github.com/hasura/ddn-metadata-bootstrap.git
cd ddn-metadata-bootstrap
pip install -e .
```
## Quick Start
### 1. Choose Your AI Provider
#### Option A: Anthropic Claude (Default - Recommended)
```bash
export ANTHROPIC_API_KEY="your-anthropic-api-key"
export METADATA_BOOTSTRAP_AI_PROVIDER="anthropic" # Optional (default)
export METADATA_BOOTSTRAP_ANTHROPIC_MODEL="claude-3-haiku-20240307" # Optional
```
#### Option B: OpenAI GPT
```bash
export OPENAI_API_KEY="your-openai-api-key"
export METADATA_BOOTSTRAP_AI_PROVIDER="openai"
export METADATA_BOOTSTRAP_OPENAI_MODEL="gpt-3.5-turbo" # Optional
```
#### Option C: Google Gemini
```bash
export GEMINI_API_KEY="your-gemini-api-key"
# or alternatively:
export GOOGLE_API_KEY="your-gemini-api-key"
export METADATA_BOOTSTRAP_AI_PROVIDER="gemini"
export METADATA_BOOTSTRAP_GEMINI_MODEL="gemini-pro" # Optional
```
### 2. Set up your directories
```bash
export METADATA_BOOTSTRAP_INPUT_DIR="./app/metadata"
export METADATA_BOOTSTRAP_OUTPUT_DIR="./enhanced_metadata"
```
### 3. Create a configuration file (Recommended)
Create a `config.yaml` file in your project directory:
```yaml
# config.yaml - DDN Metadata Bootstrap Configuration
# =============================================================================
# GLOBAL PROCESSING CONFIGURATION
# =============================================================================
# Controls which features are enabled and basic processing behavior
# Feature Flags - what processing to perform
create_fk: all # all|forward|none - FK relationships
create_shared_keys: all # all|forward|none - Shared key relationships
create_command_relationship_hints: true # true|false - Command relationship hints
create_descriptions: true # true|false - AI-generated descriptions
rebuild_relationships: false # true|false - Rebuild existing relationships from scratch
enable_quality_assessment: true # Enable AI to score and improve its own descriptions
# AI Provider Configuration
ai_provider: "anthropic" # Choose: anthropic, openai, gemini
# Provider-specific API keys (alternatively set via environment variables)
# anthropic_api_key: "your-anthropic-key"
# openai_api_key: "your-openai-key"
# gemini_api_key: "your-gemini-key"
# Provider-specific models
anthropic_model: "claude-3-haiku-20240307" # claude-3-sonnet-20240229, claude-3-opus-20240229
openai_model: "gpt-3.5-turbo" # gpt-4, gpt-4o-mini, gpt-4-turbo-preview
gemini_model: "gemini-pro" # gemini-1.5-pro-latest, gemini-1.5-flash
# =============================================================================
# DESCRIPTION GENERATION CONFIGURATION
# =============================================================================
# Domain-specific system prompt for your organization
system_prompt: |
  You generate concise field descriptions for database schema metadata at a global financial services firm.
  DOMAIN CONTEXT:
  - Organization: Global bank
  - Department: Cybersecurity operations
  - Use case: Risk management and security compliance
  - Regulatory environment: Financial services (SOX, Basel III, GDPR, etc.)
  Think: "What would a cybersecurity analyst at a bank need to know about this field?"
# Token and length limits
field_tokens: 25 # Max tokens AI can generate for field descriptions
kind_tokens: 50 # Max tokens AI can generate for kind descriptions
field_desc_max_length: 120 # Maximum total characters for field descriptions
kind_desc_max_length: 250 # Maximum total characters for entity descriptions
# Quality thresholds
minimum_description_score: 70 # Minimum score (0-100) to accept a description
max_description_retry_attempts: 3 # How many times to retry for better quality
# =============================================================================
# ENHANCED ACRONYM EXPANSION
# =============================================================================
acronym_mappings:
  # Technology & Computing
  api: "Application Programming Interface"
  ui: "User Interface"
  db: "Database"
  # Security & Access Management
  mfa: "Multi-Factor Authentication"
  sso: "Single Sign-On"
  iam: "Identity and Access Management"
  siem: "Security Information and Event Management"
  # Financial Services & Compliance
  pci: "Payment Card Industry"
  sox: "Sarbanes-Oxley Act"
  kyc: "Know-Your-Customer"
  aml: "Anti-Money Laundering"
  # ... 200+ total mappings available
# =============================================================================
# INTELLIGENT FIELD SELECTION
# =============================================================================
# Fields to skip entirely - these will not get descriptions at all
skip_field_patterns:
- "^id$"
- "^_id$"
- "^uuid$"
- "^created_at$"
- "^updated_at$"
- "^debug_.*"
- "^test_.*"
- "^temp_.*"
# Generic fields - won't get unique descriptions (too common)
generic_fields:
- "id"
- "key"
- "uid"
- "guid"
- "name"
# Self-explanatory fields - simple patterns that don't need descriptions
self_explanatory_patterns:
- '^id$'
- '^_id$'
- '^guid$'
- '^uuid$'
- '^key$'
# Cryptic Field Handling
skip_cryptic_abbreviations: true # Skip fields with unclear abbreviations
skip_ultra_short_fields: true # Skip very short field names that are likely abbreviations
max_cryptic_field_length: 4 # Field names this length or shorter are considered cryptic
# Content quality controls
buzzwords: [
  'synergy', 'leverage', 'paradigm', 'ecosystem',
  'contains', 'stores', 'holds', 'represents'
]
forbidden_patterns: [
  'this\\s+field\\s+represents',
  'used\\s+to\\s+(track|manage|identify)',
  'business.*information'
]
# =============================================================================
# RELATIONSHIP DETECTION
# =============================================================================
# FK Template Patterns for relationship detection
# Format: "{pk_pattern}|{fk_pattern}"
# Placeholders: {gi}=generic_id, {pt}=primary_table, {ps}=primary_subgraph, {pm}=prefix_modifier
fk_templates:
- "{gi}|{pm}_{pt}_{gi}" # active_service_name → Services.name
- "{gi}|{pt}_{gi}" # user_id → Users.id
- "{pt}_{gi}|{pm}_{pt}_{gi}" # user_id → ActiveUsers.active_user_id
# =============================================================================
# ADVANCED RELATIONSHIP BLOCKING
# =============================================================================
# Precision rule-based system to prevent inappropriate relationships
# Uses bidirectional validation with data_connector + entity + field pattern matching
fk_key_blacklist:
  # Block cross-cloud provider connections with infrastructure fields
  - entity_pattern_a:
      data_connector: "^(gcp|arg|various)$" # Google Cloud Platform, Azure Resource Graph, Various
      entity: "^(gcp_|google_).*" # Google/GCP entities
      field: ".*" # Any field
    entity_pattern_b:
      data_connector: "^(gcp|arg|various)$"
      entity: "^(az_|azure_).*" # Azure entities
      field: ".*(resource|project|policy|storage|compute).*" # Infrastructure fields only
    logic: "and"
    reason: "Block google/gcp entities from connecting to azure entities with infrastructure-related fields"
  # Complete isolation between major cloud platforms
  - entity_pattern_a:
      data_connector: "^gcp$" # Google Cloud Platform connector
      entity: ".*" # Any entity
      field: ".*" # Any field
    entity_pattern_b:
      data_connector: "^arg$" # Azure Resource Graph connector
      entity: ".*" # Any entity
      field: ".*" # Any field
    logic: "and"
    reason: "Block all connections between Google Cloud Platform and Azure Resource Graph connectors"
# Shared relationship limits
max_shared_relationships: 10000
max_shared_per_entity: 10
min_shared_confidence: 30
# Shared Key Rejection Patterns - fields matching these won't be used for shared relationships
shared_key_rejection_patterns:
# Private/Technical Fields (leading underscore indicates internal use)
- "^_.*$"
# Primary Identifiers (too generic for meaningful relationships)
- "^_?(id|key)$"
# Generic Classification Fields (overly broad categorization)
- "^(name|type|category|title|code|level|kind)$"
# State/Status Fields (frequently changing, not structural)
- "^(status|state|active)$"
# Audit Fields - Temporal Only (timestamp-based, not relational)
- "^(created|updated|modified)(_at|_date|_time|_timestamp)?$"
```
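
Before committing a set of `shared_key_rejection_patterns`, it can help to try them against a few field names. The sketch below uses the five patterns from the sample configuration verbatim; `is_rejected_shared_key` is a hypothetical helper, and case-sensitive matching on the raw field name is an assumption about how the tool applies them.

```python
import re

# Patterns copied from the sample config.yaml above.
SHARED_KEY_REJECTION_PATTERNS = [
    r"^_.*$",
    r"^_?(id|key)$",
    r"^(name|type|category|title|code|level|kind)$",
    r"^(status|state|active)$",
    r"^(created|updated|modified)(_at|_date|_time|_timestamp)?$",
]
_COMPILED = [re.compile(pattern) for pattern in SHARED_KEY_REJECTION_PATTERNS]


def is_rejected_shared_key(field_name: str) -> bool:
    """True if the field matches any rejection pattern and should not link entities."""
    return any(pattern.search(field_name) for pattern in _COMPILED)


if __name__ == "__main__":
    for candidate in ("id", "_internal", "created_at", "status_code", "account_number"):
        print(candidate, is_rejected_shared_key(candidate))
```

Because every pattern is anchored with `^` and `$`, a field such as `status_code` is not caught by the `status` rule and stays eligible as a shared key.
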
### 4. Run the tool with your chosen provider
```bash
# Use default provider (Anthropic) with default settings
ddn-metadata-bootstrap
# Use OpenAI explicitly
ddn-metadata-bootstrap --ai-provider openai --openai-api-key your-key
# Use Gemini with specific model
ddn-metadata-bootstrap --ai-provider gemini --gemini-model gemini-1.5-pro
# Show configuration including AI provider setup
ddn-metadata-bootstrap --show-config
# Test your AI provider connection
ddn-metadata-bootstrap --test-provider
# Process only relationships (skip descriptions)
ddn-metadata-bootstrap --create-descriptions false
# Process only descriptions (skip relationships)
ddn-metadata-bootstrap --create-fk none --create-shared-keys none
# Rebuild all relationships from scratch
ddn-metadata-bootstrap --rebuild-relationships
# Use custom configuration file
ddn-metadata-bootstrap --config custom-config.yaml
# Enable verbose logging to see AI provider selection and caching
ddn-metadata-bootstrap --verbose
```
## 🎯 Feature Control System
The tool provides granular control over each processing feature through clean, intuitive flags:
### Core Processing Features
| Feature | Config Key | Values | Description |
|---------|------------|--------|-------------|
| **FK Relationships** | `create_fk` | `all`, `forward`, `none` | Foreign key relationship detection |
| **Shared Key Relationships** | `create_shared_keys` | `all`, `forward`, `none` | Shared field relationship detection |
| **Command Hints** | `create_command_relationship_hints` | `true`, `false` | Command relationship hints |
| **Descriptions** | `create_descriptions` | `true`, `false` | AI-generated descriptions |
| **Rebuild Mode** | `rebuild_relationships` | `true`, `false` | Rebuild existing relationships |
### Processing Modes
#### **All Mode** (`all`)
- Creates relationships in both directions
- Full bidirectional relationship graph
- Best for comprehensive schema understanding
#### **Forward Mode** (`forward`)
- Creates relationships in forward direction only
- Reduces relationship complexity
- Useful for directed schema analysis
#### **None Mode** (`none`)
- Skips the feature entirely
- Fastest processing
- Use when feature not needed
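
The three modes can be summarised as a filter over detected relationships. In this sketch a relationship is a `(source, target)` pair, and "forward" is assumed to mean the direction in which it was detected (for example, from the entity holding the foreign key to the entity it references). `relationships_for_mode` is a hypothetical helper, not a function exported by the package.

```python
from typing import List, Tuple

Relationship = Tuple[str, str]  # (source entity, target entity)


def relationships_for_mode(detected: List[Relationship], mode: str) -> List[Relationship]:
    """Expand or suppress detected relationships according to all|forward|none."""
    if mode == "none":
        return []
    if mode == "forward":
        return list(detected)
    if mode == "all":
        result: List[Relationship] = []
        for source, target in detected:
            result.append((source, target))  # forward direction
            result.append((target, source))  # reverse direction
        return result
    raise ValueError(f"unknown mode: {mode!r} (expected all, forward, or none)")
```
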
### Feature Combinations
```yaml
# Only generate descriptions (no relationships)
create_fk: none
create_shared_keys: none
create_descriptions: true
# Only generate FK relationships (no descriptions or shared keys)
create_fk: all
create_shared_keys: none
create_descriptions: false
# Minimal processing (relationships only, forward direction)
create_fk: forward
create_shared_keys: forward
create_descriptions: false
# Full processing with rebuild
create_fk: all
create_shared_keys: all
create_descriptions: true
rebuild_relationships: true
```
## Advanced Relationship Blocking System
The tool includes a sophisticated **bidirectional relationship blocking system** that prevents inappropriate foreign key relationships from being generated. This is particularly important in enterprise environments with multiple data connectors, cloud providers, and security boundaries.
### Key Features
#### **Precision Pattern Matching**
Each blocking rule uses three-part patterns for maximum precision:
- **Data Connector**: Regex pattern matching the connector name (e.g., `^gcp$`, `^(test|dev)_.*`)
- **Entity Name**: Regex pattern matching the entity/table name (e.g., `^google_.*`, `^azure_storage.*`)
- **Field Name**: Regex pattern matching the field name (e.g., `.*resource.*`, `.*secret.*`)
#### **Bidirectional Validation**
Rules automatically check both directions of a relationship:
- **Pattern A → Pattern B**: `google_compute` → `azure_storage_resource`
- **Pattern B → Pattern A**: `azure_vm` → `google_analytics_data`
Both directions are blocked by a single rule definition.
#### **Flexible Logic Operators**
- **AND Logic**: All patterns (connector AND entity AND field) must match for both sides
- **OR Logic**: Either side matching its full pattern triggers the block
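
The three-part patterns, bidirectional check, and AND/OR logic described above can be modelled in a few lines. This is a sketch of one possible reading of those rules: the `Endpoint`, `EndpointPattern`, and `BlockRule` classes are hypothetical, and case-sensitive `re.search` matching is an assumption.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Endpoint:
    """One side of a candidate relationship."""
    data_connector: str
    entity: str
    field: str


@dataclass(frozen=True)
class EndpointPattern:
    """Three regexes that must all match for this side to count as matched."""
    data_connector: str
    entity: str
    field: str

    def matches(self, endpoint: Endpoint) -> bool:
        pairs = (
            (self.data_connector, endpoint.data_connector),
            (self.entity, endpoint.entity),
            (self.field, endpoint.field),
        )
        return all(re.search(pattern, value) is not None for pattern, value in pairs)


@dataclass(frozen=True)
class BlockRule:
    pattern_a: EndpointPattern
    pattern_b: EndpointPattern
    logic: str = "and"

    def blocks(self, source: Endpoint, target: Endpoint) -> bool:
        """Check A->B and B->A so one rule covers both directions."""
        combine = all if self.logic == "and" else any
        forward = combine([self.pattern_a.matches(source), self.pattern_b.matches(target)])
        reverse = combine([self.pattern_b.matches(source), self.pattern_a.matches(target)])
        return forward or reverse
```

With `logic: "or"` a rule fires as soon as either side matches, so it blocks far more candidates than the same patterns under `"and"`; that is worth keeping in mind when writing broad patterns such as `.*`.
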
### Real-World Examples
#### **Cross-Cloud Security Isolation**
```yaml
# Block Google Cloud from Azure Resource Graph
- entity_pattern_a:
    data_connector: "^gcp$" # Google Cloud Platform
    entity: ".*" # Any GCP entity
    field: ".*" # Any field
  entity_pattern_b:
    data_connector: "^arg$" # Azure Resource Graph
    entity: ".*" # Any Azure entity
    field: ".*" # Any field
  logic: "and"
  reason: "Complete isolation between cloud providers for security compliance"
```
#### **Environment Separation**
```yaml
# Block test environments from production sensitive data
- entity_pattern_a:
    data_connector: "^(test|dev)_.*"
    entity: ".*"
    field: ".*"
  entity_pattern_b:
    data_connector: "^prod_.*"
    entity: ".*"
    field: ".*(pii|ssn|credit_card|password).*"
  logic: "and"
  reason: "Prevent test/dev access to production sensitive data"
```
#### **Infrastructure Boundaries**
```yaml
# Block legacy systems from modern cloud infrastructure
- entity_pattern_a:
    data_connector: "^legacy_.*"
    entity: "^(mainframe|cobol)_.*"
    field: ".*"
  entity_pattern_b:
    data_connector: "^(gcp|aws|azure)_.*"
    entity: "^(kubernetes|container|serverless)_.*"
    field: ".*"
  logic: "and"
  reason: "Prevent direct legacy-to-cloud connections without proper integration layer"
```
### Configuration Validation
The system includes comprehensive validation:
```bash
# Validate your FK blacklist rules
ddn-metadata-bootstrap --validate-config
# Test specific blocking scenarios
ddn-metadata-bootstrap --test-fk-blocking
# Show compiled regex patterns
ddn-metadata-bootstrap --show-config --verbose
```
## 🤖 AI Provider Comparison
### Performance & Cost Comparison
| Provider | Speed | Cost | Quality | Best For |
|----------|-------|------|---------|----------|
| **Anthropic Claude Haiku** | ⚡⚡⚡ Very Fast | 💰 Low | ⭐⭐⭐⭐ High | Development, High Volume |
| **Anthropic Claude Sonnet** | ⚡⚡ Fast | 💰💰 Medium | ⭐⭐⭐⭐⭐ Excellent | Production, Balanced |
| **Anthropic Claude Opus** | ⚡ Medium | 💰💰💰 High | ⭐⭐⭐⭐⭐ Excellent | Critical Schemas |
| **OpenAI GPT-3.5 Turbo** | ⚡⚡⚡ Very Fast | 💰 Very Low | ⭐⭐⭐ Good | Development, Budget |
| **OpenAI GPT-4o Mini** | ⚡⚡⚡ Very Fast | 💰 Low | ⭐⭐⭐⭐ High | Production, Cost-Optimized |
| **OpenAI GPT-4** | ⚡⚡ Fast | 💰💰💰 High | ⭐⭐⭐⭐⭐ Excellent | Premium Quality |
| **Google Gemini Pro** | ⚡⚡ Fast | 💰 Very Low | ⭐⭐⭐⭐ High | Large Scale, Budget |
| **Google Gemini 1.5 Flash** | ⚡⚡⚡ Very Fast | 💰 Low | ⭐⭐⭐ Good | High Throughput |
### Provider-Specific Configuration Examples
#### Anthropic Claude (Recommended)
```yaml
ai_provider: "anthropic"
anthropic_model: "claude-3-haiku-20240307" # Fast & cost-effective
# anthropic_model: "claude-3-sonnet-20240229" # Balanced
# anthropic_model: "claude-3-opus-20240229" # Highest quality
# Anthropic-optimized settings
field_tokens: 30
system_prompt: |
  Generate concise, business-focused field descriptions.
  Focus on practical utility and clear business meaning.
```
#### OpenAI GPT (Cost-Optimized)
```yaml
ai_provider: "openai"
openai_model: "gpt-4o-mini" # Best balance of cost and quality
# openai_model: "gpt-3.5-turbo" # Most cost-effective
# openai_model: "gpt-4-turbo-preview" # Highest quality
# OpenAI-optimized settings
field_tokens: 25
system_prompt: |
  You are a technical writer creating database field descriptions.
  Be concise, specific, and business-focused.
```
#### Google Gemini (High Volume)
```yaml
ai_provider: "gemini"
gemini_model: "gemini-1.5-flash" # High throughput
# gemini_model: "gemini-pro" # Balanced
# gemini_model: "gemini-1.5-pro-latest" # Highest quality
# Gemini-optimized settings
field_tokens: 35
system_prompt: |
  Create clear, professional descriptions for database schema fields.
  Focus on business value and practical understanding.
```
## Enhanced Examples
### Multi-Provider Description Generation
#### Input Schema (HML)
```yaml
kind: ObjectType
version: v1
definition:
  name: ThreatAssessment
  fields:
    - name: riskId
      type: String!
    - name: mfaEnabled
      type: Boolean!
    - name: ssoConfig
      type: String
    - name: iamPolicy
      type: String
#### Output with Different Providers
##### Anthropic Claude (Business-Focused)
```yaml
kind: ObjectType
version: v1
definition:
  name: ThreatAssessment
  description: |
    Security risk evaluation and compliance status tracking for
    organizational threat management and regulatory oversight.
  fields:
    - name: riskId
      type: String!
      description: Risk assessment identifier for tracking security evaluations.
    - name: mfaEnabled
      type: Boolean!
      description: Multi-Factor Authentication enablement status for security policy compliance.
    - name: ssoConfig
      type: String
      description: Single Sign-On configuration settings for identity management.
    - name: iamPolicy
      type: String
      description: Identity and Access Management policy governing user permissions.
```
### Feature Control Examples
#### Descriptions Only (No Relationships)
```bash
# CLI
ddn-metadata-bootstrap --create-fk none --create-shared-keys none --create-descriptions true
# Config YAML
create_fk: none
create_shared_keys: none
create_command_relationship_hints: false
create_descriptions: true
```
#### Relationships Only (No Descriptions)
```bash
# CLI
ddn-metadata-bootstrap --create-descriptions false
# Config YAML
create_fk: all
create_shared_keys: all
create_command_relationship_hints: true
create_descriptions: false
```
#### Forward-Only Relationships (Reduced Complexity)
```yaml
# Config YAML
create_fk: forward
create_shared_keys: forward
create_command_relationship_hints: true
create_descriptions: true
```
#### Rebuild Mode (Start Fresh)
```bash
# CLI
ddn-metadata-bootstrap --rebuild-relationships
# Config YAML
rebuild_relationships: true
```
## Python API with Multi-Provider Support
```python
from ddn_metadata_bootstrap import BootstrapperConfig, MetadataBootstrapper
from ddn_metadata_bootstrap.description_generator import DescriptionGenerator
import logging
# Configure logging to see provider selection and caching
logging.basicConfig(level=logging.INFO)
# Method 1: Use configuration file
config = BootstrapperConfig(config_file="./config.yaml")
# Method 2: Programmatic feature control
config = BootstrapperConfig()
config.ai_provider = "openai"
config.openai_api_key = "your-openai-key"
config.openai_model = "gpt-4o-mini"
# Feature control
config.create_descriptions = True
config.create_fk = "all" # all|forward|none
config.create_shared_keys = "forward" # all|forward|none
config.create_command_relationship_hints = True
config.rebuild_relationships = False
# Method 3: Direct generator creation with provider
generator = DescriptionGenerator(
    api_key="your-api-key",
    model="claude-3-haiku-20240307",
    provider="anthropic"  # or "openai", "gemini"
)
# Create bootstrapper with feature control
bootstrapper = MetadataBootstrapper(config)
# Process directory with configured features
results = bootstrapper.process_directory(
    input_dir="./app/metadata",
    output_dir="./enhanced_metadata"
)
# Check what features were processed
processing_summary = config.get_processing_summary()
print(f"Processed: {processing_summary}")
# Get provider-specific statistics
stats = bootstrapper.get_statistics()
print(f"AI Provider: {stats['ai_provider']}")
print(f"Model Used: {stats['model_used']}")
print(f"Provider API Calls: {stats['provider_api_calls']}")
print(f"Provider Cost: ${stats['estimated_provider_cost']:.2f}")
```
## Enhanced Statistics & Monitoring
```python
# Feature-specific performance tracking
stats = bootstrapper.get_statistics()
# Feature processing summary
print(f"Processing Summary: {config.get_processing_summary()}")
print(f"Features Enabled:")
print(f" - Descriptions: {config.should_create_descriptions()}")
print(f" - FK Relationships: {config.should_create_fk_relationships()}")
print(f" - Shared Key Relationships: {config.should_create_shared_key_relationships()}")
print(f" - Command Hints: {config.should_create_command_relationship_hints()}")
print(f" - Rebuild Mode: {config.should_rebuild_relationships()}")
# AI Provider metrics
print(f"AI Provider: {stats['ai_provider']}")
print(f"Model: {stats['model_used']}")
print(f"Provider API calls: {stats['provider_api_calls']}")
print(f"Average response time: {stats['avg_response_time_ms']}ms")
print(f"Provider cost: ${stats['estimated_provider_cost']:.3f}")
# Relationship blocking statistics
if 'relationship_stats' in stats:
    rel_stats = stats['relationship_stats']
    print(f"Relationships considered: {rel_stats['relationships_considered']}")
    print(f"Relationships blocked: {rel_stats['relationships_blocked']}")
    print(f"FK blacklist hits: {rel_stats['fk_blacklist_hits']}")
    print(f"Cross-connector blocks: {rel_stats['cross_connector_blocks']}")
```
## Provider-Specific Performance Improvements
### Real-World Performance by Provider
#### Anthropic Claude
```bash
Provider: Anthropic Claude Haiku
Processing Features: descriptions, FK relationships (all), shared keys (forward)
Processing 500 fields...
✅ Strengths:
- Excellent business context understanding
- Consistent quality across attempts
- Good acronym expansion integration
- Fast response times (avg 850ms)
Results:
- API calls: 127 (after caching)
- Processing time: 2.1 minutes
- Average quality score: 82
- Cost: $0.89
```
#### Configuration-Based Performance
```bash
Feature Set: Descriptions only (relationships disabled)
Provider: OpenAI GPT-4o Mini
Processing 500 fields...
✅ Results:
- API calls: 89 (descriptions only)
- Processing time: 1.2 minutes
- Average quality score: 78
- Cost: $0.31
Feature Set: Relationships only (descriptions disabled)
Provider: Local processing
Processing 500 fields...
✅ Results:
- API calls: 0 (no AI needed)
- Processing time: 0.3 minutes
- Relationships generated: 247
- Cost: $0.00
```
## ⚙️ Advanced Multi-Provider Configuration
### Environment-Based Provider Selection
```bash
# Development environment - fast and cheap
export ENVIRONMENT="development"
export METADATA_BOOTSTRAP_AI_PROVIDER="openai"
export METADATA_BOOTSTRAP_CREATE_DESCRIPTIONS="true"
export METADATA_BOOTSTRAP_CREATE_FK="forward"
export METADATA_BOOTSTRAP_CREATE_SHARED_KEYS="none"
# Staging environment - balanced
export ENVIRONMENT="staging"
export METADATA_BOOTSTRAP_AI_PROVIDER="anthropic"
export METADATA_BOOTSTRAP_CREATE_FK="all"
export METADATA_BOOTSTRAP_CREATE_SHARED_KEYS="forward"
# Production environment - comprehensive
export ENVIRONMENT="production"
export METADATA_BOOTSTRAP_AI_PROVIDER="anthropic"
export METADATA_BOOTSTRAP_ANTHROPIC_MODEL="claude-3-sonnet-20240229"
export METADATA_BOOTSTRAP_CREATE_FK="all"
export METADATA_BOOTSTRAP_CREATE_SHARED_KEYS="all"
export METADATA_BOOTSTRAP_REBUILD_RELATIONSHIPS="true"
```
## 🧪 Testing Multi-Provider Features
```bash
# Test all providers
pytest tests/test_multi_provider.py -v
# Test feature control system
pytest tests/test_feature_flags.py -v
# Test provider-specific optimizations
pytest tests/test_provider_optimization.py -v
# Test configuration validation for all providers
pytest tests/test_provider_config.py -v
# Test FK blacklist functionality
pytest tests/test_fk_blacklist.py -v
# Run performance benchmarks across providers
pytest tests/benchmark_providers.py -v --benchmark-only
```
## Contributing
### Multi-Provider Development Areas
1. **Provider Integration**
- Additional AI provider support (Claude-4, GPT-5, etc.)
- Provider-specific optimization algorithms
- Custom model fine-tuning support
2. **Feature Control Enhancements**
- Advanced processing pipelines
- Conditional feature dependencies
- Performance profiling per feature
3. **Performance Optimization**
- Provider-specific prompt engineering
- Dynamic provider selection based on workload
- Cost optimization strategies
4. **Quality Assessment**
- Provider-specific quality metrics
- Cross-provider quality comparison
- A/B testing frameworks
5. **Relationship Blocking**
- Visual rule builder for FK blacklists
- Rule impact analysis and testing
- Advanced pattern matching algorithms
## License
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
## Support
- [Documentation](https://github.com/hasura/ddn-metadata-bootstrap#readme)
- [Bug Reports](https://github.com/hasura/ddn-metadata-bootstrap/issues)
- 💬 [Discussions](https://github.com/hasura/ddn-metadata-bootstrap/discussions)
- 🤖 [AI Provider Issues](https://github.com/hasura/ddn-metadata-bootstrap/issues?q=label%3Aai-provider)
- 🎯 [Feature Control Issues](https://github.com/hasura/ddn-metadata-bootstrap/issues?q=label%3Afeature-control)
- 🧠 [Caching Issues](https://github.com/hasura/ddn-metadata-bootstrap/issues?q=label%3Acaching)
- [Quality Assessment Issues](https://github.com/hasura/ddn-metadata-bootstrap/issues?q=label%3Aquality)
- [Relationship Blocking Issues](https://github.com/hasura/ddn-metadata-bootstrap/issues?q=label%3Arelationship-blocking)
## 🏷️ Version History
See [CHANGELOG.md](CHANGELOG.md) for complete version history and breaking changes.
## ⭐ Acknowledgments
- Built for [Hasura DDN](https://hasura.io/ddn)
- Powered by [Anthropic Claude](https://www.anthropic.com/), [OpenAI GPT](https://openai.com/), and [Google Gemini](https://deepmind.google/technologies/gemini/)
- Linguistic analysis powered by [NLTK](https://www.nltk.org/) and [WordNet](https://wordnet.princeton.edu/)
- Inspired by the GraphQL and OpenAPI communities
- Caching algorithms inspired by database query optimization techniques
- Relationship blocking patterns inspired by enterprise security frameworks
---
Made with ❤️ by the Hasura team
Raw data
{
"_id": null,
"home_page": null,
"name": "ddn-metadata-bootstrap",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.8",
"maintainer_email": "Kenneth Stott <kenneth@hasura.io>",
"keywords": "hasura, ddn, graphql, schema, metadata, ai, anthropic, descriptions, relationships",
"author": null,
"author_email": "Kenneth Stott <kenneth@hasura.io>",
"download_url": "https://files.pythonhosted.org/packages/b9/9f/081634512d2d22db02f5b15f458de90a674dc4a9ffe3be7e99d5de81e211/ddn_metadata_bootstrap-1.0.16.tar.gz",
"platform": null,
"description": "# DDN Metadata Bootstrap\n\n[](https://badge.fury.io/py/ddn-metadata-bootstrap)\n[](https://pypi.org/project/ddn-metadata-bootstrap/)\n[](https://opensource.org/licenses/MIT)\n\nAI-powered metadata enhancement for Hasura DDN (Data Delivery Network) schema files. Automatically generate high-quality descriptions and detect sophisticated relationships in your YAML/HML schema definitions using advanced AI with comprehensive configuration management.\n\n## \ud83d\ude80 Features\n\n### \ud83e\udd16 **Multi-Provider AI Support**\n- **Anthropic Claude**: Default provider with claude-3-haiku, claude-3-sonnet, and claude-3-opus models\n- **OpenAI GPT**: Support for gpt-3.5-turbo, gpt-4, gpt-4o-mini, and latest models\n- **Google Gemini**: Support for gemini-pro, gemini-1.5-pro, and gemini-1.5-flash models\n- **Automatic Fallback**: Graceful degradation between providers with configurable priorities\n- **Provider-Specific Optimization**: Model-specific prompting and parameter tuning\n\n### \ud83c\udfaf **Granular Feature Control**\n- **Individual Feature Flags**: Control each processing feature independently\n- **Flexible Processing Modes**: Choose between all, forward-only, or none for relationships\n- **Selective Enhancement**: Process only descriptions, only relationships, or both\n- **Rebuild Capabilities**: Rebuild existing relationships from scratch when needed\n\n### \ud83e\udde0 **Advanced AI Generation**\n- **Quality Assessment with Retry Logic**: Multi-attempt generation with configurable scoring thresholds\n- **Context-Aware Business Descriptions**: Domain-specific system prompts with industry context\n- **Smart Field Analysis**: Automatically detects and skips self-explanatory, generic, or cryptic fields\n- **Configurable Length Controls**: Precise control over description length and token usage\n\n### \ud83e\udde0 **Intelligent Caching System** \n- **Similarity-Based Matching**: Reuses descriptions for similar fields across entities (85% 
similarity threshold)\n- **Performance Optimization**: Reduces API calls by up to 70% on large schemas through intelligent caching\n- **Cache Statistics**: Real-time performance monitoring with hit rates and API cost savings tracking\n- **Type-Aware Matching**: Considers field types and entity context for better cache accuracy\n\n### \ud83d\udd0d **WordNet-Based Linguistic Analysis**\n- **Generic Term Detection**: Uses NLTK and WordNet for sophisticated term analysis to skip meaningless fields\n- **Semantic Density Analysis**: Evaluates conceptual richness and specificity of field names\n- **Definition Quality Scoring**: Ensures meaningful, non-circular descriptions through linguistic validation\n- **Abstraction Level Calculation**: Determines appropriate description depth based on semantic analysis\n\n### \ud83d\udcdd **Enhanced Acronym Expansion**\n- **Comprehensive Mappings**: 200+ pre-configured acronyms for technology, finance, and business domains\n- **Context-Aware Expansion**: Industry-specific acronym interpretation based on domain context\n- **Pre-Generation Enhancement**: Expands acronyms BEFORE AI generation for better context\n- **Custom Domain Support**: Fully configurable acronym mappings via YAML configuration\n\n### \ud83d\udd17 **Advanced Relationship Detection**\n- **Template-Based FK Detection**: Sophisticated foreign key detection with confidence scoring and semantic validation\n- **Shared Business Key Relationships**: Many-to-many relationships via shared field analysis with FK-aware filtering\n- **Cross-Subgraph Intelligence**: Smart entity matching across different subgraphs\n- **Configurable Templates**: Flexible FK template patterns with placeholders for complex naming conventions\n- **Advanced Relationship Blocking**: Precision rule-based system to prevent inappropriate cross-connector relationships\n\n### \u2699\ufe0f **Comprehensive Configuration System**\n- **YAML-First Configuration**: Central `config.yaml` file for all settings with 
full documentation\n- **Waterfall Precedence**: CLI args > Environment variables > config.yaml > defaults\n- **Configuration Validation**: Comprehensive validation with helpful error messages and source tracking\n- **Feature Toggles**: Granular control over processing features with clear flag names\n\n### \ud83c\udfaf **Advanced Quality Controls**\n- **Buzzword Detection**: Avoids corporate jargon and meaningless generic terms\n- **Pattern-Based Filtering**: Regex-based rejection of poor description formats\n- **Technical Language Translation**: Converts technical terms to business-friendly language\n- **Length Optimization**: Multiple validation layers with hard limits and target lengths\n\n### \ud83d\udd0d **Intelligent Field Selection**\n- **Generic Field Detection**: Skips overly common fields that don't benefit from descriptions\n- **Cryptic Abbreviation Handling**: Configurable handling of unclear field names with vowel analysis\n- **Self-Explanatory Pattern Recognition**: Automatically identifies fields that don't need descriptions\n- **Value Assessment**: Only generates descriptions that add meaningful business value\n\n## \ud83d\udce6 Installation\n\n### From PyPI (Recommended)\n\n```bash\npip install ddn-metadata-bootstrap\n```\n\n### Provider-Specific Dependencies\n\nThe tool supports multiple AI providers. 
Install the dependencies for your chosen provider:\n\n```bash\n# For Anthropic Claude (default)\npip install ddn-metadata-bootstrap[anthropic]\n# or separately:\npip install anthropic\n\n# For OpenAI GPT \npip install ddn-metadata-bootstrap[openai]\n# or separately:\npip install openai\n\n# For Google Gemini\npip install ddn-metadata-bootstrap[gemini]\n# or separately: \npip install google-generativeai\n\n# Install all providers\npip install ddn-metadata-bootstrap[all]\n```\n\n### From Source\n\n```bash\ngit clone https://github.com/hasura/ddn-metadata-bootstrap.git\ncd ddn-metadata-bootstrap\npip install -e .\n```\n\n## \ud83c\udfc3 Quick Start\n\n### 1. Choose Your AI Provider\n\n#### Option A: Anthropic Claude (Default - Recommended)\n```bash\nexport ANTHROPIC_API_KEY=\"your-anthropic-api-key\"\nexport METADATA_BOOTSTRAP_AI_PROVIDER=\"anthropic\" # Optional (default)\nexport METADATA_BOOTSTRAP_ANTHROPIC_MODEL=\"claude-3-haiku-20240307\" # Optional\n```\n\n#### Option B: OpenAI GPT\n```bash\nexport OPENAI_API_KEY=\"your-openai-api-key\" \nexport METADATA_BOOTSTRAP_AI_PROVIDER=\"openai\"\nexport METADATA_BOOTSTRAP_OPENAI_MODEL=\"gpt-3.5-turbo\" # Optional\n```\n\n#### Option C: Google Gemini\n```bash\nexport GEMINI_API_KEY=\"your-gemini-api-key\"\n# or alternatively:\nexport GOOGLE_API_KEY=\"your-gemini-api-key\"\nexport METADATA_BOOTSTRAP_AI_PROVIDER=\"gemini\"\nexport METADATA_BOOTSTRAP_GEMINI_MODEL=\"gemini-pro\" # Optional\n```\n\n### 2. Set up your directories\n\n```bash\nexport METADATA_BOOTSTRAP_INPUT_DIR=\"./app/metadata\"\nexport METADATA_BOOTSTRAP_OUTPUT_DIR=\"./enhanced_metadata\"\n```\n\n### 3. 
Create a configuration file (Recommended)\n\nCreate a `config.yaml` file in your project directory:\n\n```yaml\n# config.yaml - DDN Metadata Bootstrap Configuration\n\n# =============================================================================\n# GLOBAL PROCESSING CONFIGURATION\n# =============================================================================\n# Controls which features are enabled and basic processing behavior\n\n# Feature Flags - what processing to perform\ncreate_fk: all # all|forward|none - FK relationships\ncreate_shared_keys: all # all|forward|none - Shared key relationships\ncreate_command_relationship_hints: true # true|false - Command relationship hints\ncreate_descriptions: true # true|false - AI-generated descriptions\nrebuild_relationships: false # true|false - Rebuild existing relationships from scratch\n\nenable_quality_assessment: true # Enable AI to score and improve its own descriptions\n\n# AI Provider Configuration\nai_provider: \"anthropic\" # Choose: anthropic, openai, gemini\n\n# Provider-specific API keys (alternatively set via environment variables)\n# anthropic_api_key: \"your-anthropic-key\"\n# openai_api_key: \"your-openai-key\" \n# gemini_api_key: \"your-gemini-key\"\n\n# Provider-specific models\nanthropic_model: \"claude-3-haiku-20240307\" # claude-3-sonnet-20240229, claude-3-opus-20240229\nopenai_model: \"gpt-3.5-turbo\" # gpt-4, gpt-4o-mini, gpt-4-turbo-preview\ngemini_model: \"gemini-pro\" # gemini-1.5-pro-latest, gemini-1.5-flash\n\n# =============================================================================\n# DESCRIPTION GENERATION CONFIGURATION\n# =============================================================================\n\n# Domain-specific system prompt for your organization\nsystem_prompt: |\n You generate concise field descriptions for database schema metadata at a global financial services firm.\n \n DOMAIN CONTEXT:\n - Organization: Global bank\n - Department: Cybersecurity operations \n - Use case: 
Risk management and security compliance\n - Regulatory environment: Financial services (SOX, Basel III, GDPR, etc.)\n \n Think: \"What would a cybersecurity analyst at a bank need to know about this field?\"\n\n# Token and length limits\nfield_tokens: 25 # Max tokens AI can generate for field descriptions\nkind_tokens: 50 # Max tokens AI can generate for kind descriptions\nfield_desc_max_length: 120 # Maximum total characters for field descriptions\nkind_desc_max_length: 250 # Maximum total characters for entity descriptions\n\n# Quality thresholds\nminimum_description_score: 70 # Minimum score (0-100) to accept a description\nmax_description_retry_attempts: 3 # How many times to retry for better quality\n\n# =============================================================================\n# ENHANCED ACRONYM EXPANSION\n# =============================================================================\nacronym_mappings:\n # Technology & Computing\n api: \"Application Programming Interface\"\n ui: \"User Interface\"\n db: \"Database\"\n \n # Security & Access Management\n mfa: \"Multi-Factor Authentication\"\n sso: \"Single Sign-On\"\n iam: \"Identity and Access Management\"\n siem: \"Security Information and Event Management\"\n \n # Financial Services & Compliance\n pci: \"Payment Card Industry\"\n sox: \"Sarbanes-Oxley Act\"\n kyc: \"Know-Your-Customer\"\n aml: \"Anti-Money Laundering\"\n # ... 
200+ total mappings available\n\n# =============================================================================\n# INTELLIGENT FIELD SELECTION\n# =============================================================================\n# Fields to skip entirely - these will not get descriptions at all\nskip_field_patterns:\n - \"^id$\"\n - \"^_id$\"\n - \"^uuid$\"\n - \"^created_at$\"\n - \"^updated_at$\"\n - \"^debug_.*\"\n - \"^test_.*\"\n - \"^temp_.*\"\n\n# Generic fields - won't get unique descriptions (too common)\ngeneric_fields:\n - \"id\"\n - \"key\"\n - \"uid\"\n - \"guid\"\n - \"name\"\n\n# Self-explanatory fields - simple patterns that don't need descriptions\nself_explanatory_patterns:\n - '^id$'\n - '^_id$'\n - '^guid$'\n - '^uuid$'\n - '^key$'\n\n# Cryptic Field Handling\nskip_cryptic_abbreviations: true # Skip fields with unclear abbreviations\nskip_ultra_short_fields: true # Skip very short field names that are likely abbreviations\nmax_cryptic_field_length: 4 # Field names this length or shorter are considered cryptic\n\n# Content quality controls\nbuzzwords: [\n 'synergy', 'leverage', 'paradigm', 'ecosystem',\n 'contains', 'stores', 'holds', 'represents'\n]\n\nforbidden_patterns: [\n 'this\\\\s+field\\\\s+represents',\n 'used\\\\s+to\\\\s+(track|manage|identify)',\n 'business.*information'\n]\n\n# =============================================================================\n# RELATIONSHIP DETECTION\n# =============================================================================\n# FK Template Patterns for relationship detection\n# Format: \"{pk_pattern}|{fk_pattern}\"\n# Placeholders: {gi}=generic_id, {pt}=primary_table, {ps}=primary_subgraph, {pm}=prefix_modifier\nfk_templates:\n - \"{gi}|{pm}_{pt}_{gi}\" # active_service_name \u2192 Services.name\n - \"{gi}|{pt}_{gi}\" # user_id \u2192 Users.id\n - \"{pt}_{gi}|{pm}_{pt}_{gi}\" # user_id \u2192 ActiveUsers.active_user_id\n\n# 
=============================================================================\n# ADVANCED RELATIONSHIP BLOCKING\n# =============================================================================\n# Precision rule-based system to prevent inappropriate relationships\n# Uses bidirectional validation with data_connector + entity + field pattern matching\nfk_key_blacklist:\n # Block cross-cloud provider connections with infrastructure fields\n - entity_pattern_a:\n data_connector: \"^(gcp|arg|various)$\" # Google Cloud Platform, Azure Resource Graph, Various\n entity: \"^(gcp_|google_).*\" # Google/GCP entities\n field: \".*\" # Any field\n entity_pattern_b:\n data_connector: \"^(gcp|arg|various)$\" \n entity: \"^(az_|azure_).*\" # Azure entities\n field: \".*(resource|project|policy|storage|compute).*\" # Infrastructure fields only\n logic: \"and\"\n reason: \"Block google/gcp entities from connecting to azure entities with infrastructure-related fields\"\n \n # Complete isolation between major cloud platforms\n - entity_pattern_a:\n data_connector: \"^gcp$\" # Google Cloud Platform connector\n entity: \".*\" # Any entity\n field: \".*\" # Any field\n entity_pattern_b:\n data_connector: \"^arg$\" # Azure Resource Graph connector \n entity: \".*\" # Any entity\n field: \".*\" # Any field\n logic: \"and\"\n reason: \"Block all connections between Google Cloud Platform and Azure Resource Graph connectors\"\n\n# Shared relationship limits\nmax_shared_relationships: 10000\nmax_shared_per_entity: 10\nmin_shared_confidence: 30\n\n# Shared Key Rejection Patterns - fields matching these won't be used for shared relationships\nshared_key_rejection_patterns:\n # Private/Technical Fields (leading underscore indicates internal use)\n - \"^_.*$\"\n # Primary Identifiers (too generic for meaningful relationships)\n - \"^_?(id|key)$\"\n # Generic Classification Fields (overly broad categorization)\n - \"^(name|type|category|title|code|level|kind)$\"\n # State/Status Fields (frequently 
changing, not structural)\n - \"^(status|state|active)$\"\n # Audit Fields - Temporal Only (timestamp-based, not relational)\n - \"^(created|updated|modified)(_at|_date|_time|_timestamp)?$\"\n```\n\n### 4. Run the tool with your chosen provider\n\n```bash\n# Use default provider (Anthropic) with default settings\nddn-metadata-bootstrap\n\n# Use OpenAI explicitly\nddn-metadata-bootstrap --ai-provider openai --openai-api-key your-key\n\n# Use Gemini with specific model\nddn-metadata-bootstrap --ai-provider gemini --gemini-model gemini-1.5-pro\n\n# Show configuration including AI provider setup\nddn-metadata-bootstrap --show-config\n\n# Test your AI provider connection\nddn-metadata-bootstrap --test-provider\n\n# Process only relationships (skip descriptions)\nddn-metadata-bootstrap --create-descriptions false\n\n# Process only descriptions (skip relationships)\nddn-metadata-bootstrap --create-fk none --create-shared-keys none\n\n# Rebuild all relationships from scratch\nddn-metadata-bootstrap --rebuild-relationships\n\n# Use custom configuration file\nddn-metadata-bootstrap --config custom-config.yaml\n\n# Enable verbose logging to see AI provider selection and caching\nddn-metadata-bootstrap --verbose\n```\n\n## \ud83c\udfaf Feature Control System\n\nThe tool provides granular control over each processing feature through clean, intuitive flags:\n\n### Core Processing Features\n\n| Feature | Config Key | Values | Description |\n|---------|------------|--------|-------------|\n| **FK Relationships** | `create_fk` | `all`, `forward`, `none` | Foreign key relationship detection |\n| **Shared Key Relationships** | `create_shared_keys` | `all`, `forward`, `none` | Shared field relationship detection |\n| **Command Hints** | `create_command_relationship_hints` | `true`, `false` | Command relationship hints |\n| **Descriptions** | `create_descriptions` | `true`, `false` | AI-generated descriptions |\n| **Rebuild Mode** | `rebuild_relationships` | `true`, `false` | Rebuild 
existing relationships |\n\n### Processing Modes\n\n#### **All Mode** (`all`)\n- Creates relationships in both directions\n- Full bidirectional relationship graph\n- Best for comprehensive schema understanding\n\n#### **Forward Mode** (`forward`)\n- Creates relationships in the forward direction only\n- Reduces relationship complexity\n- Useful for directed schema analysis\n\n#### **None Mode** (`none`)\n- Skips the feature entirely\n- Fastest processing\n- Use when the feature is not needed\n\n### Feature Combinations\n\n```yaml\n# Only generate descriptions (no relationships)\ncreate_fk: none\ncreate_shared_keys: none\ncreate_descriptions: true\n\n# Only generate FK relationships (no descriptions or shared keys)\ncreate_fk: all\ncreate_shared_keys: none\ncreate_descriptions: false\n\n# Minimal processing (relationships only, forward direction)\ncreate_fk: forward\ncreate_shared_keys: forward\ncreate_descriptions: false\n\n# Full processing with rebuild\ncreate_fk: all\ncreate_shared_keys: all\ncreate_descriptions: true\nrebuild_relationships: true\n```\n\n## \ud83d\udd17 Advanced Relationship Blocking System\n\nThe tool includes a sophisticated **bidirectional relationship blocking system** that prevents inappropriate foreign key relationships from being generated. 
This is particularly important in enterprise environments with multiple data connectors, cloud providers, and security boundaries.\n\n### Key Features\n\n#### **Precision Pattern Matching**\nEach blocking rule uses three-part patterns for maximum precision:\n- **Data Connector**: Regex pattern matching the connector name (e.g., `^gcp$`, `^(test|dev)_.*`)\n- **Entity Name**: Regex pattern matching the entity/table name (e.g., `^google_.*`, `^azure_storage.*`)\n- **Field Name**: Regex pattern matching the field name (e.g., `.*resource.*`, `.*secret.*`)\n\n#### **Bidirectional Validation**\nRules automatically check both directions of a relationship:\n- **Pattern A \u2192 Pattern B**: `google_compute` \u2192 `azure_storage_resource`\n- **Pattern B \u2192 Pattern A**: `azure_vm` \u2192 `google_analytics_data`\n\nBoth directions are blocked by a single rule definition.\n\n#### **Flexible Logic Operators**\n- **AND Logic**: All patterns (connector AND entity AND field) must match for both sides\n- **OR Logic**: Either side matching its full pattern triggers the block\n\n### Real-World Examples\n\n#### **Cross-Cloud Security Isolation**\n```yaml\n# Block Google Cloud from Azure Resource Graph\n- entity_pattern_a:\n data_connector: \"^gcp$\" # Google Cloud Platform\n entity: \".*\" # Any GCP entity\n field: \".*\" # Any field\n entity_pattern_b:\n data_connector: \"^arg$\" # Azure Resource Graph \n entity: \".*\" # Any Azure entity\n field: \".*\" # Any field\n logic: \"and\"\n reason: \"Complete isolation between cloud providers for security compliance\"\n```\n\n#### **Environment Separation**\n```yaml\n# Block test environments from production sensitive data\n- entity_pattern_a:\n data_connector: \"^(test|dev)_.*\"\n entity: \".*\"\n field: \".*\"\n entity_pattern_b:\n data_connector: \"^prod_.*\"\n entity: \".*\"\n field: \".*(pii|ssn|credit_card|password).*\"\n logic: \"and\"\n reason: \"Prevent test/dev access to production sensitive data\"\n```\n\n#### 
**Infrastructure Boundaries**\n```yaml\n# Block legacy systems from modern cloud infrastructure\n- entity_pattern_a:\n data_connector: \"^legacy_.*\"\n entity: \"^(mainframe|cobol)_.*\"\n field: \".*\"\n entity_pattern_b:\n data_connector: \"^(gcp|aws|azure)_.*\"\n entity: \"^(kubernetes|container|serverless)_.*\"\n field: \".*\"\n logic: \"and\"\n reason: \"Prevent direct legacy-to-cloud connections without proper integration layer\"\n```\n\n### Configuration Validation\n\nThe system includes comprehensive validation:\n\n```bash\n# Validate your FK blacklist rules\nddn-metadata-bootstrap --validate-config\n\n# Test specific blocking scenarios\nddn-metadata-bootstrap --test-fk-blocking\n\n# Show compiled regex patterns\nddn-metadata-bootstrap --show-config --verbose\n```\n\n## \ud83e\udd16 AI Provider Comparison\n\n### Performance & Cost Comparison\n\n| Provider | Speed | Cost | Quality | Best For |\n|----------|-------|------|---------|----------|\n| **Anthropic Claude Haiku** | \u26a1\u26a1\u26a1 Very Fast | \ud83d\udcb0 Low | \u2b50\u2b50\u2b50\u2b50 High | Development, High Volume |\n| **Anthropic Claude Sonnet** | \u26a1\u26a1 Fast | \ud83d\udcb0\ud83d\udcb0 Medium | \u2b50\u2b50\u2b50\u2b50\u2b50 Excellent | Production, Balanced |\n| **Anthropic Claude Opus** | \u26a1 Medium | \ud83d\udcb0\ud83d\udcb0\ud83d\udcb0 High | \u2b50\u2b50\u2b50\u2b50\u2b50 Excellent | Critical Schemas |\n| **OpenAI GPT-3.5 Turbo** | \u26a1\u26a1\u26a1 Very Fast | \ud83d\udcb0 Very Low | \u2b50\u2b50\u2b50 Good | Development, Budget |\n| **OpenAI GPT-4o Mini** | \u26a1\u26a1\u26a1 Very Fast | \ud83d\udcb0 Low | \u2b50\u2b50\u2b50\u2b50 High | Production, Cost-Optimized |\n| **OpenAI GPT-4** | \u26a1\u26a1 Fast | \ud83d\udcb0\ud83d\udcb0\ud83d\udcb0 High | \u2b50\u2b50\u2b50\u2b50\u2b50 Excellent | Premium Quality |\n| **Google Gemini Pro** | \u26a1\u26a1 Fast | \ud83d\udcb0 Very Low | \u2b50\u2b50\u2b50\u2b50 High | Large Scale, Budget |\n| **Google Gemini 1.5 Flash** | 
\u26a1\u26a1\u26a1 Very Fast | \ud83d\udcb0 Low | \u2b50\u2b50\u2b50 Good | High Throughput |\n\n### Provider-Specific Configuration Examples\n\n#### Anthropic Claude (Recommended)\n```yaml\nai_provider: \"anthropic\"\nanthropic_model: \"claude-3-haiku-20240307\" # Fast & cost-effective\n# anthropic_model: \"claude-3-sonnet-20240229\" # Balanced\n# anthropic_model: \"claude-3-opus-20240229\" # Highest quality\n\n# Anthropic-optimized settings\nfield_tokens: 30\nsystem_prompt: |\n Generate concise, business-focused field descriptions.\n Focus on practical utility and clear business meaning.\n```\n\n#### OpenAI GPT (Cost-Optimized)\n```yaml\nai_provider: \"openai\"\nopenai_model: \"gpt-4o-mini\" # Best balance of cost and quality\n# openai_model: \"gpt-3.5-turbo\" # Most cost-effective\n# openai_model: \"gpt-4-turbo-preview\" # Highest quality\n\n# OpenAI-optimized settings\nfield_tokens: 25\nsystem_prompt: |\n You are a technical writer creating database field descriptions.\n Be concise, specific, and business-focused.\n```\n\n#### Google Gemini (High Volume)\n```yaml\nai_provider: \"gemini\"\ngemini_model: \"gemini-1.5-flash\" # High throughput\n# gemini_model: \"gemini-pro\" # Balanced\n# gemini_model: \"gemini-1.5-pro-latest\" # Highest quality\n\n# Gemini-optimized settings\nfield_tokens: 35\nsystem_prompt: |\n Create clear, professional descriptions for database schema fields.\n Focus on business value and practical understanding.\n```\n\n## \ud83d\udcdd Enhanced Examples\n\n### Multi-Provider Description Generation\n\n#### Input Schema (HML)\n```yaml\nkind: ObjectType\nversion: v1\ndefinition:\n name: ThreatAssessment\n fields:\n - name: riskId\n type: String!\n - name: mfaEnabled\n type: Boolean!\n - name: ssoConfig\n type: String\n - name: iamPolicy\n type: String\n```\n\n#### Output with Different Providers\n\n##### Anthropic Claude (Business-Focused)\n```yaml\nkind: ObjectType\nversion: v1\ndefinition:\n name: ThreatAssessment\n description: |\n Security 
risk evaluation and compliance status tracking for \n organizational threat management and regulatory oversight.\n fields:\n - name: riskId\n type: String!\n description: Risk assessment identifier for tracking security evaluations.\n - name: mfaEnabled\n type: Boolean!\n description: Multi-Factor Authentication enablement status for security policy compliance.\n - name: ssoConfig\n type: String\n description: Single Sign-On configuration settings for identity management.\n - name: iamPolicy\n type: String\n description: Identity and Access Management policy governing user permissions.\n```\n\n### Feature Control Examples\n\n#### Descriptions Only (No Relationships)\n```bash\n# CLI\nddn-metadata-bootstrap --create-fk none --create-shared-keys none --create-descriptions true\n\n# Config YAML\ncreate_fk: none\ncreate_shared_keys: none\ncreate_command_relationship_hints: false\ncreate_descriptions: true\n```\n\n#### Relationships Only (No Descriptions)\n```bash\n# CLI\nddn-metadata-bootstrap --create-descriptions false\n\n# Config YAML\ncreate_fk: all\ncreate_shared_keys: all\ncreate_command_relationship_hints: true\ncreate_descriptions: false\n```\n\n#### Forward-Only Relationships (Reduced Complexity)\n```bash\n# Config YAML\ncreate_fk: forward\ncreate_shared_keys: forward\ncreate_command_relationship_hints: true\ncreate_descriptions: true\n```\n\n#### Rebuild Mode (Start Fresh)\n```bash\n# CLI\nddn-metadata-bootstrap --rebuild-relationships\n\n# Config YAML\nrebuild_relationships: true\n```\n\n## \ud83d\udc0d Python API with Multi-Provider Support\n\n```python\nfrom ddn_metadata_bootstrap import BootstrapperConfig, MetadataBootstrapper\nfrom ddn_metadata_bootstrap.description_generator import DescriptionGenerator\nimport logging\n\n# Configure logging to see provider selection and caching\nlogging.basicConfig(level=logging.INFO)\n\n# Method 1: Use configuration file\nconfig = BootstrapperConfig(config_file=\"./config.yaml\")\n\n# Method 2: Programmatic feature 
control\nconfig = BootstrapperConfig()\nconfig.ai_provider = \"openai\"\nconfig.openai_api_key = \"your-openai-key\"\nconfig.openai_model = \"gpt-4o-mini\"\n\n# Feature control\nconfig.create_descriptions = True\nconfig.create_fk = \"all\" # all|forward|none\nconfig.create_shared_keys = \"forward\" # all|forward|none\nconfig.create_command_relationship_hints = True\nconfig.rebuild_relationships = False\n\n# Method 3: Direct generator creation with provider\ngenerator = DescriptionGenerator(\n api_key=\"your-api-key\",\n model=\"claude-3-haiku-20240307\",\n provider=\"anthropic\" # or \"openai\", \"gemini\"\n)\n\n# Create bootstrapper with feature control\nbootstrapper = MetadataBootstrapper(config)\n\n# Process directory with configured features\nresults = bootstrapper.process_directory(\n input_dir=\"./app/metadata\",\n output_dir=\"./enhanced_metadata\"\n)\n\n# Check what features were processed\nprocessing_summary = config.get_processing_summary()\nprint(f\"Processed: {processing_summary}\")\n\n# Get provider-specific statistics\nstats = bootstrapper.get_statistics()\nprint(f\"AI Provider: {stats['ai_provider']}\")\nprint(f\"Model Used: {stats['model_used']}\")\nprint(f\"Provider API Calls: {stats['provider_api_calls']}\")\nprint(f\"Provider Cost: ${stats['estimated_provider_cost']:.2f}\")\n```\n\n## \ud83d\udcca Enhanced Statistics & Monitoring\n\n```python\n# Feature-specific performance tracking\nstats = bootstrapper.get_statistics()\n\n# Feature processing summary\nprint(f\"Processing Summary: {config.get_processing_summary()}\")\nprint(f\"Features Enabled:\")\nprint(f\" - Descriptions: {config.should_create_descriptions()}\")\nprint(f\" - FK Relationships: {config.should_create_fk_relationships()}\")\nprint(f\" - Shared Key Relationships: {config.should_create_shared_key_relationships()}\")\nprint(f\" - Command Hints: {config.should_create_command_relationship_hints()}\")\nprint(f\" - Rebuild Mode: {config.should_rebuild_relationships()}\")\n\n# AI Provider 
metrics\nprint(f\"AI Provider: {stats['ai_provider']}\")\nprint(f\"Model: {stats['model_used']}\")\nprint(f\"Provider API calls: {stats['provider_api_calls']}\")\nprint(f\"Average response time: {stats['avg_response_time_ms']}ms\")\nprint(f\"Provider cost: ${stats['estimated_provider_cost']:.3f}\")\n\n# Relationship blocking statistics\nif 'relationship_stats' in stats:\n rel_stats = stats['relationship_stats']\n print(f\"Relationships considered: {rel_stats['relationships_considered']}\")\n print(f\"Relationships blocked: {rel_stats['relationships_blocked']}\")\n print(f\"FK blacklist hits: {rel_stats['fk_blacklist_hits']}\")\n print(f\"Cross-connector blocks: {rel_stats['cross_connector_blocks']}\")\n```\n\n## \ud83d\ude80 Provider-Specific Performance Improvements\n\n### Real-World Performance by Provider\n\n#### Anthropic Claude\n```bash\nProvider: Anthropic Claude Haiku\nProcessing Features: descriptions, FK relationships (all), shared keys (forward)\nProcessing 500 fields...\n\u2705 Strengths:\n- Excellent business context understanding\n- Consistent quality across attempts\n- Good acronym expansion integration\n- Fast response times (avg 850ms)\n\n\ud83d\udcca Results:\n- API calls: 127 (after caching)\n- Processing time: 2.1 minutes \n- Average quality score: 82\n- Cost: $0.89\n```\n\n#### Configuration-Based Performance\n```bash\nFeature Set: Descriptions only (relationships disabled)\nProvider: OpenAI GPT-4o Mini\nProcessing 500 fields...\n\u2705 Results:\n- API calls: 89 (descriptions only)\n- Processing time: 1.2 minutes\n- Average quality score: 78\n- Cost: $0.31\n\nFeature Set: Relationships only (descriptions disabled)\nProvider: Local processing\nProcessing 500 fields...\n\u2705 Results:\n- API calls: 0 (no AI needed)\n- Processing time: 0.3 minutes\n- Relationships generated: 247\n- Cost: $0.00\n```\n\n## \u2699\ufe0f Advanced Multi-Provider Configuration\n\n### Environment-Based Provider Selection\n\n```bash\n# Development environment - fast and 
cheap\nexport ENVIRONMENT=\"development\"\nexport METADATA_BOOTSTRAP_AI_PROVIDER=\"openai\"\nexport METADATA_BOOTSTRAP_CREATE_DESCRIPTIONS=\"true\"\nexport METADATA_BOOTSTRAP_CREATE_FK=\"forward\"\nexport METADATA_BOOTSTRAP_CREATE_SHARED_KEYS=\"none\"\n\n# Staging environment - balanced \nexport ENVIRONMENT=\"staging\"\nexport METADATA_BOOTSTRAP_AI_PROVIDER=\"anthropic\"\nexport METADATA_BOOTSTRAP_CREATE_FK=\"all\"\nexport METADATA_BOOTSTRAP_CREATE_SHARED_KEYS=\"forward\"\n\n# Production environment - comprehensive\nexport ENVIRONMENT=\"production\"\nexport METADATA_BOOTSTRAP_AI_PROVIDER=\"anthropic\"\nexport METADATA_BOOTSTRAP_ANTHROPIC_MODEL=\"claude-3-sonnet-20240229\"\nexport METADATA_BOOTSTRAP_CREATE_FK=\"all\"\nexport METADATA_BOOTSTRAP_CREATE_SHARED_KEYS=\"all\"\nexport METADATA_BOOTSTRAP_REBUILD_RELATIONSHIPS=\"true\"\n```\n\n## \ud83e\uddea Testing Multi-Provider Features\n\n```bash\n# Test all providers\npytest tests/test_multi_provider.py -v\n\n# Test feature control system\npytest tests/test_feature_flags.py -v\n\n# Test provider-specific optimizations\npytest tests/test_provider_optimization.py -v\n\n# Test configuration validation for all providers\npytest tests/test_provider_config.py -v\n\n# Test FK blacklist functionality\npytest tests/test_fk_blacklist.py -v\n\n# Run performance benchmarks across providers\npytest tests/benchmark_providers.py -v --benchmark-only\n```\n\n## \ud83e\udd1d Contributing\n\n### Multi-Provider Development Areas\n\n1. **Provider Integration**\n - Additional AI provider support (Claude-4, GPT-5, etc.)\n - Provider-specific optimization algorithms\n - Custom model fine-tuning support\n\n2. **Feature Control Enhancements**\n - Advanced processing pipelines\n - Conditional feature dependencies\n - Performance profiling per feature\n\n3. **Performance Optimization**\n - Provider-specific prompt engineering\n - Dynamic provider selection based on workload\n - Cost optimization strategies\n\n4. 
**Quality Assessment**\n - Provider-specific quality metrics\n - Cross-provider quality comparison\n - A/B testing frameworks\n\n5. **Relationship Blocking**\n - Visual rule builder for FK blacklists\n - Rule impact analysis and testing\n - Advanced pattern matching algorithms\n\n## \ud83d\udcc4 License\n\nThis project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.\n\n## \ud83c\udd98 Support\n\n- \ud83d\udcd6 [Documentation](https://github.com/hasura/ddn-metadata-bootstrap#readme)\n- \ud83d\udc1b [Bug Reports](https://github.com/hasura/ddn-metadata-bootstrap/issues)\n- \ud83d\udcac [Discussions](https://github.com/hasura/ddn-metadata-bootstrap/discussions)\n- \ud83e\udd16 [AI Provider Issues](https://github.com/hasura/ddn-metadata-bootstrap/issues?q=label%3Aai-provider)\n- \ud83c\udfaf [Feature Control Issues](https://github.com/hasura/ddn-metadata-bootstrap/issues?q=label%3Afeature-control)\n- \ud83e\udde0 [Caching Issues](https://github.com/hasura/ddn-metadata-bootstrap/issues?q=label%3Acaching)\n- \ud83d\udd0d [Quality Assessment Issues](https://github.com/hasura/ddn-metadata-bootstrap/issues?q=label%3Aquality)\n- \ud83d\udd17 [Relationship Blocking Issues](https://github.com/hasura/ddn-metadata-bootstrap/issues?q=label%3Arelationship-blocking)\n\n## \ud83c\udff7\ufe0f Version History\n\nSee [CHANGELOG.md](CHANGELOG.md) for complete version history and breaking changes.\n\n## \u2b50 Acknowledgments\n\n- Built for [Hasura DDN](https://hasura.io/ddn)\n- Powered by [Anthropic Claude](https://www.anthropic.com/), [OpenAI GPT](https://openai.com/), and [Google Gemini](https://deepmind.google/technologies/gemini/)\n- Linguistic analysis powered by [NLTK](https://www.nltk.org/) and [WordNet](https://wordnet.princeton.edu/)\n- Inspired by the GraphQL and OpenAPI communities\n- Caching algorithms inspired by database query optimization techniques\n- Relationship blocking patterns inspired by enterprise security frameworks\n\n---\n\nMade 
with \u2764\ufe0f by the Hasura team\n",
"bugtrack_url": null,
"license": "MIT",
"summary": "AI-powered metadata enhancement for Hasura DDN schema files",
"version": "1.0.16",
"project_urls": {
"Bug Reports": "https://github.com/hasura/ddn-metadata-bootstrap/issues",
"Changelog": "https://github.com/hasura/ddn-metadata-bootstrap/blob/main/CHANGELOG.md",
"Documentation": "https://github.com/hasura/ddn-metadata-bootstrap#readme",
"Homepage": "https://github.com/hasura/ddn-metadata-bootstrap",
"Repository": "https://github.com/hasura/ddn-metadata-bootstrap.git"
},
"split_keywords": [
"hasura",
"ddn",
"graphql",
"schema",
"metadata",
"ai",
"anthropic",
"descriptions",
"relationships"
],
"urls": [
{
"comment_text": null,
"digests": {
"blake2b_256": "09959f629ef85e5e8e8ed9a21172a79c694c01fa7875bd9ee1656d893ee9ff21",
"md5": "98629007dd7baba8b8597ddd17c232d4",
"sha256": "e0037df30327b30e60679cb78dc497164098e2f0f1ed924f24b390dd305d5efd"
},
"downloads": -1,
"filename": "ddn_metadata_bootstrap-1.0.16-py3-none-any.whl",
"has_sig": false,
"md5_digest": "98629007dd7baba8b8597ddd17c232d4",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.8",
"size": 149825,
"upload_time": "2025-07-20T22:46:30",
"upload_time_iso_8601": "2025-07-20T22:46:30.232175Z",
"url": "https://files.pythonhosted.org/packages/09/95/9f629ef85e5e8e8ed9a21172a79c694c01fa7875bd9ee1656d893ee9ff21/ddn_metadata_bootstrap-1.0.16-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "b99f081634512d2d22db02f5b15f458de90a674dc4a9ffe3be7e99d5de81e211",
"md5": "defedf26d186804361d8f6008645842e",
"sha256": "21556d02f7a990b2b1e53d3f30ea58db9495fc134a7d6596410030a32a96436a"
},
"downloads": -1,
"filename": "ddn_metadata_bootstrap-1.0.16.tar.gz",
"has_sig": false,
"md5_digest": "defedf26d186804361d8f6008645842e",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.8",
"size": 156455,
"upload_time": "2025-07-20T22:46:31",
"upload_time_iso_8601": "2025-07-20T22:46:31.660232Z",
"url": "https://files.pythonhosted.org/packages/b9/9f/081634512d2d22db02f5b15f458de90a674dc4a9ffe3be7e99d5de81e211/ddn_metadata_bootstrap-1.0.16.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2025-07-20 22:46:31",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "hasura",
"github_project": "ddn-metadata-bootstrap",
"github_not_found": true,
"lcname": "ddn-metadata-bootstrap"
}