syda


* Name: syda
* Version: 0.0.1
* Summary: A Python library for AI-powered synthetic data generation with referential integrity
* Upload time: 2025-08-12 02:24:09
* Requires Python: >=3.8
* License: LGPL-3.0-or-later
* Keywords: synthetic data ai machine learning data generation testing privacy sqlalchemy openai anthropic claude gpt
* Requirements: pydantic python-dotenv sqlalchemy pandas numpy networkx openai anthropic instructor python-magic python-docx openpyxl weasyprint pyyaml pytest boto3 azure-storage-blob pdfplumber pillow pytesseract sqlalchemy-utils mkdocs-material mkdocs mkdocs-macros-plugin

# Synthetic Data Generation Library

A Python-based open-source library for generating synthetic data with AI while preserving referential integrity. It works with OpenAI, Anthropic (Claude), and other AI models.

## Table of Contents

* [Features](#features)
* [Installation](#installation)
* [Quick Start](#quick-start)
* [Core API](#core-api)
  * [Structured Data Generation](#structured-data-generation)
  * [SQLAlchemy Model Integration](#sqlalchemy-model-integration)
  * [Handling Foreign Key Relationships](#handling-foreign-key-relationships)
  * [Multiple Schema Definition Formats](#multiple-schema-definition-formats)
    * [SQLAlchemy Models](#1-sqlalchemy-models)
    * [YAML Schema Files](#2-yaml-schema-files)
    * [JSON Schema Files](#3-json-schema-files)
    * [Dictionary-Based Schemas](#4-dictionary-based-schemas)
    * [Foreign Key Definition Methods](#foreign-key-definition-methods)
  * [Automatic Management of Multiple Related Models](#automatic-management-of-multiple-related-models)
    * [Using SQLAlchemy Models](#using-sqlalchemy-models)
    * [Using YAML Schema Files](#using-yaml-schema-files)
    * [Using JSON Schema Files](#using-json-schema-files)
    * [Using Dictionary-Based Schemas](#using-dictionary-based-schemas)
  * [Complete CRM Example](#complete-crm-example)
* [Metadata Enhancement Benefits with SQLAlchemy Models](#metadata-enhancement-benefits-with-sqlalchemy-models)
* [Custom Generators for Domain-Specific Data](#custom-generators-for-domain-specific-data)
* [Unstructured Document Generation](#unstructured-document-generation)
  * [Template-Based Document Generation](#template-based-document-generation)
  * [Template Schema Requirements](#template-schema-requirements)
  * [Supported Template Types](#supported-template-types)
* [Combined Structured and Unstructured Data](#combined-structured-and-unstructured-data)
  * [Connecting Documents to Structured Data](#connecting-documents-to-structured-data)
  * [Schema Dependencies for Documents](#schema-dependencies-for-documents)
  * [Custom Generators for Document Data](#custom-generators-for-document-data)
* [SQLAlchemy Models with Templates](#sqlalchemy-models-with-templates)
* [Model Selection and Configuration](#model-selection-and-configuration)
  * [Basic Configuration](#basic-configuration)
  * [Using Different Model Providers](#using-different-model-providers)
    * [OpenAI Models](#openai-models)
    * [Anthropic Claude Models](#anthropic-claude-models)
    * [Maximum Tokens Parameter](#maximum-tokens-parameter)
    * [Provider-Specific Optimizations](#provider-specific-optimizations)
  * [Advanced: Direct Access to LLM Client](#advanced-direct-access-to-llm-client)
* [Output Options](#output-options)
* [Configuration and Error Handling](#configuration-and-error-handling)
  * [API Keys Management](#api-keys-management)
    * [Environment Variables (Recommended)](#1-environment-variables-recommended)
    * [Direct Initialization](#2-direct-initialization)
  * [Error Handling](#error-handling)
* [Contributing](#contributing)
* [License](#license)

## Features

* **Multi-Provider AI Integration**:

  * Seamless integration with multiple AI providers
  * Support for OpenAI (GPT) and Anthropic (Claude)
  * Default model: Anthropic's `claude-3-5-haiku-20241022`
  * Consistent interface across different providers
  * Provider-specific parameter optimization

* **LLM-based Data Generation**:

  * AI-powered schema understanding and data creation
  * Contextually-aware synthetic records
  * Natural language prompt customization
  * Intelligent schema inference

* **SQLAlchemy Integration**:

  * Automatic extraction of model metadata, docstrings and constraints
  * Intelligent column-specific data generation
  * Parameter naming consistency with `sqlalchemy_models`
  
* **Multiple Schema Formats**:

  * SQLAlchemy model integration with automatic metadata extraction
  * YAML/JSON schema file support with full foreign key relationship handling
  * Python dictionary-based schema definitions
  
* **Referential Integrity**:

  * Automatic foreign key detection and resolution
  * Multi-model dependency analysis through topological sorting
  * Robust handling of related data with referential constraints
  
* **Custom Generators**:

  * Register column- or type-specific functions for domain-specific data
  * Contextual generators that adapt to other fields (like ICD-10 codes based on demographics)
  * Weighted distributions for realistic data patterns
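
As an illustration of a contextual generator, the sketch below picks an ICD-10 code from an age-dependent pool. It follows the `(row, col_name)` callable shape used for custom generators elsewhere in this README; the `age` field, the code pools, and the assumption that `row` supports dict-style access are illustrative and not part of the library.

```python
import random

# Illustrative pools of real ICD-10 codes. The age split is an assumption
# made for this sketch, not clinical guidance.
PEDIATRIC_CODES = ["J06.9", "J45.909"]  # upper respiratory infection, asthma
ADULT_CODES = ["I10", "E11.9"]          # hypertension, type 2 diabetes


def icd10_by_age(row, col_name, rng=random):
    """Return a diagnosis code that depends on the row's (assumed) `age` field."""
    age = row.get("age")
    if age is None:
        # No context available: fall back to the combined pool.
        return rng.choice(PEDIATRIC_CODES + ADULT_CODES)
    pool = PEDIATRIC_CODES if age < 18 else ADULT_CODES
    return rng.choice(pool)


print(icd10_by_age({"age": 9}, "diagnosis_code"))
print(icd10_by_age({"age": 64}, "diagnosis_code"))
```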


## Installation

Install the package using pip:

```bash
pip install syda
```

## Quick Start

```python
from syda.structured import SyntheticDataGenerator
from syda.schemas import ModelConfig

model_config = ModelConfig(
    provider="anthropic",
    model_name="claude-3-5-haiku-20241022",
    temperature=0.7,
    max_tokens=8192  # Larger value for more complete responses
)

generator = SyntheticDataGenerator(model_config=model_config)

# Define schema for a single table
schemas = {
    'Patient': {
        'patient_id': 'number',
        'diagnosis_code': 'icd10_code',
        'email': 'email',
        'visit_date': 'date',
        'notes': 'text'
    }
}

prompt = "Generate realistic synthetic patient records with ICD-10 diagnosis codes, emails, visit dates, and clinical notes."

# Generate and save to CSV
results = generator.generate_for_schemas(
    schemas=schemas,
    prompts={'Patient': prompt},
    sample_sizes={'Patient': 15},
    output_dir='synthetic_output'
)
print("Data saved to synthetic_output/Patient.csv")
```

## Core API

### Structured Data Generation

Use simple schema maps or SQLAlchemy models to generate data:

```python
from syda.structured import SyntheticDataGenerator
from syda.schemas import ModelConfig

model_config = ModelConfig(provider='anthropic', model_name='claude-3-5-haiku-20241022')
generator = SyntheticDataGenerator(model_config=model_config)

# Simple dict schema
schemas = {
    'User': {'id': 'number', 'name': 'text'}
}
results = generator.generate_for_schemas(
    schemas=schemas,
    prompts={'User': 'Generate user records'},
    sample_sizes={'User': 10}
)
```

### SQLAlchemy Model Integration

Pass declarative models directly—docstrings and column metadata inform the prompt:

```python
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy import Column, Integer, String
from syda.structured import SyntheticDataGenerator
from syda.schemas import ModelConfig

Base = declarative_base()
class User(Base):
    __tablename__ = 'users'
    id = Column(Integer, primary_key=True)
    name = Column(String, comment="Full name of the user")

model_config = ModelConfig(provider='anthropic', model_name='claude-3-5-haiku-20241022')
generator = SyntheticDataGenerator(model_config=model_config)
results = generator.generate_for_sqlalchemy_models(
    sqlalchemy_models=[User], 
    prompts={'users': 'Generate users'}, 
    sample_sizes={'users': 5}
)
```

> **Important:** SQLAlchemy models **must** have either `__table__` or `__tablename__` specified. Without one of these attributes, the model cannot be properly processed by the system. The `__tablename__` attribute defines the name of the database table and is used as the schema name when generating data. For example, a model with `__tablename__ = 'users'` will be referenced as 'users' in prompts, sample_sizes, custom generators and the returned results dictionary.
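
Because dictionary keys must match table names rather than class names, a quick pre-flight check can catch mismatches such as `'User'` versus `'users'`. The helpers below are a hypothetical sketch and not part of syda; they read `__tablename__`, fall back to `__table__.name`, and use a plain stand-in class so the snippet runs without SQLAlchemy.

```python
def schema_name(model):
    """Return the table name a model would be keyed by (assumed lookup order)."""
    name = getattr(model, "__tablename__", None)
    if name is None:
        table = getattr(model, "__table__", None)
        name = getattr(table, "name", None)
    if name is None:
        raise ValueError(f"{model.__name__} defines neither __tablename__ nor __table__")
    return name


def unknown_keys(models, *mappings):
    """List keys in prompts/sample_sizes that match no model's table name."""
    valid = {schema_name(m) for m in models}
    return sorted({key for mapping in mappings for key in mapping} - valid)


class User:  # stand-in for a declarative model
    __tablename__ = "users"


print(unknown_keys([User], {"User": "Generate users"}, {"users": 5}))  # ['User']
```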


### Handling Foreign Key Relationships

The library provides robust support for handling foreign key relationships with referential integrity:

1. **Automatic Foreign Key Detection**: Foreign keys are automatically detected from your YAML, JSON, dictionary, or SQLAlchemy schemas and assigned the type `'foreign_key'`.
2. **Manual Column-Specific Foreign Key Generators**: You can also register a foreign key generator for a specific column manually:

```python
import random

# After generating departments and loading them into departments_df:
def department_id_fk_generator(row, col_name):
    # Sample from the IDs that actually exist in the parent table
    return random.choice(departments_df['id'].tolist())
generator.register_generator('foreign_key', department_id_fk_generator, column_name='department_id')
```

3. **Multi-Step Generation Process**: For related tables, generate parent records first, then use their IDs when generating child records:

```python
# Generate departments first, then employees with valid department_id references
results = generator.generate_for_sqlalchemy_models(
    sqlalchemy_models=[Department, Employee],
    prompts={
        'departments': 'Generate company departments',
        'employees': 'Generate realistic employee data'
    },
    sample_sizes={
        'departments': 5,
        'employees': 10
    }
)

# Access the generated dataframes
departments_df = results['departments']
employees_df = results['employees']
```

4. **Referential Integrity Preservation**: The foreign key generator samples from actual existing IDs in the parent table, ensuring all references are valid.
5. **Metadata-Enhanced Foreign Keys**: Column comments on foreign key fields are preserved and included in the prompt, helping the LLM understand the relationship context.
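
The sampling idea in point 4 can be sketched with plain lists of dictionaries. This is a minimal illustration under stated assumptions, not the library's implementation: `fill_foreign_key` and its parameters are hypothetical names, and rows are plain dicts rather than DataFrames.

```python
import random


def fill_foreign_key(child_rows, parent_rows, fk_column, parent_key="id", seed=None):
    """Assign each child row a value drawn from the parent's existing keys."""
    parent_ids = [row[parent_key] for row in parent_rows]
    if not parent_ids:
        raise ValueError("parent table has no rows to reference")
    rng = random.Random(seed)
    # Copy each row so the input list is left untouched.
    return [{**row, fk_column: rng.choice(parent_ids)} for row in child_rows]


departments = [{"id": 1, "name": "Engineering"}, {"id": 2, "name": "Sales"}]
employees = [{"id": n, "name": f"Employee {n}"} for n in range(1, 6)]

employees = fill_foreign_key(employees, departments, "department_id", seed=42)
# Every reference points at a department that actually exists.
assert {e["department_id"] for e in employees} <= {d["id"] for d in departments}
```

Because values are only ever drawn from `parent_ids`, a dangling reference cannot occur, which is why parent tables have to be generated first.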


### Multiple Schema Definition Formats


> **Note:** For detailed information on supported field types and schema format, see the [Schema Reference](schema_reference.md) document.


Syda supports defining your data models in multiple formats, all leading to the same synthetic data generation capabilities. Choose the format that best suits your workflow:

#### 1. SQLAlchemy Models

```python
from sqlalchemy import Column, Integer, String, ForeignKey, Float, Date
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()

class Customer(Base):
    __tablename__ = 'customers'
    __doc__ = """Customer organization that places orders"""
    
    id = Column(Integer, primary_key=True)
    name = Column(String(100), nullable=False, comment="Company name")
    status = Column(String(20), comment="Customer status (Active/Inactive/Prospect)")

class Order(Base):
    __tablename__ = 'orders'
    __doc__ = """Customer order for products or services"""
    
    id = Column(Integer, primary_key=True)
    customer_id = Column(Integer, ForeignKey('customers.id'), nullable=False)
    order_date = Column(Date, nullable=False, comment="Date when order was placed")
    total_amount = Column(Float, comment="Total monetary value of the order in USD")

# Generate data from SQLAlchemy models
results = generator.generate_for_sqlalchemy_models(
    sqlalchemy_models=[Customer, Order],
    prompts={"customers": "Generate tech companies"},
    sample_sizes={"customers": 10, "orders": 30}
)
```

#### 2. YAML Schema Files

```yaml
# customer.yaml
__table_description__: Customer organization that places orders
id:
  type: number
  primary_key: true
name:
  type: text
  max_length: 100
  not_null: true
  description: Company name
status:
  type: text
  max_length: 20
  description: Customer status (Active/Inactive/Prospect)
```

```yaml
# order.yaml
__table_description__: Customer order for products or services
__foreign_keys__:
  customer_id: [Customer, id]
id:
  type: number
  primary_key: true
customer_id:
  type: foreign_key
  not_null: true
  description: Reference to the customer who placed the order
order_date:
  type: date
  not_null: true
  description: Date when order was placed
total_amount:
  type: number
  description: Total monetary value of the order in USD
```

```python
# Generate data from YAML schema files
results = generator.generate_for_schemas(
    schemas={
        'Customer': 'schemas/customer.yaml',
        'Order': 'schemas/order.yaml'
    },
    prompts={'Customer': 'Generate tech companies'},
    sample_sizes={'Customer': 10, 'Order': 30}
)
```

#### 3. JSON Schema Files

```json
// customer.json
{
  "__table_description__": "Customer organization that places orders",
  "id": {
    "type": "number",
    "primary_key": true
  },
  "name": {
    "type": "text",
    "max_length": 100,
    "not_null": true,
    "description": "Company name"
  },
  "status": {
    "type": "text",
    "max_length": 20,
    "description": "Customer status (Active/Inactive/Prospect)"
  }
}
```

```json
// order.json
{
  "__table_description__": "Customer order for products or services",
  "__foreign_keys__": {
    "customer_id": ["Customer", "id"]
  },
  "id": {
    "type": "number",
    "primary_key": true
  },
  "customer_id": {
    "type": "foreign_key",
    "not_null": true,
    "description": "Reference to the customer who placed the order"
  },
  "order_date": {
    "type": "date",
    "not_null": true,
    "description": "Date when order was placed"
  },
  "total_amount": {
    "type": "number",
    "description": "Total monetary value of the order in USD"
  }
}
```

```python
# Generate data from JSON schema files
results = generator.generate_for_schemas(
    schemas={
        'Customer': 'schemas/customer.json',
        'Order': 'schemas/order.json'
    },
    prompts={'Customer': 'Generate tech companies'},
    sample_sizes={'Customer': 10, 'Order': 30}
)
```
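
Before handing schema files to the generator, it can help to parse them and sanity-check the foreign key section. The sketch below is hypothetical (the `check_schema` helper is not part of syda) and uses an inline string in place of a file. Note that standard JSON has no comment syntax, so a `// order.json` label line would need to be left out of a real file for `json.loads` to accept it.

```python
import json

# Inline stand-in for the contents of schemas/order.json (abbreviated).
ORDER_SCHEMA = """
{
  "__table_description__": "Customer order for products or services",
  "__foreign_keys__": {"customer_id": ["Customer", "id"]},
  "id": {"type": "number", "primary_key": true},
  "customer_id": {"type": "foreign_key", "not_null": true},
  "order_date": {"type": "date", "not_null": true}
}
"""


def check_schema(schema, known_schemas):
    """Return a list of problems found in one parsed schema dictionary."""
    problems = []
    fields = {key for key in schema if not key.startswith("__")}
    for column, (parent, _parent_field) in schema.get("__foreign_keys__", {}).items():
        if column not in fields:
            problems.append(f"foreign key column '{column}' is not declared as a field")
        if parent not in known_schemas:
            problems.append(f"'{column}' references unknown schema '{parent}'")
    return problems


order = json.loads(ORDER_SCHEMA)
print(check_schema(order, known_schemas={"Customer", "Order"}))  # []
print(check_schema(order, known_schemas={"Order"}))
```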

#### 4. Dictionary-Based Schemas

```python
# Define schemas directly as dictionaries
schemas = {
    'Customer': {
        '__table_description__': 'Customer organization that places orders',
        'id': {'type': 'number', 'primary_key': True},
        'name': {
            'type': 'text',
            'max_length': 100,
            'not_null': True,
            'description': 'Company name'
        },
        'status': {
            'type': 'text',
            'max_length': 20,
            'description': 'Customer status (Active/Inactive/Prospect)'
        }
    },
    'Order': {
        '__table_description__': 'Customer order for products or services',
        '__foreign_keys__': {
            'customer_id': ['Customer', 'id']
        },
        'id': {'type': 'number', 'primary_key': True},
        'customer_id': {
            'type': 'foreign_key',
            'not_null': True,
            'description': 'Reference to the customer who placed the order'
        },
        'order_date': {
            'type': 'date',
            'not_null': True,
            'description': 'Date when order was placed'
        },
        'total_amount': {
            'type': 'number',
            'description': 'Total monetary value of the order in USD'
        }
    }
}

# Generate data from dictionary schemas
results = generator.generate_for_schemas(
    schemas=schemas,
    prompts={'Customer': 'Generate tech companies'},
    sample_sizes={'Customer': 10, 'Order': 30}
)
```

#### Foreign Key Definition Methods

There are three ways to define foreign key relationships:

1. Using the `__foreign_keys__` special section in a schema:
   ```python
   "__foreign_keys__": {
       "customer_id": ["Customer", "id"]
   }
   ```

2. Using field-level references with the `type` and `references` properties:
   ```python
   "order_id": {
       "type": "foreign_key",
       "references": {
           "schema": "Order",
           "field": "id"
       }
   }
   ```

3. Using type-based detection with naming conventions:
   ```python
   "customer_id": "foreign_key"
   ```
   (The system will attempt to infer the relationship based on naming conventions)
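
To make the three styles concrete, the sketch below normalizes all of them into `(parent_schema, parent_field)` pairs. It is an illustration only: `resolve_foreign_keys` is a hypothetical helper, and the naming-convention rule used for the third style (strip a trailing `_id`, then match schema names case-insensitively) is an assumption, since the library's actual inference rule is not documented here.

```python
def resolve_foreign_keys(schema, known_schemas):
    """Map each foreign key column to a (parent_schema, parent_field) pair."""
    resolved = {}

    # Style 1: explicit __foreign_keys__ section.
    for column, (parent, field) in schema.get("__foreign_keys__", {}).items():
        resolved[column] = (parent, field)

    for column, spec in schema.items():
        if column.startswith("__") or column in resolved:
            continue
        field_type = spec.get("type") if isinstance(spec, dict) else spec
        if field_type != "foreign_key":
            continue
        if isinstance(spec, dict) and "references" in spec:
            # Style 2: field-level `references` block.
            ref = spec["references"]
            resolved[column] = (ref["schema"], ref["field"])
        else:
            # Style 3: guess the parent from the column name (assumed rule).
            stem = column[:-3] if column.endswith("_id") else column
            stem = stem.replace("_", "").lower()
            for name in known_schemas:
                if name.lower() == stem:
                    resolved[column] = (name, "id")
                    break
    return resolved


order_item = {
    "__foreign_keys__": {"product_id": ["Product", "id"]},
    "product_id": {"type": "foreign_key"},
    "order_id": {"type": "foreign_key", "references": {"schema": "Order", "field": "id"}},
    "customer_id": "foreign_key",
    "quantity": "number",
}
print(resolve_foreign_keys(order_item, ["Customer", "Order", "Product"]))
```

Explicit declarations take precedence in this sketch, so a column listed in `__foreign_keys__` is never re-inferred from its name.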

### Automatic Management of Multiple Related Models

#### Using SQLAlchemy Models

Simplify multi-table workflows with `generate_for_sqlalchemy_models`:

```python
from sqlalchemy import Column, Integer, String, Float, ForeignKey, Date, Text
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import relationship
from datetime import datetime, timedelta
import random
from syda.structured import SyntheticDataGenerator

Base = declarative_base()

# Customer model
class Customer(Base):
    __tablename__ = 'customers'
    
    id = Column(Integer, primary_key=True)
    name = Column(String(100), nullable=False)
    industry = Column(String(50))
    status = Column(String(20))
    contacts = relationship("Contact", back_populates="customer")
    orders = relationship("Order", back_populates="customer")

# Contact model with foreign key to Customer
class Contact(Base):
    __tablename__ = 'contacts'
    
    id = Column(Integer, primary_key=True)
    customer_id = Column(Integer, ForeignKey('customers.id'), nullable=False)
    name = Column(String(100), nullable=False)
    email = Column(String(120), nullable=False)
    phone = Column(String(20))
    customer = relationship("Customer", back_populates="contacts")

# Product model
class Product(Base):
    __tablename__ = 'products'
    
    id = Column(Integer, primary_key=True)
    name = Column(String(100), nullable=False)
    category = Column(String(50))  # targeted by the "category" custom generator below
    description = Column(Text)
    price = Column(Float, nullable=False)
    order_items = relationship("OrderItem", back_populates="product")

# Order model with foreign key to Customer
class Order(Base):
    __tablename__ = 'orders'
    
    id = Column(Integer, primary_key=True)
    customer_id = Column(Integer, ForeignKey('customers.id'), nullable=False)
    order_date = Column(Date, nullable=False)
    total_amount = Column(Float)
    customer = relationship("Customer", back_populates="orders")
    order_items = relationship("OrderItem", back_populates="order")

# OrderItem model with foreign keys to Order and Product
class OrderItem(Base):
    __tablename__ = 'order_items'
    
    id = Column(Integer, primary_key=True)
    order_id = Column(Integer, ForeignKey('orders.id'), nullable=False)
    product_id = Column(Integer, ForeignKey('products.id'), nullable=False)
    quantity = Column(Integer, nullable=False)
    price = Column(Float, nullable=False)
    order = relationship("Order", back_populates="order_items")
    product = relationship("Product", back_populates="order_items")

# Initialize generator
generator = SyntheticDataGenerator()

# Generate data for all models in one call
results = generator.generate_for_sqlalchemy_models(
    sqlalchemy_models=[Customer, Contact, Product, Order, OrderItem],
    prompts={
        "customers": "Generate diverse customer organizations for a B2B SaaS company.",
        "products": "Generate cloud software products and services."
    },
    sample_sizes={
        "customers": 10,
        "contacts": 25,
        "products": 15,
        "orders": 30,
        "order_items": 60
    },
    custom_generators={
        "customers": {
            # Ensure a specific distribution of customer statuses for business reporting
            "status": lambda row, col: random.choice(["Active", "Inactive", "Prospect"]),
        },
        "products": {
            # Ensure product categories match your specific business domains
            "category": lambda row, col: random.choice([
                "Cloud Infrastructure", "Business Intelligence", "Security Services",
                "Data Analytics", "Custom Development", "Support Package", "API Services"
            ])
        },
    }
)
```
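
The `status` lambda above draws uniformly from its three values. When a specific distribution matters, a weighted variant is one option. The sketch below keeps the same `(row, col)` callable shape; `weighted_choice_generator` and the weights shown are hypothetical and chosen only for illustration.

```python
import random
from collections import Counter


def weighted_choice_generator(weights, seed=None):
    """Build a `(row, col_name)` generator that samples values by weight."""
    values = list(weights)
    probabilities = list(weights.values())
    rng = random.Random(seed)

    def generate(row, col_name):
        # random.choices returns a list, so take the single sampled element.
        return rng.choices(values, weights=probabilities, k=1)[0]

    return generate


status_generator = weighted_choice_generator(
    {"Active": 0.7, "Inactive": 0.2, "Prospect": 0.1}, seed=7
)
counts = Counter(status_generator({}, "status") for _ in range(1000))
print(counts.most_common())
```

A generator built this way could be used in place of the uniform `random.choice` lambda for `status` in the `custom_generators` mapping.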

#### Using YAML Schema Files

The same relationship management is available with YAML schemas:

```yaml
# customer.yaml
__table_name__: customers
__description__: Customer organizations

id:
  type: integer
  constraints:
    primary_key: true
    not_null: true

name:
  type: string
  constraints:
    not_null: true
    max_length: 100

industry:
  type: string
  constraints:
    max_length: 50

status:
  type: string
  constraints:
    max_length: 20
```

```yaml
# contact.yaml
__table_name__: contacts
__description__: Customer contacts
__foreign_keys__:
  customer_id: [customers, id]

id:
  type: integer
  constraints:
    primary_key: true
    not_null: true

customer_id:
  type: integer
  constraints:
    not_null: true

name:
  type: string
  constraints:
    not_null: true
    max_length: 100

email:
  type: string
  constraints:
    not_null: true
    max_length: 120

phone:
  type: string
  constraints:
    max_length: 20
```

```yaml
# order.yaml
__table_name__: orders
__description__: Customer orders
__foreign_keys__:
  customer_id: [customers, id]

id:
  type: integer
  constraints:
    primary_key: true
    not_null: true

customer_id:
  type: integer
  constraints:
    not_null: true

order_date:
  type: string
  format: date
  constraints:
    not_null: true

total_amount:
  type: number
  format: float
```

```python
# Generate data for multiple related tables with YAML schemas
results = generator.generate_for_schemas(
    schemas={
        'Customer': 'schemas/customer.yaml',
        'Contact': 'schemas/contact.yaml',
        'Product': 'schemas/product.yaml',
        'Order': 'schemas/order.yaml',
        'OrderItem': 'schemas/order_item.yaml'
    },
    prompts={
        "Customer": "Generate diverse customer organizations for a B2B SaaS company.",
        "Product": "Generate cloud software products and services."
    },
    sample_sizes={
        "Customer": 10,
        "Contact": 20,
        "Product": 15,
        "Order": 30,
        "OrderItem": 60
    }
)
```

#### Using JSON Schema Files

JSON schema files offer the same capabilities:

```json
// customer.json
{
  "__table_name__": "customers",
  "__description__": "Customer organizations",
  "id": {
    "type": "integer",
    "constraints": {
      "primary_key": true,
      "not_null": true
    }
  },
  "name": {
    "type": "string",
    "constraints": {
      "not_null": true,
      "max_length": 100
    }
  },
  "industry": {
    "type": "string",
    "constraints": {
      "max_length": 50
    }
  },
  "status": {
    "type": "string",
    "constraints": {
      "max_length": 20
    }
  }
}
```

```json
// contact.json
{
  "__table_name__": "contacts",
  "__description__": "Customer contacts",
  "__foreign_keys__": {
    "customer_id": ["customers", "id"]
  },
  "id": {
    "type": "integer",
    "constraints": {
      "primary_key": true,
      "not_null": true
    }
  },
  "customer_id": {
    "type": "integer",
    "constraints": {
      "not_null": true
    }
  },
  "name": {
    "type": "string",
    "constraints": {
      "not_null": true,
      "max_length": 100
    }
  },
  "email": {
    "type": "string",
    "constraints": {
      "not_null": true,
      "max_length": 120
    }
  },
  "phone": {
    "type": "string",
    "constraints": {
      "max_length": 20
    }
  }
}
```

```json
// order.json
{
  "__table_name__": "orders",
  "__description__": "Customer orders",
  "__foreign_keys__": {
    "customer_id": ["customers", "id"]
  },
  "id": {
    "type": "integer",
    "constraints": {
      "primary_key": true,
      "not_null": true
    }
  },
  "customer_id": {
    "type": "integer",
    "constraints": {
      "not_null": true
    }
  },
  "order_date": {
    "type": "string",
    "format": "date",
    "constraints": {
      "not_null": true
    }
  },
  "total_amount": {
    "type": "number",
    "format": "float"
  }
}
```

```python
# Generate data for multiple related tables with JSON schemas
results = generator.generate_for_schemas(
    schemas={
        'Customer': 'schemas/customer.json',
        'Contact': 'schemas/contact.json',
        'Product': 'schemas/product.json',
        'Order': 'schemas/order.json',
        'OrderItem': 'schemas/order_item.json'
    },
    prompts={
        "Customer": "Generate diverse customer organizations for a B2B SaaS company.",
        "Product": "Generate cloud software products and services."
    },
    sample_sizes={
        "Customer": 10,
        "Contact": 20,
        "Product": 15,
        "Order": 30,
        "OrderItem": 60
    }
)
```

#### Using Dictionary-Based Schemas

Similar relationship management works with dictionary schemas:

```python
# Define schemas as Python dictionaries
schemas = {
    'Customer': {
        '__table_name__': 'customers',
        '__description__': 'Customer organizations',
        'id': {
            'type': 'integer',
            'constraints': {'primary_key': True, 'not_null': True}
        },
        'name': {
            'type': 'string',
            'constraints': {'not_null': True, 'max_length': 100}
        },
        'industry': {
            'type': 'string',
            'constraints': {'max_length': 50}
        },
        'status': {
            'type': 'string',
            'constraints': {'max_length': 20}
        }
    },
    'Contact': {
        '__table_name__': 'contacts',
        '__description__': 'Customer contacts',
        '__foreign_keys__': {
            'customer_id': ['customers', 'id']
        },
        'id': {
            'type': 'integer',
            'constraints': {'primary_key': True, 'not_null': True}
        },
        'customer_id': {
            'type': 'integer',
            'constraints': {'not_null': True}
        },
        'name': {
            'type': 'string',
            'constraints': {'not_null': True, 'max_length': 100}
        },
        'email': {
            'type': 'string',
            'constraints': {'not_null': True, 'max_length': 120}
        },
        'phone': {
            'type': 'string',
            'constraints': {'max_length': 20}
        }
    },
    'Order': {
        '__table_name__': 'orders',
        '__description__': 'Customer orders',
        '__foreign_keys__': {
            'customer_id': ['customers', 'id']
        },
        'id': {
            'type': 'integer',
            'constraints': {'primary_key': True, 'not_null': True}
        },
        'customer_id': {
            'type': 'integer',
            'constraints': {'not_null': True}
        },
        'order_date': {
            'type': 'string',
            'format': 'date',
            'constraints': {'not_null': True}
        },
        'total_amount': {
            'type': 'number',
            'format': 'float'
        }
    }
}

# Generate data for dictionary schemas
results = generator.generate_for_schemas(
    schemas=schemas,
    prompts={
        'Customer': 'Generate diverse customer organizations for a B2B SaaS company.'
    },
    sample_sizes={
        'Customer': 10,
        'Contact': 20,
        'Order': 30
    }
)
```

In all cases, the generator will:
1. Analyze relationships between models/schemas
2. Determine the correct generation order using topological sorting
3. Generate parent tables first
4. Use existing primary keys when populating foreign keys in child tables
5. Maintain referential integrity across the entire dataset
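
Steps 1 to 3 can be sketched with the standard library's `graphlib` module (Python 3.9+). This is an illustration of the ordering idea under an assumed input shape, not the library's own code, which may be implemented differently; `generation_order` is a hypothetical name.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def generation_order(foreign_keys):
    """Order schemas so that every parent precedes the schemas that reference it.

    `foreign_keys` maps schema name -> {column: (parent_schema, parent_field)}.
    """
    graph = {
        schema: {parent for parent, _ in columns.values() if parent != schema}
        for schema, columns in foreign_keys.items()
    }
    # static_order() raises graphlib.CycleError on circular references.
    return list(TopologicalSorter(graph).static_order())


crm = {
    "Customer": {},
    "Product": {},
    "Contact": {"customer_id": ("Customer", "id")},
    "Order": {"customer_id": ("Customer", "id")},
    "OrderItem": {"order_id": ("Order", "id"), "product_id": ("Product", "id")},
}
order = generation_order(crm)
print(order)
```

Any order in which `Customer` precedes `Contact` and `Order`, and both `Order` and `Product` precede `OrderItem`, is valid; the exact sequence among independent tables is not significant.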


### Complete CRM Example

Here’s a comprehensive example demonstrating `generate_for_sqlalchemy_models` across five interrelated models, including entity definitions, prompt setup, and data verification:

```python
#!/usr/bin/env python
import random
import datetime
from sqlalchemy import Column, Integer, String, ForeignKey, Float, Date, Boolean, Text
from sqlalchemy.orm import declarative_base, relationship
from syda.structured import SyntheticDataGenerator

Base = declarative_base()

class Customer(Base):
    __tablename__ = 'customers'
    id = Column(Integer, primary_key=True)
    name = Column(String(100), unique=True, comment="Customer organization name")
    industry = Column(String(50), comment="Customer's primary industry")
    website = Column(String(100), comment="Customer's website URL")
    status = Column(String(20), comment="Active, Inactive, Prospect")
    created_at = Column(Date, default=datetime.date.today, comment="Date when added to CRM")
    contacts = relationship("Contact", back_populates="customer")
    orders = relationship("Order", back_populates="customer")

class Contact(Base):
    __tablename__ = 'contacts'
    id = Column(Integer, primary_key=True)
    customer_id = Column(Integer, ForeignKey('customers.id'), comment="Customer this contact belongs to")
    first_name = Column(String(50), comment="Contact's first name")
    last_name = Column(String(50), comment="Contact's last name")
    email = Column(String(100), unique=True, comment="Contact's email address")
    phone = Column(String(20), comment="Contact's phone number")
    position = Column(String(100), comment="Job title or position")
    is_primary = Column(Boolean, default=False, comment="Primary contact flag")
    customer = relationship("Customer", back_populates="contacts")

class Product(Base):
    __tablename__ = 'products'
    id = Column(Integer, primary_key=True)
    name = Column(String(100), unique=True, comment="Product name")
    category = Column(String(50), comment="Product category")
    price = Column(Float, comment="Product price in USD")
    description = Column(Text, comment="Product description")
    order_items = relationship("OrderItem", back_populates="product")

class Order(Base):
    __tablename__ = 'orders'
    id = Column(Integer, primary_key=True)
    customer_id = Column(Integer, ForeignKey('customers.id'), comment="Customer who placed the order")
    order_date = Column(Date, comment="Date when order was placed")
    status = Column(String(20), comment="Order status: New, Processing, Shipped, Delivered, Cancelled")
    total_amount = Column(Float, comment="Total amount in USD")
    customer = relationship("Customer", back_populates="orders")
    items = relationship("OrderItem", back_populates="order")

class OrderItem(Base):
    __tablename__ = 'order_items'
    id = Column(Integer, primary_key=True)
    order_id = Column(Integer, ForeignKey('orders.id'), comment="Order this item belongs to")
    product_id = Column(Integer, ForeignKey('products.id'), comment="Product in the order")
    quantity = Column(Integer, comment="Quantity ordered")
    unit_price = Column(Float, comment="Unit price at order time")
    order = relationship("Order", back_populates="items")
    product = relationship("Product", back_populates="order_items")


def main():
    generator = SyntheticDataGenerator(model='gpt-4')
    output_dir = 'crm_data'
    prompts = {
        "customers": "Generate diverse customer organizations for a B2B SaaS company.",
        "products": "Generate products for a cloud software company.",
        "orders": "Generate realistic orders with appropriate dates and statuses."
    }
    sample_sizes = {"customers": 10, "contacts": 25, "products": 15, "orders": 30, "order_items": 60}

    results = generator.generate_for_sqlalchemy_models(
        sqlalchemy_models=[Customer, Contact, Product, Order, OrderItem],
        prompts=prompts,
        sample_sizes=sample_sizes,
        output_dir=output_dir
    )

    # Referential integrity checks (results are keyed by table name)
    print("\n🔍 Verifying referential integrity:")
    if set(results['contacts']['customer_id']).issubset(set(results['customers']['id'])):
        print("  ✅ All Contact.customer_id values are valid.")
    if set(results['order_items']['product_id']).issubset(set(results['products']['id'])):
        print("  ✅ All OrderItem.product_id values are valid.")


if __name__ == '__main__':
    main()
```
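
The checks in `main()` cover only two of the foreign keys. A minimal sketch of a reusable check is shown below. It is written against plain lists of dicts so it runs without Syda, and the helper name `find_orphans` and the sample rows are hypothetical, not part of the library:

```python
def find_orphans(child_rows, fk_column, parent_rows, pk_column="id"):
    """Return the foreign key values in child_rows that have no matching parent."""
    parent_keys = {row[pk_column] for row in parent_rows}
    return sorted({row[fk_column] for row in child_rows} - parent_keys)


# Hypothetical sample rows standing in for generated data
customers = [{"id": 1}, {"id": 2}]
contacts = [{"id": 10, "customer_id": 1}, {"id": 11, "customer_id": 3}]

print(find_orphans(contacts, "customer_id", customers))  # → [3]
```

With pandas DataFrames, `df.to_dict("records")` produces rows in the shape this helper expects.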

## Metadata Enhancement Benefits with SQLAlchemy Models

* **Richer Context**: Leverages docstrings, comments, and column constraints to enrich prompts.
* **Simpler Prompts**: Less manual specification; model infers details.
* **Constraint Awareness**: Respects `nullable`, `unique`, and length constraints.
* **Custom Generators**: Column-level functions for fine-tuned data.
* **Automatic Docstring Utilization**: Embeds business context from model definitions.


## Unstructured Document Generation

SYDA can generate realistic unstructured documents such as PDF reports, letters, and forms based on templates. This is useful for applications that require document generation with synthetic data.

For complete examples, see the [examples/unstructured_only](examples/unstructured_only) directory, which includes healthcare document generation samples.

### Template-Based Document Generation

Create template-based document schemas by specifying template fields in your schema:

```python
from syda.generate import SyntheticDataGenerator
from syda.schemas import ModelConfig

# Initialize generator 
config = ModelConfig(provider="anthropic", model_name="claude-3-5-haiku-20241022")
generator = SyntheticDataGenerator(model_config=config)

# Define template-based schemas
schemas = {
    'MedicalReport': 'schemas/medical_report.yml',
    'LabResult': 'schemas/lab_result.yml'
}
```

Here's an example of a medical report template schema:

```yaml
# Medical report template schema (medical_report.yml)
__template__: true
__description__: Medical report template for patient visits
__name__: MedicalReport
__foreign_keys__: {}
__template_source__: templates/medical_report_template.html
__input_file_type__: html
__output_file_type__: pdf

# Patient information
patient_id:
  type: string
  format: uuid

patient_name:
  type: string

date_of_birth:
  type: string
  format: date

visit_date:
  type: string
  format: date-time

chief_complaint:
  type: string

medical_history:
  type: string

# Vital signs
blood_pressure:
  type: string

heart_rate:
  type: integer

respiratory_rate:
  type: integer

temperature:
  type: number

oxygen_saturation:
  type: integer

# Clinical information
assessment:
  type: string

```

Generate the data and PDF documents from these schemas:

```python
# Generate data and PDF documents
results = generator.generate_for_schemas(
    schemas=schemas,
    sample_sizes={
        'MedicalReport': 5,
        'LabResult': 5
    },
    prompts={
        'MedicalReport': 'Generate synthetic medical reports for patients',
        'LabResult': 'Generate synthetic laboratory test results for patients'
    },
    output_dir="output"
)
```

### Template Schema Requirements

Template-based schemas must include these special fields:

```yaml
__template__: true
__template_source__: /path/to/template.html
__input_file_type__: html
__output_file_type__: pdf
```
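
It can help to check a schema for these fields before starting a generation run. The sketch below assumes the schema has already been loaded into a Python dict (for example from YAML). The helper `missing_template_keys` is hypothetical, and Syda may perform its own validation:

```python
# The four special fields listed above
REQUIRED_TEMPLATE_KEYS = (
    "__template__",
    "__template_source__",
    "__input_file_type__",
    "__output_file_type__",
)


def missing_template_keys(schema):
    """List the required template keys that a schema dict does not define."""
    return [key for key in REQUIRED_TEMPLATE_KEYS if key not in schema]


# Hypothetical, deliberately incomplete schema
schema = {
    "__template__": True,
    "__template_source__": "templates/medical_report_template.html",
    "patient_name": {"type": "string"},
}
print(missing_template_keys(schema))
# → ['__input_file_type__', '__output_file_type__']
```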

The template file (like HTML) includes variable placeholders that get replaced with generated data. Here's an example of a Jinja2 HTML template for medical reports corresponding to the schema above:

```html
<!DOCTYPE html>
<html>
<head>
    <meta charset="UTF-8">
    <title>Medical Report</title>
    <style>
        body {
            font-family: Arial, sans-serif;
            margin: 40px;
            line-height: 1.6;
        }
        .header {
            text-align: center;
            border-bottom: 2px solid #333;
            padding-bottom: 10px;
            margin-bottom: 20px;
        }
        .section {
            margin-bottom: 20px;
        }
        .section-title {
            font-weight: bold;
            margin-bottom: 5px;
        }
    </style>
</head>
<body>
    <div class="header">
        <h1>MEDICAL REPORT</h1>
    </div>
    
    <div class="section">
        <div class="section-title">PATIENT INFORMATION</div>
        <p>
            <strong>Patient ID:</strong> {{ patient_id }}<br>
            <strong>Name:</strong> {{ patient_name }}<br>
            <strong>Date of Birth:</strong> {{ date_of_birth }}
        </p>
    </div>
    
    <div class="section">
        <div class="section-title">VISIT INFORMATION</div>
        <p>
            <strong>Visit Date:</strong> {{ visit_date }}<br>
            <strong>Chief Complaint:</strong> {{ chief_complaint }}
        </p>
    </div>
    
    <div class="section">
        <div class="section-title">MEDICAL HISTORY</div>
        <p>{{ medical_history }}</p>
    </div>
    
    <div class="section">
        <div class="section-title">VITAL SIGNS</div>
        <p>
            <strong>Blood Pressure:</strong> {{ blood_pressure }}<br>
            <strong>Heart Rate:</strong> {{ heart_rate }} bpm<br>
            <strong>Respiratory Rate:</strong> {{ respiratory_rate }} breaths/min<br>
            <strong>Temperature:</strong> {{ temperature }}°F<br>
            <strong>Oxygen Saturation:</strong> {{ oxygen_saturation }}%
        </p>
    </div>
    
    <div class="section">
        <div class="section-title">ASSESSMENT</div>
        <p>{{ assessment }}</p>
    </div>
</body>
</html>
```

As you can see, the template uses Jinja2's `{{ variable_name }}` syntax to insert the data from the generated schema fields into the HTML document.
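
Because every placeholder must correspond to a schema field, a quick consistency check can catch typos before any documents are rendered. The following is a rough sketch using only the standard library, not Syda's own mechanism. It only recognizes plain `{{ name }}` placeholders, so filter expressions are ignored and loop variables would be reported as undefined. The function name is hypothetical:

```python
import re

# Matches the first identifier after an opening "{{"
PLACEHOLDER = re.compile(r"\{\{\s*([A-Za-z_][A-Za-z0-9_]*)")


def undefined_placeholders(template_text, schema_fields):
    """Return template variables that are missing from schema_fields."""
    names = set(PLACEHOLDER.findall(template_text))
    return sorted(names - set(schema_fields))


# Hypothetical template fragment with one field the schema does not define
template = "<p>{{ patient_name }}</p><p>{{ heart_rate }} bpm</p><p>{{ diagnosis }}</p>"
fields = ["patient_name", "heart_rate"]
print(undefined_placeholders(template, fields))  # → ['diagnosis']
```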

### Supported Template Types

- HTML → PDF: Best supported with complete styling control
- HTML → HTML: Simple text formatting

More template formats will be supported in future versions.

## Combined Structured and Unstructured Data

SYDA excels at generating both structured data (tables/databases) and unstructured content (documents) in a coordinated way.

For working examples, see the [examples/structured_and_unstructured](examples/structured_and_unstructured) directory, which contains retail receipt generation and CRM document examples.


### Connecting Documents to Structured Data

You can create relationships between document schemas and structured data schemas:

```python
from syda.generate import SyntheticDataGenerator

generator = SyntheticDataGenerator()

# Define both structured and template-based schemas
schemas = {
    'Customer': 'schemas/customer.yml',            # Structured data
    'Product': 'schemas/product.yml',              # Structured data
    'Transaction': 'schemas/transaction.yml',      # Structured data
    'Receipt': 'schemas/receipt.yml'               # Template-based document
}
```

Here's what a structured data schema for a `Customer` might look like:

```yaml
# Customer schema (customer.yml)
__table_name__: Customer
__description__: Retail customers

id:
  type: integer
  description: Unique customer ID
  constraints:
    primary_key: true
    not_null: true
    min: 1

first_name:
  type: string
  description: Customer's first name
  constraints:
    not_null: true
    length: 50

last_name:
  type: string
  description: Customer's last name
  constraints:
    not_null: true
    length: 50
    
email:
  type: email
  description: Customer's email address
  constraints:
    not_null: true
    unique: true
    length: 100
```

And here's a template-based document schema for a `Receipt` that references the structured data:

```yaml
# Receipt template schema (receipt.yml)
__template__: true
__description__: Retail receipt template
__name__: Receipt
__depends_on__: [Product, Transaction, Customer]
__foreign_keys__:
  customer_name: [Customer, first_name]
  
__template_source__: templates/receipt.html
__input_file_type__: html
__output_file_type__: pdf

# Receipt header
store_name:
  type: string
  length: 50
  description: Name of the retail store

store_address:
  type: address
  length: 150
  description: Full address of the store

# Receipt details
receipt_number:
  type: string
  pattern: '^RCP-\d{8}$'
  length: 12
  description: Unique receipt identifier

# Product purchase details
items:
  type: array
  description: "List of purchased items with product details"


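```

With the schemas defined, a single call generates the structured data and the documents:

```python
# Generate everything - maintains relationships between structured and document data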
results = generator.generate_for_schemas(
    schemas=schemas,
    output_dir="output"
)

# Results include both DataFrames and generated documents
customers_df = results['Customer']
receipts_df = results['Receipt']     # Contains metadata about generated documents
```
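
To illustrate what a `__foreign_keys__` entry such as `customer_name: [Customer, first_name]` implies, here is a simplified sketch that fills each foreign key field by sampling from rows that already exist in the parent. This is not Syda's implementation. The function `fill_foreign_keys` and the sample rows are hypothetical:

```python
import random


def fill_foreign_keys(rows, foreign_keys, parents, rng=random):
    """Set each foreign key field to a value sampled from its parent column."""
    for row in rows:
        for field, (parent_name, parent_column) in foreign_keys.items():
            candidates = [parent[parent_column] for parent in parents[parent_name]]
            row[field] = rng.choice(candidates)
    return rows


# Hypothetical parent rows and partially generated receipts
parents = {"Customer": [{"id": 1, "first_name": "Ada"}, {"id": 2, "first_name": "Lin"}]}
receipts = [{"receipt_number": "RCP-00000001"}, {"receipt_number": "RCP-00000002"}]

fill_foreign_keys(receipts, {"customer_name": ["Customer", "first_name"]}, parents)
print(all(r["customer_name"] in {"Ada", "Lin"} for r in receipts))  # → True
```

Sampling only from existing parent values is what keeps every reference valid.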

### Schema Dependencies for Documents

Template schemas can specify dependencies on structured schemas:

```yaml
# Receipt template schema (receipt.yml)
__template__: true
__name__: Receipt
__depends_on__: [Product, Transaction, Customer]
__foreign_keys__:
  customer_id: [Customer, id]
__template_source__: templates/receipt.html
__input_file_type__: html
__output_file_type__: pdf
```

This ensures that dependent structured data is generated first, and related documents can reference that data.
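
The ordering implied by `__depends_on__` is a topological sort. The sketch below shows the idea with the standard library's `graphlib` (Python 3.9+). It is an illustration only, and Syda's internal ordering logic may differ. The `Transaction` dependencies are assumed for the example:

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def generation_order(schemas):
    """Order schema names so every schema comes after its dependencies."""
    graph = {name: set(spec.get("__depends_on__", [])) for name, spec in schemas.items()}
    return list(TopologicalSorter(graph).static_order())


schemas = {
    "Receipt": {"__depends_on__": ["Product", "Transaction", "Customer"]},
    "Transaction": {"__depends_on__": ["Customer", "Product"]},  # assumed for illustration
    "Customer": {},
    "Product": {},
}
order = generation_order(schemas)
print(order.index("Receipt") > order.index("Transaction") > order.index("Customer"))  # → True
```

`TopologicalSorter` raises `graphlib.CycleError` if two schemas depend on each other, which is a useful early signal that the schema definitions need fixing.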

Here's an example of a receipt HTML template that uses data from both the receipt schema and the related structured data:

```html
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>Receipt</title>
    <style>
        body {
            font-family: 'Courier New', Courier, monospace;
            font-size: 12px;
            line-height: 1.3;
            max-width: 380px;
            margin: 0 auto;
            padding: 10px;
        }
        .header, .footer {
            text-align: center;
            margin-bottom: 10px;
        }
        .items-table {
            width: 100%;
            margin-bottom: 10px;
        }
        .totals {
            width: 100%;
            margin-bottom: 10px;
        }
    </style>
</head>
<body>
    <div class="header">
        <div class="store-name">{{ store_name }}</div>
        <div>{{ store_address }}</div>
        <div>Tel: {{ store_phone }}</div>
    </div>

    <div class="receipt-details">
        <div>
            <div>Receipt #: {{ receipt_number }}</div>
            <div>Date: {{ transaction_date }}</div>
            <div>Time: {{ transaction_time }}</div>
        </div>
    </div>

    <div class="customer-info">
        <div>Customer: {{ customer_name }}</div>
        <div>Cust ID: {{ customer_id }}</div>
    </div>

    <!-- This iterates through items array generated by the custom generator -->
    <table class="items-table">
        <thead>
            <tr>
                <th>Item</th>
                <th>Qty</th>
                <th>Price</th>
                <th>Total</th>
            </tr>
        </thead>
        <tbody>
            {% for item in items %}
            <tr>
                <td>{{ item.product_name }}<br><small>SKU: {{ item.sku }}</small></td>
                <td>{{ item.quantity }}</td>
                <td>${{ "%.2f"|format(item.unit_price) }}</td>
                <td>${{ "%.2f"|format(item.item_total) }}</td>
            </tr>
            {% endfor %}
        </tbody>
    </table>

    <table class="totals">
        <tr>
            <td>Subtotal:</td>
            <td>${{ "%.2f"|format(subtotal) }}</td>
        </tr>
        <tr>
            <td>Tax ({{ "%.2f"|format(tax_rate) }}%):</td>
            <td>${{ "%.2f"|format(tax_amount) }}</td>
        </tr>
        <tr>
            <td>TOTAL:</td>
            <td>${{ "%.2f"|format(total) }}</td>
        </tr>
    </table>

    <div class="payment-info">
        <div>Payment Method: {{ payment_method }}</div>
    </div>

    <div class="thank-you">
        Thank you for shopping with us!
    </div>
</body>
</html>
```

Note the use of Jinja2's `{% for item in items %}...{% endfor %}` loop to iterate through the `items` array, which is produced by the custom generator described in the next section.

### Custom Generators for Document Data

For advanced use cases, you can define custom generators to map structured data into document fields:

```python
def generate_receipt_items(row, col_name=None, parent_dfs=None):
    """Generate receipt line items based on transaction and product data."""
    items = []
    if parent_dfs and 'Product' in parent_dfs and 'Transaction' in parent_dfs:
        products_df = parent_dfs['Product']
        transactions_df = parent_dfs['Transaction']
        
        # Find transactions for this customer
        customer_transactions = transactions_df[transactions_df['customer_id'] == row['customer_id']]
        
        # Add products from transactions to receipt
        for _, tx in customer_transactions.iterrows():
            matches = products_df[products_df['id'] == tx['product_id']]
            if matches.empty:
                continue  # Skip transactions that reference an unknown product
            product = matches.iloc[0]
            items.append({
                "product_name": product['name'],
                "quantity": tx['quantity'],
                "unit_price": product['price'],
                "item_total": tx['quantity'] * product['price']
            })
    return items

# Register the custom generator
generator.register_generator('array', generate_receipt_items, column_name='items')
```

The `parent_dfs` parameter gives access to all previously generated structured data, allowing you to create rich, interconnected documents.


## SQLAlchemy Models with Templates

You can also use SQLAlchemy models to define both your structured data schema and template-based documents. This approach is great for applications that already use SQLAlchemy ORM:

```python
from sqlalchemy import Column, Integer, String, Float, ForeignKey, Text
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import relationship
from syda.templates import SydaTemplate

Base = declarative_base()

# Regular structured SQLAlchemy model
class Customer(Base):
    __tablename__ = 'customers'
    
    id = Column(Integer, primary_key=True)
    name = Column(String(100), nullable=False)
    industry = Column(String(50))
    annual_revenue = Column(Float)
    website = Column(String(100))
    
    # Relationships
    opportunities = relationship("Opportunity", back_populates="customer")

# Another structured model
class Opportunity(Base):
    __tablename__ = 'opportunities'
    
    id = Column(Integer, primary_key=True)
    customer_id = Column(Integer, ForeignKey('customers.id'), nullable=False)
    name = Column(String(100), nullable=False)
    value = Column(Float, nullable=False)
    description = Column(Text)
    
    # Relationships
    customer = relationship("Customer", back_populates="opportunities")

# Template model
class ProposalDocument(Base):
    __tablename__ = 'proposal_documents'
    
    # Special template attributes
    __template__ = True
    __depends_on__ = ['Opportunity']  # This template depends on the Opportunity model
    
    # Template source configuration
    __template_source__ = 'templates/proposal.html'
    __input_file_type__ = 'html'
    __output_file_type__ = 'pdf'
    
    # Fields needed for the template (these become columns in the generated data)
    id = Column(Integer, primary_key=True)
    opportunity_id = Column(Integer, ForeignKey('opportunities.id'), nullable=False)
    title = Column(String(200))
    customer_name = Column(String(100), ForeignKey('customers.name'))
    opportunity_value = Column(Float, ForeignKey('opportunities.value'))
    proposed_solutions = Column(Text)
```

Then generate all data in one call:

```python
from syda.generate import SyntheticDataGenerator
from syda.schemas import ModelConfig

# Initialize generator
config = ModelConfig(provider="anthropic", model_name="claude-3-5-haiku-20241022")
generator = SyntheticDataGenerator(model_config=config)

# Generate all data at once
results = generator.generate_for_sqlalchemy_models(
    sqlalchemy_models=[Customer, Opportunity, ProposalDocument],
    sample_sizes={'customers': 5, 'opportunities': 8, 'proposal_documents': 3},
    output_dir="output"
)
```

The example above demonstrates:
1. Regular SQLAlchemy models for structured data (Customer, Opportunity)
2. A template model (ProposalDocument)
3. Foreign key relationships between the template and structured models
4. Generating everything together with `generate_for_sqlalchemy_models`


## Model Selection and Configuration

Syda currently supports two AI providers: OpenAI and Anthropic (Claude).



### Basic Configuration

Configure provider, model, temperature, tokens, and proxy settings using the `ModelConfig` class:

```python
from syda.generate import SyntheticDataGenerator
from syda.schemas import ModelConfig, ProxyConfig

# Create a model configuration
config = ModelConfig(
    provider='openai',  # 'openai' or 'anthropic'
    model_name='gpt-4-turbo',  # Model name for the selected provider
    temperature=0.7,    # Controls randomness (0.0-1.0)
    seed=42,            # For reproducible outputs (provider-specific)
    max_tokens=4000,    # Maximum response length (default: 4000)
    proxy=ProxyConfig(  # Optional proxy configuration
        base_url='https://ai-proxy.company.com/v1',
        headers={'X-Company-Auth':'internal-token'},
        params={'team':'data-science'}
    )
)

# Initialize generator with the configuration
generator = SyntheticDataGenerator(model_config=config)
```

### Using Different Model Providers

The library currently supports OpenAI and Anthropic (Claude) models and allows you to easily switch between these providers while maintaining a consistent interface.

#### OpenAI Models

```python
# Default configuration - uses the library's default model if no model_config is provided
default_generator = SyntheticDataGenerator()

# Explicitly configure for GPT-3.5 Turbo (faster and more cost-effective)
openai_config = ModelConfig(
    provider='openai',
    model_name='gpt-3.5-turbo',  # You can also use 'gpt-3.5-turbo-1106' for better JSON handling
    temperature=0.7,
    response_format={"type": "json_object"}  # Forces JSON response format (GPT models)
)
gpt35_generator = SyntheticDataGenerator(model_config=openai_config)

# Generate data with specific model configuration
data = gpt35_generator.generate_data(
    schema={'product_id': 'number', 'product_name': 'text', 'price': 'number'},
    prompt="Generate electronic product data with prices between $500-$2000",
    sample_size=10
)
```

#### Anthropic Claude Models

```python
# Configure for Claude (requires ANTHROPIC_API_KEY environment variable)
claude_config = ModelConfig(
    provider='anthropic',
    model_name='claude-3-sonnet-20240229',  # Available models: claude-3-opus, claude-3-sonnet, claude-3-haiku
    temperature=0.7,
    max_tokens=2000  # Below the 4000 default; raise it if structured output comes back truncated
)
claude_generator = SyntheticDataGenerator(model_config=claude_config)

# Generate data with Claude
data = claude_generator.generate_data(
    schema={'product_id': 'number', 'product_name': 'text', 'price': 'number', 'description': 'text'},
    prompt="Generate luxury product data with realistic prices over $1000",
    sample_size=5
)
```

#### Maximum Tokens Parameter

The library now uses a default of 4000 tokens for `max_tokens` to ensure complete responses with all expected columns. This helps prevent incomplete data generation issues.

```python
# Override the default max_tokens setting
config = ModelConfig(
    provider="openai",
    model_name="gpt-4",
    max_tokens=8000,  # Increase for very complex schemas or large sample sizes
    temperature=0.7
)
```

When generating complex data or data with many columns, consider increasing this value if you notice missing columns in your generated data.
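
One way to notice truncated output is to count, per expected column, how many rows lack a value. The sketch below works on plain row dicts. The helper name `missing_columns` and the sample rows are hypothetical:

```python
def missing_columns(schema, rows):
    """Map each expected column to the number of rows where it is absent or None."""
    gaps = {}
    for column in schema:
        count = sum(1 for row in rows if row.get(column) is None)
        if count:
            gaps[column] = count
    return gaps


schema = {"product_id": "number", "product_name": "text", "price": "number"}
rows = [
    {"product_id": 1, "product_name": "Laptop", "price": 999.0},
    {"product_id": 2, "product_name": "Monitor"},  # price is missing
]
print(missing_columns(schema, rows))  # → {'price': 1}
```

A non-empty result suggests retrying with a higher `max_tokens` or a smaller sample size.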

#### Provider-Specific Optimizations

Each AI provider has different strengths and parameter requirements. The library automatically handles most of the differences, but you can optimize for specific providers:

```python
# OpenAI-specific optimization
openai_optimized = ModelConfig(
    provider='openai',
    model_name='gpt-4-turbo',
    temperature=0.7,
    response_format={"type": "json_object"},  # Only works with OpenAI
    seed=42  # For reproducible outputs
)

# Anthropic-specific optimization
anthropic_optimized = ModelConfig(
    provider='anthropic',
    model_name='claude-3-opus-20240229',
    temperature=0.7,
    system="You are a synthetic data generator that creates realistic, high-quality datasets based on the provided schema."  # System prompt works best with Anthropic
)
```

### Advanced: Direct Access to LLM Client

For advanced use cases, you can access the underlying LLM client directly for additional control:

```python
from syda.llm import create_llm_client
from syda.schemas import ModelConfig

# Create a standalone LLM client
llm_client = create_llm_client(
    model_config=ModelConfig(
        provider='anthropic', 
        model_name='claude-3-opus-20240229'
    ),
    # API key is optional if set in environment variables
    anthropic_api_key="your_api_key"  
)

# Define a Pydantic model for structured output
from pydantic import BaseModel
from typing import List

class Book(BaseModel):
    title: str
    author: str
    year: int
    genre: str
    pages: int

class BookCollection(BaseModel):
    books: List[Book]

# Use the client for structured responses
books = llm_client.client.chat.completions.create(
    model="claude-3-opus-20240229",
    response_model=BookCollection,  # Automatically parses the response to this model
    messages=[{"role": "user", "content": "Generate 5 fictional sci-fi books."}]
)

# Access the structured data directly
for book in books.books:
    print(f"{book.title} by {book.author} ({book.year}) - {book.pages} pages")
```

This approach gives you direct control over the client while still providing structured data extraction capabilities.

## Output Options

Syda offers flexible output options to suit different use cases:

### Multiple Schema Generation

When generating data for multiple schemas using `generate_for_schemas` or `generate_for_sqlalchemy_models`, you can specify an output directory and format:

```python
# Generate and save data to CSV files (default)
results = generator.generate_for_schemas(
    schemas=schemas,
    output_dir="output_directory",
    output_format="csv"  # Default format
)

# Generate and save data to JSON files
results = generator.generate_for_schemas(
    schemas=schemas,
    output_dir="output_directory",
    output_format="json"
)
```

Each schema will be saved to a separate file with the schema name as the filename. For example:

* CSV format: `output_directory/customer.csv`, `output_directory/order.csv`, etc.
* JSON format: `output_directory/customer.json`, `output_directory/order.json`, etc.

The `results` dictionary will still contain all generated DataFrames, so you can both save to files and work with the data directly in your code.
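
To reload saved CSV output in a later session, something like the following sketch could be used. It relies only on the standard library and assumes one CSV file per schema in the output directory, as described above. The helper name `load_csv_outputs` is hypothetical, and all values come back as strings:

```python
import csv
import tempfile
from pathlib import Path


def load_csv_outputs(output_dir):
    """Read every CSV file in output_dir into a list of row dicts, keyed by file stem."""
    tables = {}
    for path in sorted(Path(output_dir).glob("*.csv")):
        with path.open(newline="", encoding="utf-8") as handle:
            tables[path.stem] = list(csv.DictReader(handle))
    return tables


# Demonstration with a temporary directory standing in for output_directory
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "customer.csv").write_text("id,name\n1,Acme\n", encoding="utf-8")
    tables = load_csv_outputs(tmp)

print(tables)  # → {'customer': [{'id': '1', 'name': 'Acme'}]}
```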


## Configuration and Error Handling

### API Keys Management

Provide the API key for the provider you are using. There are two ways to manage API keys:

#### 1. Environment Variables (Recommended)

Set API keys via environment variables:

```bash
# For OpenAI models
export OPENAI_API_KEY=your_openai_key

# For Anthropic models
export ANTHROPIC_API_KEY=your_anthropic_key

# For other providers, set the appropriate environment variables
```

You can also use a `.env` file in your project root and load it with:

```python
from dotenv import load_dotenv
load_dotenv()  # This loads API keys from .env file
```
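
A missing key is easier to diagnose if it is checked before any generation starts. The sketch below uses the two environment variable names shown above. The helper `require_api_key` is hypothetical and not part of Syda:

```python
import os

# Environment variable names as documented above
API_KEY_VARS = {"openai": "OPENAI_API_KEY", "anthropic": "ANTHROPIC_API_KEY"}


def require_api_key(provider, environ=os.environ):
    """Return the API key for provider, or raise a clear error if it is unset."""
    var_name = API_KEY_VARS[provider]
    key = environ.get(var_name)
    if not key:
        raise RuntimeError(f"Set {var_name} before using the '{provider}' provider")
    return key


# A plain dict stands in for the real environment in this demonstration
print(require_api_key("openai", {"OPENAI_API_KEY": "sk-test"}))  # → sk-test
```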

#### 2. Direct Initialization

Provide API keys when initializing the generator:

```python
# With explicit model configuration
generator = SyntheticDataGenerator(
    model_config=ModelConfig(provider='openai', model_name='gpt-4'),
    openai_api_key="your_openai_key",      # Only needed for OpenAI models
    anthropic_api_key="your_anthropic_key"  # Only needed for Anthropic models
)
```


### Error Handling

When data generation fails, Syda reports the failure instead of silently returning unusable data. The library:

1. **Raises Explicit Exceptions**: Generation failures raise an exception rather than returning random data.
2. **Provides Detailed Error Messages**: Messages explain what went wrong and suggest potential fixes.
3. **Validates Output Structure**: Generated data is checked against the expected schema.

Example error handling:

```python
try:
    data = generator.generate_data(
        schema=YourModel,
        prompt="Generate synthetic data...",
        sample_size=10
    )
    # Process the data...
except ValueError as e:
    print(f"Data generation failed: {str(e)}")
    # Implement fallback strategy or retry with different parameters
```
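
One possible retry strategy is exponential backoff around the call that raises `ValueError`. The wrapper below is a sketch and not part of Syda. The name `generate_with_retry` and the `flaky_generate` stand-in are hypothetical:

```python
import time


def generate_with_retry(generate, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call generate() until it succeeds, backing off after each ValueError."""
    for attempt in range(1, attempts + 1):
        try:
            return generate()
        except ValueError:
            if attempt == attempts:
                raise  # Out of retries: surface the original error
            sleep(base_delay * 2 ** (attempt - 1))


# Stand-in for a generation call that fails twice, then succeeds
calls = {"count": 0}


def flaky_generate():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ValueError("incomplete response")
    return [{"id": 1}]


print(generate_with_retry(flaky_generate, sleep=lambda seconds: None))  # → [{'id': 1}]
```

In real use, `generate` would be a small function or lambda that wraps the `generator.generate_data(...)` call from the example above.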

## Contributing

1. Fork the repository.
2. Create a feature branch.
3. Commit your changes.
4. Push to your branch.
5. Open a Pull Request.

## License

See [LICENSE](LICENSE) for details.

            

Raw data

            {
    "_id": null,
    "home_page": null,
    "name": "syda",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.8",
    "maintainer_email": null,
    "keywords": "synthetic data, AI, machine learning, data generation, testing, privacy, SQLAlchemy, OpenAI, Anthropic, Claude, GPT",
    "author": null,
    "author_email": "Rama Krishna Kumar Lingamgunta <lrkkumar2606@gmail.com>",
    "download_url": "https://files.pythonhosted.org/packages/aa/10/a19b5a3a3a14dfd08702eabb7bdefba3f8bdd438640da0da09bbe35c2895/syda-0.0.1.tar.gz",
    "platform": null,
    "description": "# Synthetic Data Generation Library\n\nA Python-based open-source library for generating synthetic data with AI while preserving referential integrity. Allowing seamless use of OpenAI, Anthropic (Claude), and other AI models.\n\n## Table of Contents\n\n* [Features](#features)\n* [Installation](#installation)\n* [Quick Start](#quick-start)\n* [Core API](#core-api)\n  * [Structured Data Generation](#structured-data-generation)\n  * [SQLAlchemy Model Integration](#sqlalchemy-model-integration)\n  * [Handling Foreign Key Relationships](#handling-foreign-key-relationships)\n  * [Multiple Schema Definition Formats](#multiple-schema-definition-formats)\n    * [SQLAlchemy Models](#1-sqlalchemy-models)\n    * [YAML Schema Files](#2-yaml-schema-files)\n    * [JSON Schema Files](#3-json-schema-files)\n    * [Dictionary-Based Schemas](#4-dictionary-based-schemas)\n    * [Foreign Key Definition Methods](#foreign-key-definition-methods)\n  * [Automatic Management of Multiple Related Models](#automatic-management-of-multiple-related-models)\n    * [Using SQLAlchemy Models](#using-sqlalchemy-models)\n    * [Using YAML Schema Files](#using-yaml-schema-files)\n    * [Using JSON Schema Files](#using-json-schema-files)\n    * [Using Dictionary-Based Schemas](#using-dictionary-based-schemas)\n  * [Complete CRM Example](#complete-crm-example)\n* [Metadata Enhancement Benefits with SQLAlchemy Models](#metadata-enhancement-benefits-with-sqlalchemy-models)\n* [Custom Generators for Domain-Specific Data](#custom-generators-for-domain-specific-data)\n* [Unstructured Document Generation](#unstructured-document-generation)\n  * [Template-Based Document Generation](#template-based-document-generation)\n  * [Template Schema Requirements](#template-schema-requirements)\n  * [Supported Template Types](#supported-template-types)\n* [Combined Structured and Unstructured Data](#combined-structured-and-unstructured-data)\n  * [Connecting Documents to Structured 
Data](#connecting-documents-to-structured-data)\n  * [Schema Dependencies for Documents](#schema-dependencies-for-documents)\n  * [Custom Generators for Document Data](#custom-generators-for-document-data)\n* [SQLAlchemy Models with Templates](#sqlalchemy-models-with-templates)\n* [Model Selection and Configuration](#model-selection-and-configuration)\n  * [Basic Configuration](#basic-configuration)\n  * [Using Different Model Providers](#using-different-model-providers)\n    * [OpenAI Models](#openai-models)\n    * [Anthropic Claude Models](#anthropic-claude-models)\n    * [Maximum Tokens Parameter](#maximum-tokens-parameter)\n    * [Provider-Specific Optimizations](#provider-specific-optimizations)\n  * [Advanced: Direct Access to LLM Client](#advanced-direct-access-to-llm-client)\n* [Output Options](#output-options)\n* [Configuration and Error Handling](#configuration-and-error-handling)\n  * [API Keys Management](#api-keys-management)\n    * [Environment Variables (Recommended)](#1-environment-variables-recommended)\n    * [Direct Initialization](#2-direct-initialization)\n  * [Error Handling](#error-handling)\n* [Contributing](#contributing)\n* [License](#license)\n\n## Features\n\n* **Multi-Provider AI Integration**:\n\n  * Seamless integration with multiple AI providers\n  * Support for OpenAI (GPT) and Anthropic (Claude). 
\n  * Default model is Anthropic Claude model claude-3-5-haiku-20241022\n  * Consistent interface across different providers\n  * Provider-specific parameter optimization\n\n* **LLM-based Data Generation**:\n\n  * AI-powered schema understanding and data creation\n  * Contextually-aware synthetic records\n  * Natural language prompt customization\n  * Intelligent schema inference\n\n* **SQLAlchemy Integration**:\n\n  * Automatic extraction of model metadata, docstrings and constraints\n  * Intelligent column-specific data generation\n  * Parameter naming consistency with `sqlalchemy_models`\n  \n* **Multiple Schema Formats**:\n\n  * SQLAlchemy model integration with automatic metadata extraction\n  * YAML/JSON schema file support with full foreign key relationship handling\n  * Python dictionary-based schema definitions\n  \n* **Referential Integrity**\n\n  * Automatic foreign key detection and resolution\n  * Multi-model dependency analysis through topological sorting\n  * Robust handling of related data with referential constraints\n  \n* **Custom Generators**\n\n  * Register column- or type-specific functions for domain-specific data\n  * Contextual generators that adapt to other fields (like ICD-10 codes based on demographics)\n  * Weighted distributions for realistic data patterns\n\n\n## Installation\n\nInstall the package using pip:\n\n```bash\npip install syda\n```\n\n## Quick Start\n\n```python\nfrom syda.structured import SyntheticDataGenerator\nfrom syda.schemas import ModelConfig\n\nmodel_config = ModelConfig(\n    provider=\"anthropic\",\n    model_name=\"claude-3-5-haiku-20241022\",\n    temperature=0.7,\n    max_tokens=8192  # Larger value for more complete responses\n)\n\ngenerator = SyntheticDataGenerator(model_config=model_config)\n\n# Define schema for a single table\nschemas = {\n    'Patient': {\n        'patient_id': 'number',\n        'diagnosis_code': 'icd10_code',\n        'email': 'email',\n        'visit_date': 'date',\n        'notes': 
'text'\n    }\n}\n\nprompt = \"Generate realistic synthetic patient records with ICD-10 diagnosis codes, emails, visit dates, and clinical notes.\"\n\n# Generate and save to CSV\nresults = generator.generate_for_schemas(\n    schemas=schemas,\n    prompts={'Patient': prompt},\n    sample_sizes={'Patient': 15},\n    output_dir='synthetic_output'\n)\nprint(f\"Data saved to synthetic_output/Patient.csv\")\n```\n\n## Core API\n\n### Structured Data Generation\n\nUse simple schema maps or SQLAlchemy models to generate data:\n\n```python\nfrom syda.structured import SyntheticDataGenerator\nfrom syda.schemas import ModelConfig\n\nmodel_config = ModelConfig(provider='anthropic', model_name='claude-3-5-haiku-20241022')\ngenerator = SyntheticDataGenerator(model_config=model_config)\n\n# Simple dict schema\nschemas = {\n    'User': {'id': 'number', 'name': 'text'}\n}\nresults = generator.generate_for_schemas(\n    schemas=schemas,\n    prompts={'User': 'Generate user records'},\n    sample_sizes={'User': 10}\n)\n```\n\n### SQLAlchemy Model Integration\n\nPass declarative models directly\u2014docstrings and column metadata inform the prompt:\n\n```python\nfrom sqlalchemy.ext.declarative import declarative_base\nfrom sqlalchemy import Column, Integer, String\nfrom syda.structured import SyntheticDataGenerator\nfrom syda.schemas import ModelConfig\n\nBase = declarative_base()\nclass User(Base):\n    __tablename__ = 'users'\n    id = Column(Integer, primary_key=True)\n    name = Column(String, comment=\"Full name of the user\")\n\nmodel_config = ModelConfig(provider='anthropic', model_name='claude-3-5-haiku-20241022')\ngenerator = SyntheticDataGenerator(model_config=model_config)\nresults = generator.generate_for_sqlalchemy_models(\n    sqlalchemy_models=[User], \n    prompts={'User': 'Generate users'}, \n    sample_sizes={'User': 5}\n)\n```\n\n### SQLAlchemy Model Integration\n\nPass declarative models directly\u2014docstrings and column metadata inform the 
prompt:\n\n```python\nfrom sqlalchemy.ext.declarative import declarative_base\nfrom sqlalchemy import Column, Integer, String\nfrom syda.structured import SyntheticDataGenerator\nfrom syda.schemas import ModelConfig\n\nBase = declarative_base()\nclass User(Base):\n    __tablename__ = 'users'\n    id = Column(Integer, primary_key=True)\n    name = Column(String, comment=\"Full name of the user\")\n\nmodel_config = ModelConfig(provider='anthropic', model_name='claude-3-5-haiku-20241022')\ngenerator = SyntheticDataGenerator(model_config=model_config)\nresults = generator.generate_for_sqlalchemy_models(\n    sqlalchemy_models=[User], \n    prompts={'users': 'Generate users'}, \n    sample_sizes={'users': 5}\n)\n```\n\n> **Important:** SQLAlchemy models **must** have either `__table__` or `__tablename__` specified. Without one of these attributes, the model cannot be properly processed by the system. The `__tablename__` attribute defines the name of the database table and is used as the schema name when generating data. For example, a model with `__tablename__ = 'users'` will be referenced as 'users' in prompts, sample_sizes, custom generators and the returned results dictionary.\n\n\n### Handling Foreign Key Relationships\n\nThe library provides robust support for handling foreign key relationships with referential integrity:\n\n1. **Automatic Foreign Key Detection**: Foreign keys are automatically detected from your YAML, JSON, dictionary, or SQLAlchemy model definitions and assigned the type `'foreign_key'`.\n2. **Manual Column-Specific Foreign Key Generators**: You can also manually define a foreign key generator for a specific column, as in the snippet below:\n\n```python\nimport random\n\n# After generating departments and loading them into departments_df:\ndef department_id_fk_generator(row, col_name):\n    # Sample an existing parent ID so that every reference is valid\n    return random.choice(departments_df['id'].tolist())\ngenerator.register_generator('foreign_key', department_id_fk_generator, column_name='department_id')\n```\n\n3. 
**Multi-Step Generation Process**: For related tables, generate parent records first, then use their IDs when generating child records:\n\n```python\n# Generate departments first, then employees with valid department_id references\nresults = generator.generate_for_sqlalchemy_models(\n    sqlalchemy_models=[Department, Employee],\n    prompts={\n        'departments': 'Generate company departments',\n        'employees': 'Generate realistic employee data'\n    },\n    sample_sizes={\n        'departments': 5,\n        'employees': 10\n    }\n)\n\n# Access the generated dataframes\ndepartments_df = results['departments']\nemployees_df = results['employees']\n```\n\n4. **Referential Integrity Preservation**: The foreign key generator samples from actual existing IDs in the parent table, ensuring all references are valid.\n5. **Metadata-Enhanced Foreign Keys**: Column comments on foreign key fields are preserved and included in the prompt, helping the LLM understand the relationship context.\n\n\n### Multiple Schema Definition Formats\n\n\n> **Note:** For detailed information on supported field types and schema format, see the [Schema Reference](schema_reference.md) document.\n\n\nSyda supports defining your data models in multiple formats, all leading to the same synthetic data generation capabilities. Choose the format that best suits your workflow:\n\n#### 1. 
SQLAlchemy Models\n\n```python\nfrom sqlalchemy import Column, Integer, String, ForeignKey, Float, Date\nfrom sqlalchemy.ext.declarative import declarative_base\n\nBase = declarative_base()\n\nclass Customer(Base):\n    __tablename__ = 'customers'\n    __doc__ = \"\"\"Customer organization that places orders\"\"\"\n    \n    id = Column(Integer, primary_key=True)\n    name = Column(String(100), nullable=False, comment=\"Company name\")\n    status = Column(String(20), comment=\"Customer status (Active/Inactive/Prospect)\")\n\nclass Order(Base):\n    __tablename__ = 'orders'\n    __doc__ = \"\"\"Customer order for products or services\"\"\"\n    \n    id = Column(Integer, primary_key=True)\n    customer_id = Column(Integer, ForeignKey('customers.id'), nullable=False)\n    order_date = Column(Date, nullable=False, comment=\"Date when order was placed\")\n    total_amount = Column(Float, comment=\"Total monetary value of the order in USD\")\n\n# Generate data from SQLAlchemy models\nresults = generator.generate_for_sqlalchemy_models(\n    sqlalchemy_models=[Customer, Order],\n    prompts={\"customers\": \"Generate tech companies\"},\n    sample_sizes={\"customers\": 10, \"orders\": 30}\n)\n```\n\n#### 2. 
YAML Schema Files\n\n```yaml\n# customer.yaml\n__table_description__: Customer organization that places orders\nid:\n  type: number\n  primary_key: true\nname:\n  type: text\n  max_length: 100\n  not_null: true\n  description: Company name\nstatus:\n  type: text\n  max_length: 20\n  description: Customer status (Active/Inactive/Prospect)\n```\n\n```yaml\n# order.yaml\n__table_description__: Customer order for products or services\n__foreign_keys__:\n  customer_id: [Customer, id]\nid:\n  type: number\n  primary_key: true\ncustomer_id:\n  type: foreign_key\n  not_null: true\n  description: Reference to the customer who placed the order\norder_date:\n  type: date\n  not_null: true\n  description: Date when order was placed\ntotal_amount:\n  type: number\n  description: Total monetary value of the order in USD\n```\n\n```python\n# Generate data from YAML schema files\nresults = generator.generate_for_schemas(\n    schemas={\n        'Customer': 'schemas/customer.yaml',\n        'Order': 'schemas/order.yaml'\n    },\n    prompts={'Customer': 'Generate tech companies'},\n    sample_sizes={'Customer': 10, 'Order': 30}\n)\n```\n\n#### 3. 
JSON Schema Files\n\n```json\n// customer.json\n{\n  \"__table_description__\": \"Customer organization that places orders\",\n  \"id\": {\n    \"type\": \"number\",\n    \"primary_key\": true\n  },\n  \"name\": {\n    \"type\": \"text\",\n    \"max_length\": 100,\n    \"not_null\": true,\n    \"description\": \"Company name\"\n  },\n  \"status\": {\n    \"type\": \"text\",\n    \"max_length\": 20,\n    \"description\": \"Customer status (Active/Inactive/Prospect)\"\n  }\n}\n```\n\n```json\n// order.json\n{\n  \"__table_description__\": \"Customer order for products or services\",\n  \"__foreign_keys__\": {\n    \"customer_id\": [\"Customer\", \"id\"]\n  },\n  \"id\": {\n    \"type\": \"number\",\n    \"primary_key\": true\n  },\n  \"customer_id\": {\n    \"type\": \"foreign_key\",\n    \"not_null\": true,\n    \"description\": \"Reference to the customer who placed the order\"\n  },\n  \"order_date\": {\n    \"type\": \"date\",\n    \"not_null\": true,\n    \"description\": \"Date when order was placed\"\n  },\n  \"total_amount\": {\n    \"type\": \"number\",\n    \"description\": \"Total monetary value of the order in USD\"\n  }\n}\n```\n\n```python\n# Generate data from JSON schema files\nresults = generator.generate_for_schemas(\n    schemas={\n        'Customer': 'schemas/customer.json',\n        'Order': 'schemas/order.json'\n    },\n    prompts={'Customer': 'Generate tech companies'},\n    sample_sizes={'Customer': 10, 'Order': 30}\n)\n```\n\n#### 4. 
Dictionary-Based Schemas\n\n```python\n# Define schemas directly as dictionaries\nschemas = {\n    'Customer': {\n        '__table_description__': 'Customer organization that places orders',\n        'id': {'type': 'number', 'primary_key': True},\n        'name': {\n            'type': 'text',\n            'max_length': 100,\n            'not_null': True,\n            'description': 'Company name'\n        },\n        'status': {\n            'type': 'text',\n            'max_length': 20,\n            'description': 'Customer status (Active/Inactive/Prospect)'\n        }\n    },\n    'Order': {\n        '__table_description__': 'Customer order for products or services',\n        '__foreign_keys__': {\n            'customer_id': ['Customer', 'id']\n        },\n        'id': {'type': 'number', 'primary_key': True},\n        'customer_id': {\n            'type': 'foreign_key',\n            'not_null': True,\n            'description': 'Reference to the customer who placed the order'\n        },\n        'order_date': {\n            'type': 'date',\n            'not_null': True,\n            'description': 'Date when order was placed'\n        },\n        'total_amount': {\n            'type': 'number',\n            'description': 'Total monetary value of the order in USD'\n        }\n    }\n}\n\n# Generate data from dictionary schemas\nresults = generator.generate_for_schemas(\n    schemas=schemas,\n    prompts={'Customer': 'Generate tech companies'},\n    sample_sizes={'Customer': 10, 'Order': 30}\n)\n```\n\n#### Foreign Key Definition Methods\n\nThere are three ways to define foreign key relationships:\n\n1. Using the `__foreign_keys__` special section in a schema:\n   ```python\n   \"__foreign_keys__\": {\n       \"customer_id\": [\"Customer\", \"id\"]\n   }\n   ```\n\n2. 
Using field-level references with type and references properties:\n   ```python\n   \"order_id\": {\n       \"type\": \"foreign_key\",\n       \"references\": {\n           \"schema\": \"Order\",\n           \"field\": \"id\"\n       }\n   }\n   ```\n\n3. Using type-based detection with naming conventions:\n   ```python\n   \"customer_id\": \"foreign_key\"\n   ```\n   (The system will attempt to infer the relationship based on naming conventions)\n\n### Automatic Management of Multiple Related Models\n\n#### Using SQLAlchemy Models\n\nSimplify multi-table workflows with `generate_for_sqlalchemy_models`:\n\n```python\nfrom sqlalchemy import Column, Integer, String, Float, ForeignKey, Date, Text\nfrom sqlalchemy.ext.declarative import declarative_base\nfrom sqlalchemy.orm import relationship\nfrom datetime import datetime, timedelta\nimport random\nfrom syda.generate import SyntheticDataGenerator\n\nBase = declarative_base()\n\n# Customer model\nclass Customer(Base):\n    __tablename__ = 'customers'\n    \n    id = Column(Integer, primary_key=True)\n    name = Column(String(100), nullable=False)\n    industry = Column(String(50))\n    status = Column(String(20))\n    contacts = relationship(\"Contact\", back_populates=\"customer\")\n    orders = relationship(\"Order\", back_populates=\"customer\")\n\n# Contact model with foreign key to Customer\nclass Contact(Base):\n    __tablename__ = 'contacts'\n    \n    id = Column(Integer, primary_key=True)\n    customer_id = Column(Integer, ForeignKey('customers.id'), nullable=False)\n    name = Column(String(100), nullable=False)\n    email = Column(String(120), nullable=False)\n    phone = Column(String(20))\n    customer = relationship(\"Customer\", back_populates=\"contacts\")\n\n# Product model\nclass Product(Base):\n    __tablename__ = 'products'\n    \n    id = Column(Integer, primary_key=True)\n    name = Column(String(100), nullable=False)\n    description = Column(Text)\n    price = Column(Float, nullable=False)\n    
order_items = relationship(\"OrderItem\", back_populates=\"product\")\n\n# Order model with foreign key to Customer\nclass Order(Base):\n    __tablename__ = 'orders'\n    \n    id = Column(Integer, primary_key=True)\n    customer_id = Column(Integer, ForeignKey('customers.id'), nullable=False)\n    order_date = Column(Date, nullable=False)\n    total_amount = Column(Float)\n    customer = relationship(\"Customer\", back_populates=\"orders\")\n    order_items = relationship(\"OrderItem\", back_populates=\"order\")\n\n# OrderItem model with foreign keys to Order and Product\nclass OrderItem(Base):\n    __tablename__ = 'order_items'\n    \n    id = Column(Integer, primary_key=True)\n    order_id = Column(Integer, ForeignKey('orders.id'), nullable=False)\n    product_id = Column(Integer, ForeignKey('products.id'), nullable=False)\n    quantity = Column(Integer, nullable=False)\n    price = Column(Float, nullable=False)\n    order = relationship(\"Order\", back_populates=\"order_items\")\n    product = relationship(\"Product\", back_populates=\"order_items\")\n\n# Initialize generator\ngenerator = SyntheticDataGenerator()\n\n# Generate data for all models in one call\nresults = generator.generate_for_sqlalchemy_models(\n    sqlalchemy_models=[Customer, Contact, Product, Order, OrderItem],\n    prompts={\n        \"customers\": \"Generate diverse customer organizations for a B2B SaaS company.\",\n        \"products\": \"Generate cloud software products and services.\"\n    },\n    sample_sizes={\n        \"customers\": 10,\n        \"contacts\": 25,\n        \"products\": 15,\n        \"orders\": 30,\n        \"order_items\": 60\n    },\n    custom_generators={\n        \"customers\": {\n            # Ensure a specific distribution of customer statuses for business reporting\n            \"status\": lambda row, col: random.choice([\"Active\", \"Inactive\", \"Prospect\"]),\n        },\n        \"products\": {\n            # Ensure product categories match your specific business 
domains\n            \"category\": lambda row, col: random.choice([\n                \"Cloud Infrastructure\", \"Business Intelligence\", \"Security Services\",\n                \"Data Analytics\", \"Custom Development\", \"Support Package\", \"API Services\"\n            ])\n        },\n    }\n)\n```\n\n#### Using YAML Schema Files\n\nThe same relationship management is available with YAML schemas:\n\n```yaml\n# customer.yaml\n__table_name__: customers\n__description__: Customer organizations\n\nid:\n  type: integer\n  constraints:\n    primary_key: true\n    not_null: true\n\nname:\n  type: string\n  constraints:\n    not_null: true\n    max_length: 100\n\nindustry:\n  type: string\n  constraints:\n    max_length: 50\n\nstatus:\n  type: string\n  constraints:\n    max_length: 20\n```\n\n```yaml\n# contact.yaml\n__table_name__: contacts\n__description__: Customer contacts\n__foreign_keys__:\n  customer_id: [customers, id]\n\nid:\n  type: integer\n  constraints:\n    primary_key: true\n    not_null: true\n\ncustomer_id:\n  type: integer\n  constraints:\n    not_null: true\n\nname:\n  type: string\n  constraints:\n    not_null: true\n    max_length: 100\n\nemail:\n  type: string\n  constraints:\n    not_null: true\n    max_length: 120\n\nphone:\n  type: string\n  constraints:\n    max_length: 20\n```\n\n```yaml\n# order.yaml\n__table_name__: orders\n__description__: Customer orders\n__foreign_keys__:\n  customer_id: [customers, id]\n\nid:\n  type: integer\n  constraints:\n    primary_key: true\n    not_null: true\n\ncustomer_id:\n  type: integer\n  constraints:\n    not_null: true\n\norder_date:\n  type: string\n  format: date\n  constraints:\n    not_null: true\n\ntotal_amount:\n  type: number\n  format: float\n```\n\n```python\n# Generate data for multiple related tables with YAML schemas\nresults = generator.generate_for_schemas(\n    schemas={\n        'Customer': 'schemas/customer.yaml',\n        'Contact': 'schemas/contact.yaml',\n        'Product': 
'schemas/product.yaml',\n        'Order': 'schemas/order.yaml',\n        'OrderItem': 'schemas/order_item.yaml'\n    },\n    prompts={\n        \"Customer\": \"Generate diverse customer organizations for a B2B SaaS company.\",\n        \"Product\": \"Generate cloud software products and services.\"\n    },\n    sample_sizes={\n        \"Customer\": 10,\n        \"Contact\": 20,\n        \"Product\": 15,\n        \"Order\": 30,\n        \"OrderItem\": 60\n    }\n)\n```\n\n#### Using JSON Schema Files\n\nJSON schema files offer the same capabilities:\n\n```json\n// customer.json\n{\n  \"__table_name__\": \"customers\",\n  \"__description__\": \"Customer organizations\",\n  \"id\": {\n    \"type\": \"integer\",\n    \"constraints\": {\n      \"primary_key\": true,\n      \"not_null\": true\n    }\n  },\n  \"name\": {\n    \"type\": \"string\",\n    \"constraints\": {\n      \"not_null\": true,\n      \"max_length\": 100\n    }\n  },\n  \"industry\": {\n    \"type\": \"string\",\n    \"constraints\": {\n      \"max_length\": 50\n    }\n  },\n  \"status\": {\n    \"type\": \"string\",\n    \"constraints\": {\n      \"max_length\": 20\n    }\n  }\n}\n```\n\n```json\n// contact.json\n{\n  \"__table_name__\": \"contacts\",\n  \"__description__\": \"Customer contacts\",\n  \"__foreign_keys__\": {\n    \"customer_id\": [\"customers\", \"id\"]\n  },\n  \"id\": {\n    \"type\": \"integer\",\n    \"constraints\": {\n      \"primary_key\": true,\n      \"not_null\": true\n    }\n  },\n  \"customer_id\": {\n    \"type\": \"integer\",\n    \"constraints\": {\n      \"not_null\": true\n    }\n  },\n  \"name\": {\n    \"type\": \"string\",\n    \"constraints\": {\n      \"not_null\": true,\n      \"max_length\": 100\n    }\n  },\n  \"email\": {\n    \"type\": \"string\",\n    \"constraints\": {\n      \"not_null\": true,\n      \"max_length\": 120\n    }\n  },\n  \"phone\": {\n    \"type\": \"string\",\n    \"constraints\": {\n      \"max_length\": 20\n    }\n  
}\n}\n```\n\n```json\n// order.json\n{\n  \"__table_name__\": \"orders\",\n  \"__description__\": \"Customer orders\",\n  \"__foreign_keys__\": {\n    \"customer_id\": [\"customers\", \"id\"]\n  },\n  \"id\": {\n    \"type\": \"integer\",\n    \"constraints\": {\n      \"primary_key\": true,\n      \"not_null\": true\n    }\n  },\n  \"customer_id\": {\n    \"type\": \"integer\",\n    \"constraints\": {\n      \"not_null\": true\n    }\n  },\n  \"order_date\": {\n    \"type\": \"string\",\n    \"format\": \"date\",\n    \"constraints\": {\n      \"not_null\": true\n    }\n  },\n  \"total_amount\": {\n    \"type\": \"number\",\n    \"format\": \"float\"\n  }\n}\n```\n\n```python\n# Generate data for multiple related tables with JSON schemas\nresults = generator.generate_for_schemas(\n    schemas={\n        'Customer': 'schemas/customer.json',\n        'Contact': 'schemas/contact.json',\n        'Product': 'schemas/product.json',\n        'Order': 'schemas/order.json',\n        'OrderItem': 'schemas/order_item.json'\n    },\n    prompts={\n        \"Customer\": \"Generate diverse customer organizations for a B2B SaaS company.\",\n        \"Product\": \"Generate cloud software products and services.\"\n    },\n    sample_sizes={\n        \"Customer\": 10,\n        \"Contact\": 20,\n        \"Product\": 15,\n        \"Order\": 30,\n        \"OrderItem\": 60\n    }\n)\n```\n\n#### Using Dictionary-Based Schemas\n\nSimilar relationship management works with dictionary schemas:\n\n```python\n# Define schemas as Python dictionaries\nschemas = {\n    'Customer': {\n        '__table_name__': 'customers',\n        '__description__': 'Customer organizations',\n        'id': {\n            'type': 'integer',\n            'constraints': {'primary_key': True, 'not_null': True}\n        },\n        'name': {\n            'type': 'string',\n            'constraints': {'not_null': True, 'max_length': 100}\n        },\n        'industry': {\n            'type': 'string',\n            
'constraints': {'max_length': 50}\n        },\n        'status': {\n            'type': 'string',\n            'constraints': {'max_length': 20}\n        }\n    },\n    'Contact': {\n        '__table_name__': 'contacts',\n        '__description__': 'Customer contacts',\n        '__foreign_keys__': {\n            'customer_id': ['customers', 'id']\n        },\n        'id': {\n            'type': 'integer',\n            'constraints': {'primary_key': True, 'not_null': True}\n        },\n        'customer_id': {\n            'type': 'integer',\n            'constraints': {'not_null': True}\n        },\n        'name': {\n            'type': 'string',\n            'constraints': {'not_null': True, 'max_length': 100}\n        },\n        'email': {\n            'type': 'string',\n            'constraints': {'not_null': True, 'max_length': 120}\n        },\n        'phone': {\n            'type': 'string',\n            'constraints': {'max_length': 20}\n        }\n    },\n    'Order': {\n        '__table_name__': 'orders',\n        '__description__': 'Customer orders',\n        '__foreign_keys__': {\n            'customer_id': ['customers', 'id']\n        },\n        'id': {\n            'type': 'integer',\n            'constraints': {'primary_key': True, 'not_null': True}\n        },\n        'customer_id': {\n            'type': 'integer',\n            'constraints': {'not_null': True}\n        },\n        'order_date': {\n            'type': 'string',\n            'format': 'date',\n            'constraints': {'not_null': True}\n        },\n        'total_amount': {\n            'type': 'number',\n            'format': 'float'\n        }\n    }\n}\n\n# Generate data for dictionary schemas\nresults = generator.generate_for_schemas(\n    schemas=schemas,\n    prompts={\n        'Customer': 'Generate diverse customer organizations for a B2B SaaS company.'\n    },\n    sample_sizes={\n        'Customer': 10,\n        'Contact': 20,\n        'Order': 30\n    
}\n)\n```\n\nIn all cases, the generator will:\n1. Analyze relationships between models/schemas\n2. Determine the correct generation order using topological sorting\n3. Generate parent tables first\n4. Use existing primary keys when populating foreign keys in child tables\n5. Maintain referential integrity across the entire dataset\n\n\n### Complete CRM Example\n\nHere\u2019s a comprehensive example demonstrating `generate_for_sqlalchemy_models` across five interrelated models, including entity definitions, prompt setup, and data verification:\n\n```python\n#!/usr/bin/env python\nimport random\nimport datetime\nfrom sqlalchemy import Column, Integer, String, ForeignKey, Float, Date, Boolean, Text\nfrom sqlalchemy.orm import declarative_base, relationship\nfrom syda.structured import SyntheticDataGenerator\n\nBase = declarative_base()\n\nclass Customer(Base):\n    __tablename__ = 'customers'\n    id = Column(Integer, primary_key=True)\n    name = Column(String(100), unique=True, comment=\"Customer organization name\")\n    industry = Column(String(50), comment=\"Customer's primary industry\")\n    website = Column(String(100), comment=\"Customer's website URL\")\n    status = Column(String(20), comment=\"Active, Inactive, Prospect\")\n    created_at = Column(Date, default=datetime.date.today, comment=\"Date when added to CRM\")\n    contacts = relationship(\"Contact\", back_populates=\"customer\")\n    orders = relationship(\"Order\", back_populates=\"customer\")\n\nclass Contact(Base):\n    __tablename__ = 'contacts'\n    id = Column(Integer, primary_key=True)\n    customer_id = Column(Integer, ForeignKey('customers.id'), comment=\"Customer this contact belongs to\")\n    first_name = Column(String(50), comment=\"Contact's first name\")\n    last_name = Column(String(50), comment=\"Contact's last name\")\n    email = Column(String(100), unique=True, comment=\"Contact's email address\")\n    phone = Column(String(20), comment=\"Contact's phone number\")\n    
position = Column(String(100), comment=\"Job title or position\")\n    is_primary = Column(Boolean, default=False, comment=\"Primary contact flag\")\n    customer = relationship(\"Customer\", back_populates=\"contacts\")\n\nclass Product(Base):\n    __tablename__ = 'products'\n    id = Column(Integer, primary_key=True)\n    name = Column(String(100), unique=True, comment=\"Product name\")\n    category = Column(String(50), comment=\"Product category\")\n    price = Column(Float, comment=\"Product price in USD\")\n    description = Column(Text, comment=\"Product description\")\n    order_items = relationship(\"OrderItem\", back_populates=\"product\")\n\nclass Order(Base):\n    __tablename__ = 'orders'\n    id = Column(Integer, primary_key=True)\n    customer_id = Column(Integer, ForeignKey('customers.id'), comment=\"Customer who placed the order\")\n    order_date = Column(Date, comment=\"Date when order was placed\")\n    status = Column(String(20), comment=\"Order status: New, Processing, Shipped, Delivered, Cancelled\")\n    total_amount = Column(Float, comment=\"Total amount in USD\")\n    customer = relationship(\"Customer\", back_populates=\"orders\")\n    items = relationship(\"OrderItem\", back_populates=\"order\")\n\nclass OrderItem(Base):\n    __tablename__ = 'order_items'\n    id = Column(Integer, primary_key=True)\n    order_id = Column(Integer, ForeignKey('orders.id'), comment=\"Order this item belongs to\")\n    product_id = Column(Integer, ForeignKey('products.id'), comment=\"Product in the order\")\n    quantity = Column(Integer, comment=\"Quantity ordered\")\n    unit_price = Column(Float, comment=\"Unit price at order time\")\n    order = relationship(\"Order\", back_populates=\"items\")\n    product = relationship(\"Product\", back_populates=\"order_items\")\n\n\ndef main():\n    generator = SyntheticDataGenerator(model='gpt-4')\n    output_dir = 'crm_data'\n    prompts = {\n        \"customers\": \"Generate diverse customer organizations for a 
B2B SaaS company.\",\n        \"products\": \"Generate products for a cloud software company.\",\n        \"orders\": \"Generate realistic orders with appropriate dates and statuses.\"\n    }\n    sample_sizes = {\"customers\": 10, \"contacts\": 25, \"products\": 15, \"orders\": 30, \"order_items\": 60}\n\n    results = generator.generate_for_sqlalchemy_models(\n        sqlalchemy_models=[Customer, Contact, Product, Order, OrderItem],\n        prompts=prompts,\n        sample_sizes=sample_sizes,\n        output_dir=output_dir\n    )\n\n    # Referential integrity checks (results are keyed by __tablename__)\n    print(\"\\n\ud83d\udd0d Verifying referential integrity:\")\n    if set(results['contacts']['customer_id']).issubset(set(results['customers']['id'])):\n        print(\"  \u2705 All Contact.customer_id values are valid.\")\n    if set(results['order_items']['product_id']).issubset(set(results['products']['id'])):\n        print(\"  \u2705 All OrderItem.product_id values are valid.\")\n\n\nif __name__ == '__main__':\n    main()\n```\n\n## Metadata Enhancement Benefits with SQLAlchemy Models\n\n* **Richer Context**: Leverages docstrings, comments, and column constraints to enrich prompts.\n* **Simpler Prompts**: Less manual specification; model infers details.\n* **Constraint Awareness**: Respects `nullable`, `unique`, and length constraints.\n* **Custom Generators**: Column-level functions for fine-tuned data.\n* **Automatic Docstring Utilization**: Embeds business context from model definitions.\n\n\n## Unstructured Document Generation\n\nSYDA can generate realistic unstructured documents such as PDF reports, letters, and forms based on templates. 
This is useful for applications that require document generation with synthetic data.\n\nFor complete examples, see the [examples/unstructured_only](examples/unstructured_only) directory, which includes healthcare document generation samples.\n\n### Template-Based Document Generation\n\nCreate template-based document schemas by specifying template fields in your schema:\n\n```python\nfrom syda.generate import SyntheticDataGenerator\nfrom syda.schemas import ModelConfig\n\n# Initialize generator \nconfig = ModelConfig(provider=\"anthropic\", model_name=\"claude-3-5-haiku-20241022\")\ngenerator = SyntheticDataGenerator(model_config=config)\n\n# Define template-based schemas\nschemas = {\n    'MedicalReport': 'schemas/medical_report.yml',\n    'LabResult': 'schemas/lab_result.yml'\n}\n```\n\nHere's an example of a medical report template schema:\n\n```yaml\n# Medical report template schema (medical_report.yml)\n__template__: true\n__description__: Medical report template for patient visits\n__name__: MedicalReport\n__foreign_keys__: {}\n__template_source__: templates/medical_report_template.html\n__input_file_type__: html\n__output_file_type__: pdf\n\n# Patient information\npatient_id:\n  type: string\n  format: uuid\n\npatient_name:\n  type: string\n\ndate_of_birth:\n  type: string\n  format: date\n\nvisit_date:\n  type: string\n  format: date-time\n\nchief_complaint:\n  type: string\n\nmedical_history:\n  type: string\n\n# Vital signs\nblood_pressure:\n  type: string\n\nheart_rate:\n  type: integer\n\nrespiratory_rate:\n  type: integer\n\ntemperature:\n  type: number\n\noxygen_saturation:\n  type: integer\n\n# Clinical information\nassessment:\n  type: string\n```\n\nWith the schemas defined, generate the data and the PDF documents:\n\n```python\n# Generate data and PDF documents\nresults = generator.generate_for_schemas(\n    schemas=schemas,\n    sample_sizes={\n        'MedicalReport': 5,\n        'LabResult': 5\n    },\n    prompts={\n        'MedicalReport': 'Generate synthetic medical reports for patients',\n        'LabResult': 'Generate 
synthetic laboratory test results for patients'\n    },\n    output_dir=\"output\"\n)\n```\n\n### Template Schema Requirements\n\nTemplate-based schemas must include these special fields:\n\n```yaml\n__template__: true\n__template_source__: /path/to/template.html\n__input_file_type__: html\n__output_file_type__: pdf\n```\n\nThe template file (like HTML) includes variable placeholders that get replaced with generated data. Here's an example of a Jinja2 HTML template for medical reports corresponding to the schema above:\n\n```html\n<!DOCTYPE html>\n<html>\n<head>\n    <meta charset=\"UTF-8\">\n    <title>Medical Report</title>\n    <style>\n        body {\n            font-family: Arial, sans-serif;\n            margin: 40px;\n            line-height: 1.6;\n        }\n        .header {\n            text-align: center;\n            border-bottom: 2px solid #333;\n            padding-bottom: 10px;\n            margin-bottom: 20px;\n        }\n        .section {\n            margin-bottom: 20px;\n        }\n        .section-title {\n            font-weight: bold;\n            margin-bottom: 5px;\n        }\n    </style>\n</head>\n<body>\n    <div class=\"header\">\n        <h1>MEDICAL REPORT</h1>\n    </div>\n    \n    <div class=\"section\">\n        <div class=\"section-title\">PATIENT INFORMATION</div>\n        <p>\n            <strong>Patient ID:</strong> {{ patient_id }}<br>\n            <strong>Name:</strong> {{ patient_name }}<br>\n            <strong>Date of Birth:</strong> {{ date_of_birth }}\n        </p>\n    </div>\n    \n    <div class=\"section\">\n        <div class=\"section-title\">VISIT INFORMATION</div>\n        <p>\n            <strong>Visit Date:</strong> {{ visit_date }}<br>\n            <strong>Chief Complaint:</strong> {{ chief_complaint }}\n        </p>\n    </div>\n    \n    <div class=\"section\">\n        <div class=\"section-title\">MEDICAL HISTORY</div>\n        <p>{{ medical_history }}</p>\n    </div>\n    \n    <div class=\"section\">\n  
      <div class=\"section-title\">VITAL SIGNS</div>\n        <p>\n            <strong>Blood Pressure:</strong> {{ blood_pressure }}<br>\n            <strong>Heart Rate:</strong> {{ heart_rate }} bpm<br>\n            <strong>Respiratory Rate:</strong> {{ respiratory_rate }} breaths/min<br>\n            <strong>Temperature:</strong> {{ temperature }}\u00b0F<br>\n            <strong>Oxygen Saturation:</strong> {{ oxygen_saturation }}%\n        </p>\n    </div>\n    \n    <div class=\"section\">\n        <div class=\"section-title\">ASSESSMENT</div>\n        <p>{{ assessment }}</p>\n    </div>\n</body>\n</html>\n```\n\nThe template uses Jinja2's `{{ variable_name }}` syntax to insert the values of the generated schema fields into the HTML document.\n\n### Supported Template Types\n\n- HTML \u2192 PDF: Best supported with complete styling control\n- HTML \u2192 HTML: Simple text formatting\n\nMore template formats will be supported in future versions.\n\n## Combined Structured and Unstructured Data\n\nSYDA excels at generating both structured data (tables/databases) and unstructured content (documents) in a coordinated way.\n\nFor working examples, see the [examples/structured_and_unstructured](examples/structured_and_unstructured) directory, which contains retail receipt generation and CRM document examples.\n\n\n### Connecting Documents to Structured Data\n\nYou can create relationships between document schemas and structured data schemas:\n\n```python\nfrom syda.generate import SyntheticDataGenerator\n\ngenerator = SyntheticDataGenerator()\n\n# Define both structured and template-based schemas\nschemas = {\n    'Customer': 'schemas/customer.yml',            # Structured data\n    'Product': 'schemas/product.yml',              # Structured data\n    'Transaction': 'schemas/transaction.yml',      # Structured data\n    'Receipt': 'schemas/receipt.yml'               # Template-based document\n}\n```\n\nHere's what a structured data schema for a `Customer` 
might look like:\n\n```yaml\n# Customer schema (customer.yml)\n__table_name__: Customer\n__description__: Retail customers\n\nid:\n  type: integer\n  description: Unique customer ID\n  constraints:\n    primary_key: true\n    not_null: true\n    min: 1\n\nfirst_name:\n  type: string\n  description: Customer's first name\n  constraints:\n    not_null: true\n    length: 50\n\nlast_name:\n  type: string\n  description: Customer's last name\n  constraints:\n    not_null: true\n    length: 50\n    \nemail:\n  type: email\n  description: Customer's email address\n  constraints:\n    not_null: true\n    unique: true\n    length: 100\n```\n\nAnd here's a template-based document schema for a `Receipt` that references the structured data:\n\n```yaml\n# Receipt template schema (receipt.yml)\n__template__: true\n__description__: Retail receipt template\n__name__: Receipt\n__depends_on__: [Product, Transaction, Customer]\n__foreign_keys__:\n  customer_name: [Customer, first_name]\n  \n__template_source__: templates/receipt.html\n__input_file_type__: html\n__output_file_type__: pdf\n\n# Receipt header\nstore_name:\n  type: string\n  length: 50\n  description: Name of the retail store\n\nstore_address:\n  type: address\n  length: 150\n  description: Full address of the store\n\n# Receipt details\nreceipt_number:\n  type: string\n  pattern: '^RCP-\\d{8}$'\n  length: 12\n  description: Unique receipt identifier\n\n# Product purchase details\nitems:\n  type: array\n  description: \"List of purchased items with product details\"\n```\n\nWith both schemas in place, generate all of the data in one call:\n\n```python\n# Generate everything - maintains relationships between structured and document data\nresults = generator.generate_for_schemas(\n    schemas=schemas,\n    output_dir=\"output\"\n)\n\n# Results include both DataFrames and generated documents\ncustomers_df = results['Customer']\nreceipts_df = results['Receipt']     # Contains metadata about generated documents\n```\n\n### Schema Dependencies for Documents\n\nTemplate schemas can specify dependencies on 
structured schemas:\n\n```yaml\n# Receipt template schema (receipt.yml)\n__template__: true\n__name__: Receipt\n__depends_on__: [Product, Transaction, Customer]\n__foreign_keys__:\n  customer_id: [Customer, id]\n__template_source__: templates/receipt.html\n__input_file_type__: html\n__output_file_type__: pdf\n```\n\nThis ensures that dependent structured data is generated first, and related documents can reference that data.\n\nHere's an example of a receipt HTML template that uses data from both the receipt schema and the related structured data:\n\n```html\n<!DOCTYPE html>\n<html lang=\"en\">\n<head>\n    <meta charset=\"UTF-8\">\n    <title>Receipt</title>\n    <style>\n        body {\n            font-family: 'Courier New', Courier, monospace;\n            font-size: 12px;\n            line-height: 1.3;\n            max-width: 380px;\n            margin: 0 auto;\n            padding: 10px;\n        }\n        .header, .footer {\n            text-align: center;\n            margin-bottom: 10px;\n        }\n        .items-table {\n            width: 100%;\n            margin-bottom: 10px;\n        }\n        .totals {\n            width: 100%;\n            margin-bottom: 10px;\n        }\n    </style>\n</head>\n<body>\n    <div class=\"header\">\n        <div class=\"store-name\">{{ store_name }}</div>\n        <div>{{ store_address }}</div>\n        <div>Tel: {{ store_phone }}</div>\n    </div>\n\n    <div class=\"receipt-details\">\n        <div>\n            <div>Receipt #: {{ receipt_number }}</div>\n            <div>Date: {{ transaction_date }}</div>\n            <div>Time: {{ transaction_time }}</div>\n        </div>\n    </div>\n\n    <div class=\"customer-info\">\n        <div>Customer: {{ customer_name }}</div>\n        <div>Cust ID: {{ customer_id }}</div>\n    </div>\n\n    <!-- This iterates through items array generated by the custom generator -->\n    <table class=\"items-table\">\n        <thead>\n            <tr>\n                <th>Item</th>\n   
             <th>Qty</th>\n                <th>Price</th>\n                <th>Total</th>\n            </tr>\n        </thead>\n        <tbody>\n            {% for item in items %}\n            <tr>\n                <td>{{ item.product_name }}<br><small>SKU: {{ item.sku }}</small></td>\n                <td>{{ item.quantity }}</td>\n                <td>${{ \"%.2f\"|format(item.unit_price) }}</td>\n                <td>${{ \"%.2f\"|format(item.item_total) }}</td>\n            </tr>\n            {% endfor %}\n        </tbody>\n    </table>\n\n    <table class=\"totals\">\n        <tr>\n            <td>Subtotal:</td>\n            <td>${{ \"%.2f\"|format(subtotal) }}</td>\n        </tr>\n        <tr>\n            <td>Tax ({{ \"%.2f\"|format(tax_rate) }}%):</td>\n            <td>${{ \"%.2f\"|format(tax_amount) }}</td>\n        </tr>\n        <tr>\n            <td>TOTAL:</td>\n            <td>${{ \"%.2f\"|format(total) }}</td>\n        </tr>\n    </table>\n\n    <div class=\"payment-info\">\n        <div>Payment Method: {{ payment_method }}</div>\n    </div>\n\n    <div class=\"thank-you\">\n        Thank you for shopping with us!\n    </div>\n</body>\n</html>\n```\n\nNote the use of Jinja2's `{% for item in items %}...{% endfor %}` loop to iterate through the array of items that was generated with our custom generator.\n\n### Custom Generators for Document Data\n\nFor advanced use cases, you can define custom generators to map structured data into document fields:\n\n```python\ndef generate_receipt_items(row, col_name=None, parent_dfs=None):\n    \"\"\"Generate receipt line items based on transaction and product data.\"\"\"\n    items = []\n    if parent_dfs and 'Product' in parent_dfs and 'Transaction' in parent_dfs:\n        products_df = parent_dfs['Product']\n        transactions_df = parent_dfs['Transaction']\n        \n        # Find transactions for this customer\n        customer_transactions = transactions_df[transactions_df['customer_id'] == 
row['customer_id']]\n        \n        # Add products from transactions to receipt\n        for _, tx in customer_transactions.iterrows():\n            matches = products_df[products_df['id'] == tx['product_id']]\n            if matches.empty:\n                continue  # Skip transactions whose product is missing\n            product = matches.iloc[0]\n            items.append({\n                \"product_name\": product['name'],\n                \"quantity\": tx['quantity'],\n                \"unit_price\": product['price'],\n                \"item_total\": tx['quantity'] * product['price']\n            })\n    return items\n\n# Register the custom generator\ngenerator.register_generator('array', generate_receipt_items, column_name='items')\n```\n\nThe `parent_dfs` parameter gives access to all previously generated structured data, allowing you to create rich, interconnected documents.\n\n\n## SQLAlchemy Models with Templates\n\nYou can also use SQLAlchemy models to define both your structured data schema and template-based documents. This approach suits applications that already use the SQLAlchemy ORM:\n\n```python\nfrom sqlalchemy import Column, Integer, String, Float, ForeignKey, Text\nfrom sqlalchemy.orm import declarative_base, relationship  # SQLAlchemy 2.0 import path\nfrom syda.templates import SydaTemplate\n\nBase = declarative_base()\n\n# Regular structured SQLAlchemy model\nclass Customer(Base):\n    __tablename__ = 'customers'\n    \n    id = Column(Integer, primary_key=True)\n    name = Column(String(100), nullable=False)\n    industry = Column(String(50))\n    annual_revenue = Column(Float)\n    website = Column(String(100))\n    \n    # Relationships\n    opportunities = relationship(\"Opportunity\", back_populates=\"customer\")\n\n# Another structured model\nclass Opportunity(Base):\n    __tablename__ = 'opportunities'\n    \n    id = Column(Integer, primary_key=True)\n    customer_id = Column(Integer, ForeignKey('customers.id'), nullable=False)\n    name = Column(String(100), nullable=False)\n    value = Column(Float, nullable=False)\n    description = 
Column(Text)\n    \n    # Relationships\n    customer = relationship(\"Customer\", back_populates=\"opportunities\")\n\n# Template model\nclass ProposalDocument(Base):\n    __tablename__ = 'proposal_documents'\n    \n    # Special template attributes\n    __template__ = True\n    __depends_on__ = ['Opportunity']  # This template depends on the Opportunity model\n    \n    # Template source configuration\n    __template_source__ = 'templates/proposal.html'\n    __input_file_type__ = 'html'\n    __output_file_type__ = 'pdf'\n    \n    # Fields needed for the template (these become columns in the generated data)\n    id = Column(Integer, primary_key=True)\n    opportunity_id = Column(Integer, ForeignKey('opportunities.id'), nullable=False)\n    title = Column(String(200))\n    customer_name = Column(String(100), ForeignKey('customers.name'))\n    opportunity_value = Column(Float, ForeignKey('opportunities.value'))\n    proposed_solutions = Column(Text)\n```\n\nThen generate all data in one call:\n\n```python\nfrom syda.generate import SyntheticDataGenerator\nfrom syda.schemas import ModelConfig\n\n# Initialize generator\nconfig = ModelConfig(provider=\"anthropic\", model_name=\"claude-3-5-haiku-20241022\")\ngenerator = SyntheticDataGenerator(model_config=config)\n\n# Generate all data at once\nresults = generator.generate_for_sqlalchemy_models(\n    sqlalchemy_models=[Customer, Opportunity, ProposalDocument],\n    sample_sizes={'customers': 5, 'opportunities': 8, 'proposal_documents': 3},\n    output_dir=\"output\"\n)\n```\n\nThe example above demonstrates:\n1. Regular SQLAlchemy models for structured data (Customer, Opportunity)\n2. A template model (ProposalDocument)\n3. Foreign key relationships between the template and structured models\n4. 
Generating everything together with `generate_for_sqlalchemy_models`\n\n\n## Model Selection and Configuration\n\nSyda currently supports two AI providers: OpenAI and Anthropic (Claude).\n\n\n\n### Basic Configuration\n\nConfigure provider, model, temperature, tokens, and proxy settings using the `ModelConfig` class:\n\n```python\nfrom syda.schemas import ModelConfig, ProxyConfig\n\n# Create a model configuration\nconfig = ModelConfig(\n    provider='openai',  # Choose from: 'openai', 'anthropic', etc.\n    model_name='gpt-4-turbo',  # Model name for the selected provider\n    temperature=0.7,    # Controls randomness (0.0-1.0)\n    seed=42,            # For reproducible outputs (provider-specific)\n    max_tokens=4000,    # Maximum response length (default: 4000)\n    proxy=ProxyConfig(  # Optional proxy configuration\n        base_url='https://ai-proxy.company.com/v1',\n        headers={'X-Company-Auth':'internal-token'},\n        params={'team':'data-science'}\n    )\n)\n\n# Initialize generator with the configuration\ngenerator = SyntheticDataGenerator(model_config=config)\n```\n\n### Using Different Model Providers\n\nThe library currently supports OpenAI and Anthropic (Claude) models and allows you to easily switch between these providers while maintaining a consistent interface.\n\n#### OpenAI Models\n\n```python\n# Default configuration - uses OpenAI's GPT-4 if no model_config provided\ndefault_generator = SyntheticDataGenerator()\n\n# Explicitly configure for GPT-3.5 Turbo (faster and more cost-effective)\nopenai_config = ModelConfig(\n    provider='openai',\n    model_name='gpt-3.5-turbo',  # You can also use 'gpt-3.5-turbo-1106' for better JSON handling\n    temperature=0.7,\n    response_format={\"type\": \"json_object\"}  # Forces JSON response format (GPT models)\n)\ngpt35_generator = SyntheticDataGenerator(model_config=openai_config)\n\n# Generate data with specific model configuration\ndata = gpt35_generator.generate_data(\n    schema={'product_id': 
'number', 'product_name': 'text', 'price': 'number'},\n    prompt=\"Generate electronic product data with prices between $500-$2000\",\n    sample_size=10\n)\n```\n\n#### Anthropic Claude Models\n\n```python\n# Configure for Claude (requires ANTHROPIC_API_KEY environment variable)\nclaude_config = ModelConfig(\n    provider='anthropic',\n    model_name='claude-3-sonnet-20240229',  # Available models: claude-3-opus, claude-3-sonnet, claude-3-haiku\n    temperature=0.7,\n    max_tokens=2000  # Claude can sometimes need more tokens for structured output\n)\nclaude_generator = SyntheticDataGenerator(model_config=claude_config)\n\n# Generate data with Claude\ndata = claude_generator.generate_data(\n    schema={'product_id': 'number', 'product_name': 'text', 'price': 'number', 'description': 'text'},\n    prompt=\"Generate luxury product data with realistic prices over $1000\",\n    sample_size=5\n)\n```\n\n#### Maximum Tokens Parameter\n\nThe library now uses a default of 4000 tokens for `max_tokens` to ensure complete responses with all expected columns. This helps prevent incomplete data generation issues.\n\n```python\n# Override the default max_tokens setting\nconfig = ModelConfig(\n    provider=\"openai\",\n    model_name=\"gpt-4\",\n    max_tokens=8000,  # Increase for very complex schemas or large sample sizes\n    temperature=0.7\n)\n```\n\nWhen generating complex data or data with many columns, consider increasing this value if you notice missing columns in your generated data.\n\n#### Provider-Specific Optimizations\n\nEach AI provider has different strengths and parameter requirements. 
The library automatically handles most of the differences, but you can optimize for specific providers:\n\n```python\n# OpenAI-specific optimization\nopenai_optimized = ModelConfig(\n    provider='openai',\n    model_name='gpt-4-turbo',\n    temperature=0.7,\n    response_format={\"type\": \"json_object\"},  # Only works with OpenAI\n    seed=42  # For reproducible outputs\n)\n\n# Anthropic-specific optimization\nanthropic_optimized = ModelConfig(\n    provider='anthropic',\n    model_name='claude-3-opus-20240229',\n    temperature=0.7,\n    system=\"You are a synthetic data generator that creates realistic, high-quality datasets based on the provided schema.\"  # System prompt works best with Anthropic\n)\n```\n\n### Advanced: Direct Access to LLM Client\n\nFor advanced use cases, you can access the underlying LLM client directly for additional control:\n\n```python\nfrom syda.llm import create_llm_client\n\n# Create a standalone LLM client\nllm_client = create_llm_client(\n    model_config=ModelConfig(\n        provider='anthropic', \n        model_name='claude-3-opus-20240229'\n    ),\n    # API key is optional if set in environment variables\n    anthropic_api_key=\"your_api_key\"  \n)\n\n# Define a Pydantic model for structured output\nfrom pydantic import BaseModel\nfrom typing import List\n\nclass Book(BaseModel):\n    title: str\n    author: str\n    year: int\n    genre: str\n    pages: int\n\nclass BookCollection(BaseModel):\n    books: List[Book]\n\n# Use the client for structured responses\nbooks = llm_client.client.chat.completions.create(\n    model=\"claude-3-opus-20240229\",\n    response_model=BookCollection,  # Automatically parses the response to this model\n    messages=[{\"role\": \"user\", \"content\": \"Generate 5 fictional sci-fi books.\"}]\n)\n\n# Access the structured data directly\nfor book in books.books:\n    print(f\"{book.title} by {book.author} ({book.year}) - {book.pages} pages\")\n```\n\nThis approach gives you direct control over 
the client while still providing structured data extraction capabilities.\n\n## Output Options\n\nSyda offers flexible output options to suit different use cases:\n\n### Multiple Schema Generation\n\nWhen generating data for multiple schemas using `generate_for_schemas` or `generate_for_sqlalchemy_models`, you can specify an output directory and format:\n\n```python\n# Generate and save data to CSV files (default)\nresults = generator.generate_for_schemas(\n    schemas=schemas,\n    output_dir=\"output_directory\",\n    output_format=\"csv\"  # Default format\n)\n\n# Generate and save data to JSON files\nresults = generator.generate_for_schemas(\n    schemas=schemas,\n    output_dir=\"output_directory\",\n    output_format=\"json\"\n)\n```\n\nEach schema will be saved to a separate file with the schema name as the filename. For example:\n\n* CSV format: `output_directory/customer.csv`, `output_directory/order.csv`, etc.\n* JSON format: `output_directory/customer.json`, `output_directory/order.json`, etc.\n\nThe `results` dictionary will still contain all generated DataFrames, so you can both save to files and work with the data directly in your code.\n\n\n## Configuration and Error Handling\n\n### API Keys Management\n\nYou can provide appropriate API keys based on the provider you're using. There are two recommended ways to manage API keys:\n\n#### 1. Environment Variables (Recommended)\n\nSet API keys via environment variables:\n\n```bash\n# For OpenAI models\nexport OPENAI_API_KEY=your_openai_key\n\n# For Anthropic models\nexport ANTHROPIC_API_KEY=your_anthropic_key\n\n# For other providers, set the appropriate environment variables\n```\n\nYou can also use a `.env` file in your project root and load it with:\n\n```python\nfrom dotenv import load_dotenv\nload_dotenv()  # This loads API keys from .env file\n```\n\n#### 2. 
Direct Initialization\n\nProvide API keys when initializing the generator:\n\n```python\n# With explicit model configuration\ngenerator = SyntheticDataGenerator(\n    model_config=ModelConfig(provider='openai', model_name='gpt-4'),\n    openai_api_key=\"your_openai_key\",      # Only needed for OpenAI models\n    anthropic_api_key=\"your_anthropic_key\"  # Only needed for Anthropic models\n)\n```\n\n\n### Error Handling\n\nSyda's error handling has been improved to provide more useful feedback when data generation fails. The library now:\n\n1. **Raises Explicit Exceptions**: When data generation fails rather than returning random data\n2. **Provides Detailed Error Messages**: Explaining what went wrong and potential fixes\n3. **Validates Output Structure**: Ensures generated data matches the expected schema\n\nExample error handling:\n\n```python\ntry:\n    data = generator.generate_data(\n        schema=YourModel,\n        prompt=\"Generate synthetic data...\",\n        sample_size=10\n    )\n    # Process the data...\nexcept ValueError as e:\n    print(f\"Data generation failed: {str(e)}\")\n    # Implement fallback strategy or retry with different parameters\n```\n\n## Contributing\n\n1. Fork the repository.\n2. Create a feature branch.\n3. Commit your changes.\n4. Push to your branch.\n5. Open a Pull Request.\n\n## License\n\nSee [LICENSE](LICENSE) for details.\n",
    "bugtrack_url": null,
    "license": "LGPL-3.0-or-later",
    "summary": "A Python library for AI-powered synthetic data generation with referential integrity",
    "version": "0.0.1",
    "project_urls": {
        "Changelog": "https://github.com/syda-ai/syda/blob/main/CHANGELOG.md",
        "Documentation": "https://python.syda.ai",
        "Homepage": "https://github.com/syda-ai/syda",
        "Issues": "https://github.com/syda-ai/syda/issues",
        "Repository": "https://github.com/syda-ai/syda.git"
    },
    "split_keywords": [
        "synthetic data",
        " ai",
        " machine learning",
        " data generation",
        " testing",
        " privacy",
        " sqlalchemy",
        " openai",
        " anthropic",
        " claude",
        " gpt"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "a5928cbaa2017cac233decd154541c40501e4e5e01026d5d91dfb3a14ce701d7",
                "md5": "9052607ff4a482147598dcf06dc88bb2",
                "sha256": "891af8ad05869b175bc4d1f08d317d089c88759ebcc3368f5f83d921119ee05b"
            },
            "downloads": -1,
            "filename": "syda-0.0.1-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "9052607ff4a482147598dcf06dc88bb2",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.8",
            "size": 52451,
            "upload_time": "2025-08-12T02:24:07",
            "upload_time_iso_8601": "2025-08-12T02:24:07.941308Z",
            "url": "https://files.pythonhosted.org/packages/a5/92/8cbaa2017cac233decd154541c40501e4e5e01026d5d91dfb3a14ce701d7/syda-0.0.1-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "aa10a19b5a3a3a14dfd08702eabb7bdefba3f8bdd438640da0da09bbe35c2895",
                "md5": "0d40ef3154013a29e3c1c4c28956ec98",
                "sha256": "1442422c4820e9fdd298976537cb040589512b53ae61324db30187ff6395806e"
            },
            "downloads": -1,
            "filename": "syda-0.0.1.tar.gz",
            "has_sig": false,
            "md5_digest": "0d40ef3154013a29e3c1c4c28956ec98",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.8",
            "size": 167825,
            "upload_time": "2025-08-12T02:24:09",
            "upload_time_iso_8601": "2025-08-12T02:24:09.433416Z",
            "url": "https://files.pythonhosted.org/packages/aa/10/a19b5a3a3a14dfd08702eabb7bdefba3f8bdd438640da0da09bbe35c2895/syda-0.0.1.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-08-12 02:24:09",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "syda-ai",
    "github_project": "syda",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "requirements": [
        {
            "name": "pydantic",
            "specs": [
                [
                    ">=",
                    "2.4.2"
                ]
            ]
        },
        {
            "name": "python-dotenv",
            "specs": [
                [
                    ">=",
                    "1.0.0"
                ]
            ]
        },
        {
            "name": "sqlalchemy",
            "specs": [
                [
                    ">=",
                    "2.0.23"
                ]
            ]
        },
        {
            "name": "pandas",
            "specs": [
                [
                    ">=",
                    "2.0.3"
                ]
            ]
        },
        {
            "name": "numpy",
            "specs": [
                [
                    ">=",
                    "1.24.3"
                ]
            ]
        },
        {
            "name": "networkx",
            "specs": [
                [
                    ">=",
                    "3.1"
                ]
            ]
        },
        {
            "name": "openai",
            "specs": [
                [
                    ">=",
                    "1.0.0"
                ]
            ]
        },
        {
            "name": "anthropic",
            "specs": [
                [
                    ">=",
                    "0.7.0"
                ]
            ]
        },
        {
            "name": "instructor",
            "specs": [
                [
                    ">=",
                    "1.0.0"
                ]
            ]
        },
        {
            "name": "python-magic",
            "specs": [
                [
                    ">=",
                    "0.4.27"
                ]
            ]
        },
        {
            "name": "python-docx",
            "specs": [
                [
                    ">=",
                    "1.0.0"
                ]
            ]
        },
        {
            "name": "openpyxl",
            "specs": [
                [
                    ">=",
                    "3.1.2"
                ]
            ]
        },
        {
            "name": "weasyprint",
            "specs": [
                [
                    ">=",
                    "65.1"
                ]
            ]
        },
        {
            "name": "pyyaml",
            "specs": [
                [
                    ">=",
                    "6.0.1"
                ]
            ]
        },
        {
            "name": "pytest",
            "specs": [
                [
                    ">=",
                    "7.4.0"
                ]
            ]
        },
        {
            "name": "boto3",
            "specs": [
                [
                    ">=",
                    "1.28.0"
                ]
            ]
        },
        {
            "name": "azure-storage-blob",
            "specs": [
                [
                    ">=",
                    "12.19.0"
                ]
            ]
        },
        {
            "name": "pdfplumber",
            "specs": [
                [
                    ">=",
                    "0.10.3"
                ]
            ]
        },
        {
            "name": "pillow",
            "specs": [
                [
                    ">=",
                    "10.0.1"
                ]
            ]
        },
        {
            "name": "pytesseract",
            "specs": [
                [
                    ">=",
                    "0.3.10"
                ]
            ]
        },
        {
            "name": "sqlalchemy-utils",
            "specs": [
                [
                    ">=",
                    "0.41.1"
                ]
            ]
        },
        {
            "name": "mkdocs-material",
            "specs": [
                [
                    ">=",
                    "9.6.15"
                ]
            ]
        },
        {
            "name": "mkdocs",
            "specs": [
                [
                    ">=",
                    "1.6.1"
                ]
            ]
        },
        {
            "name": "mkdocs-macros-plugin",
            "specs": [
                [
                    ">=",
                    "1.3.7"
                ]
            ]
        }
    ],
    "lcname": "syda"
}
        