# Synthetic Data Generation Library
A Python-based open-source library for generating synthetic data with AI while preserving referential integrity. It works with OpenAI, Anthropic (Claude), and other AI models through a consistent interface.
## Table of Contents
* [Features](#features)
* [Installation](#installation)
* [Quick Start](#quick-start)
* [Core API](#core-api)
* [Structured Data Generation](#structured-data-generation)
* [SQLAlchemy Model Integration](#sqlalchemy-model-integration)
* [Handling Foreign Key Relationships](#handling-foreign-key-relationships)
* [Multiple Schema Definition Formats](#multiple-schema-definition-formats)
* [SQLAlchemy Models](#1-sqlalchemy-models)
* [YAML Schema Files](#2-yaml-schema-files)
* [JSON Schema Files](#3-json-schema-files)
* [Dictionary-Based Schemas](#4-dictionary-based-schemas)
* [Foreign Key Definition Methods](#foreign-key-definition-methods)
* [Automatic Management of Multiple Related Models](#automatic-management-of-multiple-related-models)
* [Using SQLAlchemy Models](#using-sqlalchemy-models)
* [Using YAML Schema Files](#using-yaml-schema-files)
* [Using JSON Schema Files](#using-json-schema-files)
* [Using Dictionary-Based Schemas](#using-dictionary-based-schemas)
* [Complete CRM Example](#complete-crm-example)
* [Metadata Enhancement Benefits with SQLAlchemy Models](#metadata-enhancement-benefits-with-sqlalchemy-models)
* [Custom Generators for Domain-Specific Data](#custom-generators-for-domain-specific-data)
* [Unstructured Document Generation](#unstructured-document-generation)
* [Template-Based Document Generation](#template-based-document-generation)
* [Template Schema Requirements](#template-schema-requirements)
* [Supported Template Types](#supported-template-types)
* [Combined Structured and Unstructured Data](#combined-structured-and-unstructured-data)
* [Connecting Documents to Structured Data](#connecting-documents-to-structured-data)
* [Schema Dependencies for Documents](#schema-dependencies-for-documents)
* [Custom Generators for Document Data](#custom-generators-for-document-data)
* [SQLAlchemy Models with Templates](#sqlalchemy-models-with-templates)
* [Model Selection and Configuration](#model-selection-and-configuration)
* [Basic Configuration](#basic-configuration)
* [Using Different Model Providers](#using-different-model-providers)
* [OpenAI Models](#openai-models)
* [Anthropic Claude Models](#anthropic-claude-models)
* [Maximum Tokens Parameter](#maximum-tokens-parameter)
* [Provider-Specific Optimizations](#provider-specific-optimizations)
* [Advanced: Direct Access to LLM Client](#advanced-direct-access-to-llm-client)
* [Output Options](#output-options)
* [Configuration and Error Handling](#configuration-and-error-handling)
* [API Keys Management](#api-keys-management)
* [Environment Variables (Recommended)](#1-environment-variables-recommended)
* [Direct Initialization](#2-direct-initialization)
* [Error Handling](#error-handling)
* [Contributing](#contributing)
* [License](#license)
## Features
* **Multi-Provider AI Integration**:
  * Seamless integration with multiple AI providers
  * Support for OpenAI (GPT) and Anthropic (Claude)
  * Default model: Anthropic Claude `claude-3-5-haiku-20241022`
  * Consistent interface across different providers
  * Provider-specific parameter optimization
* **LLM-based Data Generation**:
  * AI-powered schema understanding and data creation
  * Contextually-aware synthetic records
  * Natural language prompt customization
  * Intelligent schema inference
* **SQLAlchemy Integration**:
  * Automatic extraction of model metadata, docstrings and constraints
  * Intelligent column-specific data generation
  * Parameter naming consistency with `sqlalchemy_models`
* **Multiple Schema Formats**:
  * SQLAlchemy model integration with automatic metadata extraction
  * YAML/JSON schema file support with full foreign key relationship handling
  * Python dictionary-based schema definitions
* **Referential Integrity**:
  * Automatic foreign key detection and resolution
  * Multi-model dependency analysis through topological sorting
  * Robust handling of related data with referential constraints
* **Custom Generators**:
  * Register column- or type-specific functions for domain-specific data
  * Contextual generators that adapt to other fields (like ICD-10 codes based on demographics)
  * Weighted distributions for realistic data patterns
## Installation
Install the package using pip:
```bash
pip install syda
```
## Quick Start
```python
from syda.structured import SyntheticDataGenerator
from syda.schemas import ModelConfig
model_config = ModelConfig(
provider="anthropic",
model_name="claude-3-5-haiku-20241022",
temperature=0.7,
max_tokens=8192 # Larger value for more complete responses
)
generator = SyntheticDataGenerator(model_config=model_config)
# Define schema for a single table
schemas = {
'Patient': {
'patient_id': 'number',
'diagnosis_code': 'icd10_code',
'email': 'email',
'visit_date': 'date',
'notes': 'text'
}
}
prompt = "Generate realistic synthetic patient records with ICD-10 diagnosis codes, emails, visit dates, and clinical notes."
# Generate and save to CSV
results = generator.generate_for_schemas(
schemas=schemas,
prompts={'Patient': prompt},
sample_sizes={'Patient': 15},
output_dir='synthetic_output'
)
print("Data saved to synthetic_output/Patient.csv")
```
## Core API
### Structured Data Generation
Use simple schema maps or SQLAlchemy models to generate data:
```python
from syda.structured import SyntheticDataGenerator
from syda.schemas import ModelConfig
model_config = ModelConfig(provider='anthropic', model_name='claude-3-5-haiku-20241022')
generator = SyntheticDataGenerator(model_config=model_config)
# Simple dict schema
schemas = {
'User': {'id': 'number', 'name': 'text'}
}
results = generator.generate_for_schemas(
schemas=schemas,
prompts={'User': 'Generate user records'},
sample_sizes={'User': 10}
)
```
### SQLAlchemy Model Integration
Pass declarative models directly—docstrings and column metadata inform the prompt:
```python
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy import Column, Integer, String
from syda.structured import SyntheticDataGenerator
from syda.schemas import ModelConfig
Base = declarative_base()

class User(Base):
    __tablename__ = 'users'
    id = Column(Integer, primary_key=True)
    name = Column(String, comment="Full name of the user")

model_config = ModelConfig(provider='anthropic', model_name='claude-3-5-haiku-20241022')
generator = SyntheticDataGenerator(model_config=model_config)
results = generator.generate_for_sqlalchemy_models(
    sqlalchemy_models=[User],
    prompts={'users': 'Generate users'},
    sample_sizes={'users': 5}
)
```
> **Important:** SQLAlchemy models **must** have either `__table__` or `__tablename__` specified. Without one of these attributes, the model cannot be properly processed by the system. The `__tablename__` attribute defines the name of the database table and is used as the schema name when generating data. For example, a model with `__tablename__ = 'users'` will be referenced as 'users' in prompts, sample_sizes, custom generators and the returned results dictionary.
### Handling Foreign Key Relationships
The library provides robust support for handling foreign key relationships with referential integrity:
1. **Automatic Foreign Key Detection**: Foreign keys are automatically detected from YAML, JSON, dictionary, and SQLAlchemy schema definitions and assigned the type `'foreign_key'`.
2. **Manual Column-Specific Foreign Key Generators**: You can also register a foreign key generator for a specific column, as in the following snippet:
```python
import random

# After generating departments and loading them into departments_df:
def department_id_fk_generator(row, col_name):
    return random.choice(departments_df['id'].tolist())

generator.register_generator('foreign_key', department_id_fk_generator, column_name='department_id')
```
3. **Multi-Step Generation Process**: For related tables, generate parent records first, then use their IDs when generating child records:
```python
# Generate departments first, then employees with valid department_id references
results = generator.generate_for_sqlalchemy_models(
sqlalchemy_models=[Department, Employee],
prompts={
'departments': 'Generate company departments',
'employees': 'Generate realistic employee data'
},
sample_sizes={
'departments': 5,
'employees': 10
}
)
# Access the generated dataframes
departments_df = results['departments']
employees_df = results['employees']
```
4. **Referential Integrity Preservation**: The foreign key generator samples from actual existing IDs in the parent table, ensuring all references are valid.
5. **Metadata-Enhanced Foreign Keys**: Column comments on foreign key fields are preserved and included in the prompt, helping the LLM understand the relationship context.
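The sampling idea in point 4 can be sketched without the library. The factory below is a hypothetical helper, not part of syda; the only thing it takes from the snippet above is the `(row, col_name)` generator signature. It refuses to build a generator when the parent table is empty, since no valid reference could be produced.

```python
import random


def make_fk_generator(parent_ids, seed=None):
    """Build a generator that samples from existing parent IDs.

    Hypothetical helper for illustration; only the (row, col_name)
    signature is taken from the register_generator snippet above.
    """
    ids = list(parent_ids)
    if not ids:
        raise ValueError("parent table has no IDs to reference")
    rng = random.Random(seed)  # seedable for reproducible runs

    def generate(row, col_name):
        # Every returned value is an ID that exists in the parent table.
        return rng.choice(ids)

    return generate


department_ids = [1, 2, 3, 4, 5]
department_id_fk = make_fk_generator(department_ids, seed=42)
employee_rows = [
    {"department_id": department_id_fk({}, "department_id")}
    for _ in range(10)
]
```

A function built this way could then be passed to `generator.register_generator('foreign_key', department_id_fk, column_name='department_id')`, exactly as in the manual snippet above.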
### Multiple Schema Definition Formats
> **Note:** For detailed information on supported field types and schema format, see the [Schema Reference](schema_reference.md) document.
Syda supports defining your data models in multiple formats, all leading to the same synthetic data generation capabilities. Choose the format that best suits your workflow:
#### 1. SQLAlchemy Models
```python
from sqlalchemy import Column, Integer, String, ForeignKey, Float, Date
from sqlalchemy.ext.declarative import declarative_base
Base = declarative_base()

class Customer(Base):
    __tablename__ = 'customers'
    __doc__ = """Customer organization that places orders"""
    id = Column(Integer, primary_key=True)
    name = Column(String(100), nullable=False, comment="Company name")
    status = Column(String(20), comment="Customer status (Active/Inactive/Prospect)")

class Order(Base):
    __tablename__ = 'orders'
    __doc__ = """Customer order for products or services"""
    id = Column(Integer, primary_key=True)
    customer_id = Column(Integer, ForeignKey('customers.id'), nullable=False)
    order_date = Column(Date, nullable=False, comment="Date when order was placed")
    total_amount = Column(Float, comment="Total monetary value of the order in USD")

# Generate data from SQLAlchemy models
results = generator.generate_for_sqlalchemy_models(
    sqlalchemy_models=[Customer, Order],
    prompts={"customers": "Generate tech companies"},
    sample_sizes={"customers": 10, "orders": 30}
)
```
#### 2. YAML Schema Files
```yaml
# customer.yaml
__table_description__: Customer organization that places orders
id:
  type: number
  primary_key: true
name:
  type: text
  max_length: 100
  not_null: true
  description: Company name
status:
  type: text
  max_length: 20
  description: Customer status (Active/Inactive/Prospect)
```
```yaml
# order.yaml
__table_description__: Customer order for products or services
__foreign_keys__:
  customer_id: [Customer, id]
id:
  type: number
  primary_key: true
customer_id:
  type: foreign_key
  not_null: true
  description: Reference to the customer who placed the order
order_date:
  type: date
  not_null: true
  description: Date when order was placed
total_amount:
  type: number
  description: Total monetary value of the order in USD
```
```python
# Generate data from YAML schema files
results = generator.generate_for_schemas(
schemas={
'Customer': 'schemas/customer.yaml',
'Order': 'schemas/order.yaml'
},
prompts={'Customer': 'Generate tech companies'},
sample_sizes={'Customer': 10, 'Order': 30}
)
```
#### 3. JSON Schema Files
```json
// customer.json
{
"__table_description__": "Customer organization that places orders",
"id": {
"type": "number",
"primary_key": true
},
"name": {
"type": "text",
"max_length": 100,
"not_null": true,
"description": "Company name"
},
"status": {
"type": "text",
"max_length": 20,
"description": "Customer status (Active/Inactive/Prospect)"
}
}
```
```json
// order.json
{
"__table_description__": "Customer order for products or services",
"__foreign_keys__": {
"customer_id": ["Customer", "id"]
},
"id": {
"type": "number",
"primary_key": true
},
"customer_id": {
"type": "foreign_key",
"not_null": true,
"description": "Reference to the customer who placed the order"
},
"order_date": {
"type": "date",
"not_null": true,
"description": "Date when order was placed"
},
"total_amount": {
"type": "number",
"description": "Total monetary value of the order in USD"
}
}
```
```python
# Generate data from JSON schema files
results = generator.generate_for_schemas(
schemas={
'Customer': 'schemas/customer.json',
'Order': 'schemas/order.json'
},
prompts={'Customer': 'Generate tech companies'},
sample_sizes={'Customer': 10, 'Order': 30}
)
```
#### 4. Dictionary-Based Schemas
```python
# Define schemas directly as dictionaries
schemas = {
'Customer': {
'__table_description__': 'Customer organization that places orders',
'id': {'type': 'number', 'primary_key': True},
'name': {
'type': 'text',
'max_length': 100,
'not_null': True,
'description': 'Company name'
},
'status': {
'type': 'text',
'max_length': 20,
'description': 'Customer status (Active/Inactive/Prospect)'
}
},
'Order': {
'__table_description__': 'Customer order for products or services',
'__foreign_keys__': {
'customer_id': ['Customer', 'id']
},
'id': {'type': 'number', 'primary_key': True},
'customer_id': {
'type': 'foreign_key',
'not_null': True,
'description': 'Reference to the customer who placed the order'
},
'order_date': {
'type': 'date',
'not_null': True,
'description': 'Date when order was placed'
},
'total_amount': {
'type': 'number',
'description': 'Total monetary value of the order in USD'
}
}
}
# Generate data from dictionary schemas
results = generator.generate_for_schemas(
schemas=schemas,
prompts={'Customer': 'Generate tech companies'},
sample_sizes={'Customer': 10, 'Order': 30}
)
```
#### Foreign Key Definition Methods
There are three ways to define foreign key relationships:
1. Using the `__foreign_keys__` special section in a schema:
```python
"__foreign_keys__": {
"customer_id": ["Customer", "id"]
}
```
2. Using field-level references with type and references properties:
```python
"order_id": {
"type": "foreign_key",
"references": {
"schema": "Order",
"field": "id"
}
}
```
3. Using type-based detection with naming conventions:
```python
"customer_id": "foreign_key"
```
(The system will attempt to infer the relationship based on naming conventions)
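To make the three forms concrete, here is a minimal sketch that reads all of them from one dictionary schema and normalizes them into a single mapping. `collect_foreign_keys` is a hypothetical helper written for this illustration and says nothing about how the library resolves keys internally. For the third form it leaves the target as `None`, because the naming-convention inference rules are not described here.

```python
def collect_foreign_keys(schema):
    """Map each foreign key column to (parent_schema, parent_field).

    Hypothetical helper. Columns whose target is not stated explicitly
    map to None (the library would have to infer them).
    """
    foreign_keys = {}

    # Method 1: the __foreign_keys__ special section.
    for column, (parent, field) in schema.get("__foreign_keys__", {}).items():
        foreign_keys[column] = (parent, field)

    for column, spec in schema.items():
        if column.startswith("__") or column in foreign_keys:
            continue
        if isinstance(spec, dict) and spec.get("type") == "foreign_key":
            # Method 2: field-level "references" property.
            ref = spec.get("references")
            foreign_keys[column] = (ref["schema"], ref["field"]) if ref else None
        elif spec == "foreign_key":
            # Method 3: bare type string, target left unresolved.
            foreign_keys[column] = None

    return foreign_keys


order_item_schema = {
    "__foreign_keys__": {"customer_id": ["Customer", "id"]},
    "id": {"type": "number", "primary_key": True},
    "customer_id": {"type": "foreign_key"},
    "order_id": {
        "type": "foreign_key",
        "references": {"schema": "Order", "field": "id"},
    },
    "product_id": "foreign_key",
}
foreign_keys = collect_foreign_keys(order_item_schema)
```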
### Automatic Management of Multiple Related Models
#### Using SQLAlchemy Models
Simplify multi-table workflows with `generate_for_sqlalchemy_models`:
```python
from sqlalchemy import Column, Integer, String, Float, ForeignKey, Date, Text
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import relationship
from datetime import datetime, timedelta
import random
from syda.generate import SyntheticDataGenerator
Base = declarative_base()

# Customer model
class Customer(Base):
    __tablename__ = 'customers'
    id = Column(Integer, primary_key=True)
    name = Column(String(100), nullable=False)
    industry = Column(String(50))
    status = Column(String(20))
    contacts = relationship("Contact", back_populates="customer")
    orders = relationship("Order", back_populates="customer")

# Contact model with foreign key to Customer
class Contact(Base):
    __tablename__ = 'contacts'
    id = Column(Integer, primary_key=True)
    customer_id = Column(Integer, ForeignKey('customers.id'), nullable=False)
    name = Column(String(100), nullable=False)
    email = Column(String(120), nullable=False)
    phone = Column(String(20))
    customer = relationship("Customer", back_populates="contacts")

# Product model
class Product(Base):
    __tablename__ = 'products'
    id = Column(Integer, primary_key=True)
    name = Column(String(100), nullable=False)
    # Column targeted by the "category" custom generator below
    category = Column(String(50))
    description = Column(Text)
    price = Column(Float, nullable=False)
    order_items = relationship("OrderItem", back_populates="product")

# Order model with foreign key to Customer
class Order(Base):
    __tablename__ = 'orders'
    id = Column(Integer, primary_key=True)
    customer_id = Column(Integer, ForeignKey('customers.id'), nullable=False)
    order_date = Column(Date, nullable=False)
    total_amount = Column(Float)
    customer = relationship("Customer", back_populates="orders")
    order_items = relationship("OrderItem", back_populates="order")

# OrderItem model with foreign keys to Order and Product
class OrderItem(Base):
    __tablename__ = 'order_items'
    id = Column(Integer, primary_key=True)
    order_id = Column(Integer, ForeignKey('orders.id'), nullable=False)
    product_id = Column(Integer, ForeignKey('products.id'), nullable=False)
    quantity = Column(Integer, nullable=False)
    price = Column(Float, nullable=False)
    order = relationship("Order", back_populates="order_items")
    product = relationship("Product", back_populates="order_items")
# Initialize generator
generator = SyntheticDataGenerator()
# Generate data for all models in one call
results = generator.generate_for_sqlalchemy_models(
    sqlalchemy_models=[Customer, Contact, Product, Order, OrderItem],
    prompts={
        "customers": "Generate diverse customer organizations for a B2B SaaS company.",
        "products": "Generate cloud software products and services."
    },
    sample_sizes={
        "customers": 10,
        "contacts": 25,
        "products": 15,
        "orders": 30,
        "order_items": 60
    },
    custom_generators={
        "customers": {
            # Restrict customer statuses to a fixed set for business reporting
            "status": lambda row, col: random.choice(["Active", "Inactive", "Prospect"]),
        },
        "products": {
            # Ensure product categories match your specific business domains
            "category": lambda row, col: random.choice([
                "Cloud Infrastructure", "Business Intelligence", "Security Services",
                "Data Analytics", "Custom Development", "Support Package", "API Services"
            ])
        },
    }
)
```
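The `status` generator above uses `random.choice`, which picks every value with equal probability. When a skewed distribution is wanted, a weighted generator is one option. The sketch below is an illustration under assumptions: `weighted_choice_generator` is a hypothetical helper, the weights are made up, and only the `(row, col)` signature is taken from the lambdas above.

```python
import random


def weighted_choice_generator(weights, seed=None):
    """Build a (row, col) generator that samples values by weight.

    `weights` maps each allowed value to its relative weight.
    Hypothetical helper for illustration, not a syda function.
    """
    values = list(weights)
    relative_weights = [weights[value] for value in values]
    rng = random.Random(seed)  # seedable for reproducible runs

    def generate(row, col):
        # random.choices returns a list; k=1 yields a single value.
        return rng.choices(values, weights=relative_weights, k=1)[0]

    return generate


# Illustrative weights: mostly active customers, few prospects.
status_generator = weighted_choice_generator(
    {"Active": 0.7, "Inactive": 0.2, "Prospect": 0.1}, seed=7
)
statuses = [status_generator({}, "status") for _ in range(1000)]
```

The resulting function can stand in for the `status` lambda in `custom_generators`.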
#### Using YAML Schema Files
The same relationship management is available with YAML schemas:
```yaml
# customer.yaml
__table_name__: customers
__description__: Customer organizations
id:
  type: integer
  constraints:
    primary_key: true
    not_null: true
name:
  type: string
  constraints:
    not_null: true
    max_length: 100
industry:
  type: string
  constraints:
    max_length: 50
status:
  type: string
  constraints:
    max_length: 20
```
```yaml
# contact.yaml
__table_name__: contacts
__description__: Customer contacts
__foreign_keys__:
  customer_id: [customers, id]
id:
  type: integer
  constraints:
    primary_key: true
    not_null: true
customer_id:
  type: integer
  constraints:
    not_null: true
name:
  type: string
  constraints:
    not_null: true
    max_length: 100
email:
  type: string
  constraints:
    not_null: true
    max_length: 120
phone:
  type: string
  constraints:
    max_length: 20
```
```yaml
# order.yaml
__table_name__: orders
__description__: Customer orders
__foreign_keys__:
  customer_id: [customers, id]
id:
  type: integer
  constraints:
    primary_key: true
    not_null: true
customer_id:
  type: integer
  constraints:
    not_null: true
order_date:
  type: string
  format: date
  constraints:
    not_null: true
total_amount:
  type: number
  format: float
```
```python
# Generate data for multiple related tables with YAML schemas
results = generator.generate_for_schemas(
schemas={
'Customer': 'schemas/customer.yaml',
'Contact': 'schemas/contact.yaml',
'Product': 'schemas/product.yaml',
'Order': 'schemas/order.yaml',
'OrderItem': 'schemas/order_item.yaml'
},
prompts={
"Customer": "Generate diverse customer organizations for a B2B SaaS company.",
"Product": "Generate cloud software products and services."
},
sample_sizes={
"Customer": 10,
"Contact": 20,
"Product": 15,
"Order": 30,
"OrderItem": 60
}
)
```
#### Using JSON Schema Files
JSON schema files offer the same capabilities:
```json
// customer.json
{
"__table_name__": "customers",
"__description__": "Customer organizations",
"id": {
"type": "integer",
"constraints": {
"primary_key": true,
"not_null": true
}
},
"name": {
"type": "string",
"constraints": {
"not_null": true,
"max_length": 100
}
},
"industry": {
"type": "string",
"constraints": {
"max_length": 50
}
},
"status": {
"type": "string",
"constraints": {
"max_length": 20
}
}
}
```
```json
// contact.json
{
"__table_name__": "contacts",
"__description__": "Customer contacts",
"__foreign_keys__": {
"customer_id": ["customers", "id"]
},
"id": {
"type": "integer",
"constraints": {
"primary_key": true,
"not_null": true
}
},
"customer_id": {
"type": "integer",
"constraints": {
"not_null": true
}
},
"name": {
"type": "string",
"constraints": {
"not_null": true,
"max_length": 100
}
},
"email": {
"type": "string",
"constraints": {
"not_null": true,
"max_length": 120
}
},
"phone": {
"type": "string",
"constraints": {
"max_length": 20
}
}
}
```
```json
// order.json
{
"__table_name__": "orders",
"__description__": "Customer orders",
"__foreign_keys__": {
"customer_id": ["customers", "id"]
},
"id": {
"type": "integer",
"constraints": {
"primary_key": true,
"not_null": true
}
},
"customer_id": {
"type": "integer",
"constraints": {
"not_null": true
}
},
"order_date": {
"type": "string",
"format": "date",
"constraints": {
"not_null": true
}
},
"total_amount": {
"type": "number",
"format": "float"
}
}
```
```python
# Generate data for multiple related tables with JSON schemas
results = generator.generate_for_schemas(
schemas={
'Customer': 'schemas/customer.json',
'Contact': 'schemas/contact.json',
'Product': 'schemas/product.json',
'Order': 'schemas/order.json',
'OrderItem': 'schemas/order_item.json'
},
prompts={
"Customer": "Generate diverse customer organizations for a B2B SaaS company.",
"Product": "Generate cloud software products and services."
},
sample_sizes={
"Customer": 10,
"Contact": 20,
"Product": 15,
"Order": 30,
"OrderItem": 60
}
)
```
#### Using Dictionary-Based Schemas
Similar relationship management works with dictionary schemas:
```python
# Define schemas as Python dictionaries
schemas = {
'Customer': {
'__table_name__': 'customers',
'__description__': 'Customer organizations',
'id': {
'type': 'integer',
'constraints': {'primary_key': True, 'not_null': True}
},
'name': {
'type': 'string',
'constraints': {'not_null': True, 'max_length': 100}
},
'industry': {
'type': 'string',
'constraints': {'max_length': 50}
},
'status': {
'type': 'string',
'constraints': {'max_length': 20}
}
},
'Contact': {
'__table_name__': 'contacts',
'__description__': 'Customer contacts',
'__foreign_keys__': {
'customer_id': ['customers', 'id']
},
'id': {
'type': 'integer',
'constraints': {'primary_key': True, 'not_null': True}
},
'customer_id': {
'type': 'integer',
'constraints': {'not_null': True}
},
'name': {
'type': 'string',
'constraints': {'not_null': True, 'max_length': 100}
},
'email': {
'type': 'string',
'constraints': {'not_null': True, 'max_length': 120}
},
'phone': {
'type': 'string',
'constraints': {'max_length': 20}
}
},
'Order': {
'__table_name__': 'orders',
'__description__': 'Customer orders',
'__foreign_keys__': {
'customer_id': ['customers', 'id']
},
'id': {
'type': 'integer',
'constraints': {'primary_key': True, 'not_null': True}
},
'customer_id': {
'type': 'integer',
'constraints': {'not_null': True}
},
'order_date': {
'type': 'string',
'format': 'date',
'constraints': {'not_null': True}
},
'total_amount': {
'type': 'number',
'format': 'float'
}
}
}
# Generate data for dictionary schemas
results = generator.generate_for_schemas(
schemas=schemas,
prompts={
'Customer': 'Generate diverse customer organizations for a B2B SaaS company.'
},
sample_sizes={
'Customer': 10,
'Contact': 20,
'Order': 30
}
)
```
In all cases, the generator will:
1. Analyze relationships between models/schemas
2. Determine the correct generation order using topological sorting
3. Generate parent tables first
4. Use existing primary keys when populating foreign keys in child tables
5. Maintain referential integrity across the entire dataset
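Steps 1 to 3 can be sketched with the standard library alone. The example below is an illustration, not the library's implementation: it reads the `__foreign_keys__` sections of dictionary schemas and uses `graphlib.TopologicalSorter` (Python 3.9+) to produce an order in which every parent comes before its children. `generation_order` is a hypothetical name.

```python
from graphlib import TopologicalSorter


def generation_order(schemas):
    """Order schema names so that every parent precedes its children.

    Hypothetical helper; raises graphlib.CycleError if the foreign
    keys form a cycle.
    """
    sorter = TopologicalSorter()
    for name, schema in schemas.items():
        parents = {
            parent
            for parent, _field in schema.get("__foreign_keys__", {}).values()
        }
        # Register the schema with its parents as predecessors;
        # a self-reference is ignored so it does not count as a cycle.
        sorter.add(name, *(parents - {name}))
    return list(sorter.static_order())


schemas = {
    "OrderItem": {
        "__foreign_keys__": {
            "order_id": ["Order", "id"],
            "product_id": ["Product", "id"],
        }
    },
    "Order": {"__foreign_keys__": {"customer_id": ["Customer", "id"]}},
    "Contact": {"__foreign_keys__": {"customer_id": ["Customer", "id"]}},
    "Customer": {},
    "Product": {},
}
order = generation_order(schemas)
```

Several valid orders exist for this graph; what matters is that `Customer` appears before `Order` and `Contact`, and that `Order` and `Product` appear before `OrderItem`.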
### Complete CRM Example
Here’s a comprehensive example demonstrating `generate_for_sqlalchemy_models` across five interrelated models, including entity definitions, prompt setup, and data verification:
```python
#!/usr/bin/env python
import random
import datetime

from sqlalchemy import Column, Integer, String, ForeignKey, Float, Date, Boolean, Text
from sqlalchemy.orm import declarative_base, relationship

from syda.structured import SyntheticDataGenerator
from syda.schemas import ModelConfig

Base = declarative_base()


class Customer(Base):
    __tablename__ = 'customers'
    id = Column(Integer, primary_key=True)
    name = Column(String(100), unique=True, comment="Customer organization name")
    industry = Column(String(50), comment="Customer's primary industry")
    website = Column(String(100), comment="Customer's website URL")
    status = Column(String(20), comment="Active, Inactive, Prospect")
    created_at = Column(Date, default=datetime.date.today, comment="Date when added to CRM")
    contacts = relationship("Contact", back_populates="customer")
    orders = relationship("Order", back_populates="customer")


class Contact(Base):
    __tablename__ = 'contacts'
    id = Column(Integer, primary_key=True)
    customer_id = Column(Integer, ForeignKey('customers.id'), comment="Customer this contact belongs to")
    first_name = Column(String(50), comment="Contact's first name")
    last_name = Column(String(50), comment="Contact's last name")
    email = Column(String(100), unique=True, comment="Contact's email address")
    phone = Column(String(20), comment="Contact's phone number")
    position = Column(String(100), comment="Job title or position")
    is_primary = Column(Boolean, default=False, comment="Primary contact flag")
    customer = relationship("Customer", back_populates="contacts")


class Product(Base):
    __tablename__ = 'products'
    id = Column(Integer, primary_key=True)
    name = Column(String(100), unique=True, comment="Product name")
    category = Column(String(50), comment="Product category")
    price = Column(Float, comment="Product price in USD")
    description = Column(Text, comment="Product description")
    order_items = relationship("OrderItem", back_populates="product")


class Order(Base):
    __tablename__ = 'orders'
    id = Column(Integer, primary_key=True)
    customer_id = Column(Integer, ForeignKey('customers.id'), comment="Customer who placed the order")
    order_date = Column(Date, comment="Date when order was placed")
    status = Column(String(20), comment="Order status: New, Processing, Shipped, Delivered, Cancelled")
    total_amount = Column(Float, comment="Total amount in USD")
    customer = relationship("Customer", back_populates="orders")
    items = relationship("OrderItem", back_populates="order")


class OrderItem(Base):
    __tablename__ = 'order_items'
    id = Column(Integer, primary_key=True)
    order_id = Column(Integer, ForeignKey('orders.id'), comment="Order this item belongs to")
    product_id = Column(Integer, ForeignKey('products.id'), comment="Product in the order")
    quantity = Column(Integer, comment="Quantity ordered")
    unit_price = Column(Float, comment="Unit price at order time")
    order = relationship("Order", back_populates="items")
    product = relationship("Product", back_populates="order_items")


def main():
    # Configure the model through ModelConfig, as in the other examples;
    # the provider string 'openai' is assumed here.
    model_config = ModelConfig(provider='openai', model_name='gpt-4')
    generator = SyntheticDataGenerator(model_config=model_config)
    output_dir = 'crm_data'
    prompts = {
        "customers": "Generate diverse customer organizations for a B2B SaaS company.",
        "products": "Generate products for a cloud software company.",
        "orders": "Generate realistic orders with appropriate dates and statuses."
    }
    sample_sizes = {"customers": 10, "contacts": 25, "products": 15, "orders": 30, "order_items": 60}
    results = generator.generate_for_sqlalchemy_models(
        sqlalchemy_models=[Customer, Contact, Product, Order, OrderItem],
        prompts=prompts,
        sample_sizes=sample_sizes,
        output_dir=output_dir
    )

    # Referential integrity checks (results are keyed by __tablename__)
    print("\n🔍 Verifying referential integrity:")
    if set(results['contacts']['customer_id']).issubset(set(results['customers']['id'])):
        print("  ✅ All Contact.customer_id values are valid.")
    if set(results['order_items']['product_id']).issubset(set(results['products']['id'])):
        print("  ✅ All OrderItem.product_id values are valid.")


if __name__ == '__main__':
    main()
```
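The checks in the example print a message only when every reference is valid and stay silent otherwise. A variant that reports the offending values can be easier to debug. The sketch below uses plain lists of dictionaries as stand-in data; `orphaned_references` is a hypothetical helper that accepts any iterables of key values.

```python
def orphaned_references(child_values, parent_values):
    """Return child foreign key values that match no parent key.

    Hypothetical helper; an empty set means referential integrity holds.
    """
    return set(child_values) - set(parent_values)


# Stand-in data with one deliberately broken reference (customer_id 9).
customers = [{"id": 1}, {"id": 2}, {"id": 3}]
contacts = [
    {"id": 10, "customer_id": 1},
    {"id": 11, "customer_id": 3},
    {"id": 12, "customer_id": 9},
]

orphans = orphaned_references(
    (contact["customer_id"] for contact in contacts),
    (customer["id"] for customer in customers),
)
if orphans:
    print(f"Contact.customer_id has no matching Customer.id for: {sorted(orphans)}")
```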
## Metadata Enhancement Benefits with SQLAlchemy Models
* **Richer Context**: Leverages docstrings, comments, and column constraints to enrich prompts.
* **Simpler Prompts**: Less manual specification; model infers details.
* **Constraint Awareness**: Respects `nullable`, `unique`, and length constraints.
* **Custom Generators**: Column-level functions for fine-tuned data.
* **Automatic Docstring Utilization**: Embeds business context from model definitions.
## Unstructured Document Generation
SYDA can generate realistic unstructured documents such as PDF reports, letters, and forms based on templates. This is useful for applications that require document generation with synthetic data.
For complete examples, see the [examples/unstructured_only](examples/unstructured_only) directory, which includes healthcare document generation samples.
### Template-Based Document Generation
Create template-based document schemas by specifying template fields in your schema:
```python
from syda.generate import SyntheticDataGenerator
from syda.schemas import ModelConfig
# Initialize generator
config = ModelConfig(provider="anthropic", model_name="claude-3-5-haiku-20241022")
generator = SyntheticDataGenerator(model_config=config)
# Define template-based schemas
schemas = {
'MedicalReport': 'schemas/medical_report.yml',
'LabResult': 'schemas/lab_result.yml'
}
```
Here's an example of a medical report template schema:
```yaml
# Medical report template schema (medical_report.yml)
__template__: true
__description__: Medical report template for patient visits
__name__: MedicalReport
__foreign_keys__: {}
__template_source__: templates/medical_report_template.html
__input_file_type__: html
__output_file_type__: pdf
# Patient information
patient_id:
  type: string
  format: uuid
patient_name:
  type: string
date_of_birth:
  type: string
  format: date
visit_date:
  type: string
  format: date-time
chief_complaint:
  type: string
medical_history:
  type: string
# Vital signs
blood_pressure:
  type: string
heart_rate:
  type: integer
respiratory_rate:
  type: integer
temperature:
  type: number
oxygen_saturation:
  type: integer
# Clinical information
assessment:
  type: string
```
```python
# Generate data and PDF documents
results = generator.generate_for_schemas(
    schemas=schemas,
    sample_sizes={
        'MedicalReport': 5,
        'LabResult': 5
    },
    prompts={
        'MedicalReport': 'Generate synthetic medical reports for patients',
        'LabResult': 'Generate synthetic laboratory test results for patients'
    },
    output_dir="output"
)
```
### Template Schema Requirements
Template-based schemas must include these special fields:
```yaml
__template__: true
__template_source__: /path/to/template.html
__input_file_type__: html
__output_file_type__: pdf
```
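A quick pre-flight check can catch a missing special field before generation starts. The sketch below assumes the schema has already been loaded into a plain dictionary; `missing_template_fields` is a hypothetical helper, not a syda function, and the field names are the four listed above.

```python
# The four special fields listed above.
REQUIRED_TEMPLATE_FIELDS = (
    "__template__",
    "__template_source__",
    "__input_file_type__",
    "__output_file_type__",
)


def missing_template_fields(schema):
    """Return the required special fields that the schema lacks."""
    return [key for key in REQUIRED_TEMPLATE_FIELDS if key not in schema]


# Stand-in schema that forgets to declare its output type.
incomplete_schema = {
    "__template__": True,
    "__template_source__": "templates/medical_report_template.html",
    "__input_file_type__": "html",
    "patient_name": {"type": "string"},
}
missing = missing_template_fields(incomplete_schema)
```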
The template file (like HTML) includes variable placeholders that get replaced with generated data. Here's an example of a Jinja2 HTML template for medical reports corresponding to the schema above:
```html
<!DOCTYPE html>
<html>
<head>
<meta charset="UTF-8">
<title>Medical Report</title>
<style>
body {
font-family: Arial, sans-serif;
margin: 40px;
line-height: 1.6;
}
.header {
text-align: center;
border-bottom: 2px solid #333;
padding-bottom: 10px;
margin-bottom: 20px;
}
.section {
margin-bottom: 20px;
}
.section-title {
font-weight: bold;
margin-bottom: 5px;
}
</style>
</head>
<body>
<div class="header">
<h1>MEDICAL REPORT</h1>
</div>
<div class="section">
<div class="section-title">PATIENT INFORMATION</div>
<p>
<strong>Patient ID:</strong> {{ patient_id }}<br>
<strong>Name:</strong> {{ patient_name }}<br>
<strong>Date of Birth:</strong> {{ date_of_birth }}
</p>
</div>
<div class="section">
<div class="section-title">VISIT INFORMATION</div>
<p>
<strong>Visit Date:</strong> {{ visit_date }}<br>
<strong>Chief Complaint:</strong> {{ chief_complaint }}
</p>
</div>
<div class="section">
<div class="section-title">MEDICAL HISTORY</div>
<p>{{ medical_history }}</p>
</div>
<div class="section">
<div class="section-title">VITAL SIGNS</div>
<p>
<strong>Blood Pressure:</strong> {{ blood_pressure }}<br>
<strong>Heart Rate:</strong> {{ heart_rate }} bpm<br>
<strong>Respiratory Rate:</strong> {{ respiratory_rate }} breaths/min<br>
<strong>Temperature:</strong> {{ temperature }}°F<br>
<strong>Oxygen Saturation:</strong> {{ oxygen_saturation }}%
</p>
</div>
<div class="section">
<div class="section-title">ASSESSMENT</div>
<p>{{ assessment }}</p>
</div>
</body>
</html>
```
The template uses Jinja2's `{{ variable_name }}` syntax: each placeholder name matches a schema field, and the generated value for that field is inserted into the HTML document.
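Since placeholders and schema fields have to line up by name, it can help to compare the two before generating documents. The following is a minimal sketch using only the standard library; `undefined_placeholders` is a hypothetical helper (not a Syda or Jinja2 function), and its regular expression only recognizes simple `{{ name }}` placeholders, not filters, attribute access, or loops.

```python
import re

# Matches simple Jinja2-style placeholders such as {{ patient_name }}.
# Filters, attribute access, and loops are deliberately ignored here.
PLACEHOLDER = re.compile(r"\{\{\s*([A-Za-z_][A-Za-z0-9_]*)\s*\}\}")


def undefined_placeholders(template_text: str, schema_fields) -> list:
    """Return placeholder names used in the template but absent from the schema."""
    used = PLACEHOLDER.findall(template_text)
    known = set(schema_fields)
    # dict.fromkeys keeps first-seen order while dropping duplicates
    return [name for name in dict.fromkeys(used) if name not in known]


template = """
<strong>Name:</strong> {{ patient_name }}<br>
<strong>Heart Rate:</strong> {{ heart_rate }} bpm<br>
<strong>Weight:</strong> {{ weight_kg }} kg
"""
fields = ["patient_name", "heart_rate", "temperature"]

print(undefined_placeholders(template, fields))
```

Here `weight_kg` is reported because no schema field provides it.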
### Supported Template Types
- HTML → PDF: Best supported with complete styling control
- HTML → HTML: Simple text formatting
Support for additional template formats is planned for future versions.
## Combined Structured and Unstructured Data
Syda can generate structured data (tables and databases) and unstructured content (documents) together, so that the documents stay consistent with the structured records.
For working examples, see the [examples/structured_and_unstructured](examples/structured_and_unstructured) directory, which contains retail receipt generation and CRM document examples.
### Connecting Documents to Structured Data
You can create relationships between document schemas and structured data schemas:
```python
from syda.generate import SyntheticDataGenerator
generator = SyntheticDataGenerator()
# Define both structured and template-based schemas
schemas = {
    'Customer': 'schemas/customer.yml',        # Structured data
    'Product': 'schemas/product.yml',          # Structured data
    'Transaction': 'schemas/transaction.yml',  # Structured data
    'Receipt': 'schemas/receipt.yml'           # Template-based document
}
```
Here's what a structured data schema for a `Customer` might look like:
```yaml
# Customer schema (customer.yml)
__table_name__: Customer
__description__: Retail customers
id:
  type: integer
  description: Unique customer ID
  constraints:
    primary_key: true
    not_null: true
    min: 1
first_name:
  type: string
  description: Customer's first name
  constraints:
    not_null: true
    length: 50
last_name:
  type: string
  description: Customer's last name
  constraints:
    not_null: true
    length: 50
email:
  type: email
  description: Customer's email address
  constraints:
    not_null: true
    unique: true
    length: 100
```
And here's a template-based document schema for a `Receipt` that references the structured data:
```yaml
# Receipt template schema (receipt.yml)
__template__: true
__description__: Retail receipt template
__name__: Receipt
__depends_on__: [Product, Transaction, Customer]
__foreign_keys__:
  customer_name: [Customer, first_name]
__template_source__: templates/receipt.html
__input_file_type__: html
__output_file_type__: pdf
# Receipt header
store_name:
  type: string
  length: 50
  description: Name of the retail store
store_address:
  type: address
  length: 150
  description: Full address of the store
# Receipt details
receipt_number:
  type: string
  pattern: '^RCP-\d{8}$'
  length: 12
  description: Unique receipt identifier
# Product purchase details
items:
  type: array
  description: "List of purchased items with product details"
```
```python
# Generate everything - maintains relationships between structured and document data
results = generator.generate_for_schemas(
    schemas=schemas,
    output_dir="output"
)
# Results include both DataFrames and generated documents
customers_df = results['Customer']
receipts_df = results['Receipt']  # Contains metadata about generated documents
```
### Schema Dependencies for Documents
Template schemas can specify dependencies on structured schemas:
```yaml
# Receipt template schema (receipt.yml)
__template__: true
__name__: Receipt
__depends_on__: [Product, Transaction, Customer]
__foreign_keys__:
  customer_id: [Customer, id]
__template_source__: templates/receipt.html
__input_file_type__: html
__output_file_type__: pdf
```
This ensures that dependent structured data is generated first, and related documents can reference that data.
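One common way to derive a valid generation order from `__depends_on__` declarations is a topological sort. The sketch below uses Python's standard `graphlib` module (Python 3.9+) and is not taken from Syda's source; the dependency of `Transaction` on `Customer` and `Product` is an assumption made for illustration.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Schema name -> schemas it depends on (mirrors __depends_on__ above).
# The Transaction entry is an assumption for this example.
depends_on = {
    "Customer": [],
    "Product": [],
    "Transaction": ["Customer", "Product"],
    "Receipt": ["Product", "Transaction", "Customer"],
}

# static_order() yields every node after all of its dependencies.
generation_order = list(TopologicalSorter(depends_on).static_order())
print(generation_order)
```

`Receipt` always comes last because it depends on every other schema. A circular dependency would raise `graphlib.CycleError` instead of producing an order.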
Here's an example of a receipt HTML template that uses data from both the receipt schema and the related structured data:
```html
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<title>Receipt</title>
<style>
body {
font-family: 'Courier New', Courier, monospace;
font-size: 12px;
line-height: 1.3;
max-width: 380px;
margin: 0 auto;
padding: 10px;
}
.header, .footer {
text-align: center;
margin-bottom: 10px;
}
.items-table {
width: 100%;
margin-bottom: 10px;
}
.totals {
width: 100%;
margin-bottom: 10px;
}
</style>
</head>
<body>
<div class="header">
<div class="store-name">{{ store_name }}</div>
<div>{{ store_address }}</div>
<div>Tel: {{ store_phone }}</div>
</div>
<div class="receipt-details">
<div>
<div>Receipt #: {{ receipt_number }}</div>
<div>Date: {{ transaction_date }}</div>
<div>Time: {{ transaction_time }}</div>
</div>
</div>
<div class="customer-info">
<div>Customer: {{ customer_name }}</div>
<div>Cust ID: {{ customer_id }}</div>
</div>
<!-- This iterates through items array generated by the custom generator -->
<table class="items-table">
<thead>
<tr>
<th>Item</th>
<th>Qty</th>
<th>Price</th>
<th>Total</th>
</tr>
</thead>
<tbody>
{% for item in items %}
<tr>
<td>{{ item.product_name }}<br><small>SKU: {{ item.sku }}</small></td>
<td>{{ item.quantity }}</td>
<td>${{ "%.2f"|format(item.unit_price) }}</td>
<td>${{ "%.2f"|format(item.item_total) }}</td>
</tr>
{% endfor %}
</tbody>
</table>
<table class="totals">
<tr>
<td>Subtotal:</td>
<td>${{ "%.2f"|format(subtotal) }}</td>
</tr>
<tr>
<td>Tax ({{ "%.2f"|format(tax_rate) }}%):</td>
<td>${{ "%.2f"|format(tax_amount) }}</td>
</tr>
<tr>
<td>TOTAL:</td>
<td>${{ "%.2f"|format(total) }}</td>
</tr>
</table>
<div class="payment-info">
<div>Payment Method: {{ payment_method }}</div>
</div>
<div class="thank-you">
Thank you for shopping with us!
</div>
</body>
</html>
```
Note the Jinja2 `{% for item in items %}...{% endfor %}` loop, which iterates over the `items` array produced by the custom generator described in the next section.
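The template also expects `subtotal`, `tax_rate`, `tax_amount`, and `total`, and formats each amount with `"%.2f"|format(...)`. The sketch below shows one way such values could be computed so that they add up on the rendered receipt; `receipt_totals` is a hypothetical helper, and the use of `Decimal` with half-up rounding is a design choice for this example, not something the library prescribes.

```python
from decimal import Decimal, ROUND_HALF_UP

CENT = Decimal("0.01")


def receipt_totals(items, tax_rate_percent):
    """Compute item totals, subtotal, tax, and grand total, rounded to cents."""
    priced = []
    for item in items:
        item_total = (item["unit_price"] * item["quantity"]).quantize(CENT, ROUND_HALF_UP)
        priced.append({**item, "item_total": item_total})
    subtotal = sum((p["item_total"] for p in priced), Decimal("0"))
    tax_amount = (subtotal * tax_rate_percent / 100).quantize(CENT, ROUND_HALF_UP)
    return {
        "items": priced,
        "subtotal": subtotal,
        "tax_rate": tax_rate_percent,
        "tax_amount": tax_amount,
        "total": subtotal + tax_amount,
    }


context = receipt_totals(
    [
        {"product_name": "USB-C cable", "sku": "SKU-001", "quantity": 2, "unit_price": Decimal("19.99")},
        {"product_name": "Notebook", "sku": "SKU-002", "quantity": 1, "unit_price": Decimal("5.50")},
    ],
    tax_rate_percent=Decimal("8.25"),
)
# "%.2f" % value is what the template's "%.2f"|format(value) filter evaluates to.
print("Subtotal: $%.2f" % context["subtotal"])
print("Total: $%.2f" % context["total"])
```

Rounding each line item before summing keeps the printed item totals consistent with the printed subtotal.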
### Custom Generators for Document Data
For advanced use cases, you can define custom generators to map structured data into document fields:
```python
def generate_receipt_items(row, col_name=None, parent_dfs=None):
    """Generate receipt line items based on transaction and product data."""
    items = []
    if parent_dfs and 'Product' in parent_dfs and 'Transaction' in parent_dfs:
        products_df = parent_dfs['Product']
        transactions_df = parent_dfs['Transaction']
        # Find transactions for this customer
        customer_transactions = transactions_df[transactions_df['customer_id'] == row['customer_id']]
        # Add products from transactions to receipt
        for _, tx in customer_transactions.iterrows():
            matches = products_df[products_df['id'] == tx['product_id']]
            if matches.empty:
                continue  # Skip transactions that reference a missing product
            product = matches.iloc[0]
            items.append({
                "product_name": product['name'],
                "quantity": tx['quantity'],
                "unit_price": product['price'],
                "item_total": tx['quantity'] * product['price']
            })
    return items

# Register the custom generator
generator.register_generator('array', generate_receipt_items, column_name='items')
```
The `parent_dfs` parameter gives access to all previously generated structured data, allowing you to create rich, interconnected documents.
## SQLAlchemy Models with Templates
You can also use SQLAlchemy models to define both your structured data schemas and your template-based documents. This approach suits applications that already use the SQLAlchemy ORM:
```python
from sqlalchemy import Column, Integer, String, Float, ForeignKey, Text
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import relationship
from syda.templates import SydaTemplate
Base = declarative_base()
# Regular structured SQLAlchemy model
class Customer(Base):
    __tablename__ = 'customers'
    id = Column(Integer, primary_key=True)
    name = Column(String(100), nullable=False)
    industry = Column(String(50))
    annual_revenue = Column(Float)
    website = Column(String(100))
    # Relationships
    opportunities = relationship("Opportunity", back_populates="customer")

# Another structured model
class Opportunity(Base):
    __tablename__ = 'opportunities'
    id = Column(Integer, primary_key=True)
    customer_id = Column(Integer, ForeignKey('customers.id'), nullable=False)
    name = Column(String(100), nullable=False)
    value = Column(Float, nullable=False)
    description = Column(Text)
    # Relationships
    customer = relationship("Customer", back_populates="opportunities")

# Template model
class ProposalDocument(Base):
    __tablename__ = 'proposal_documents'
    # Special template attributes
    __template__ = True
    __depends_on__ = ['Opportunity']  # This template depends on the Opportunity model
    # Template source configuration
    __template_source__ = 'templates/proposal.html'
    __input_file_type__ = 'html'
    __output_file_type__ = 'pdf'
    # Fields needed for the template (these become columns in the generated data)
    id = Column(Integer, primary_key=True)
    opportunity_id = Column(Integer, ForeignKey('opportunities.id'), nullable=False)
    title = Column(String(200))
    customer_name = Column(String(100), ForeignKey('customers.name'))
    opportunity_value = Column(Float, ForeignKey('opportunities.value'))
    proposed_solutions = Column(Text)
```
Then generate all data in one call:
```python
from syda.generate import SyntheticDataGenerator
from syda.schemas import ModelConfig
# Initialize generator
config = ModelConfig(provider="anthropic", model_name="claude-3-5-haiku-20241022")
generator = SyntheticDataGenerator(model_config=config)
# Generate all data at once
results = generator.generate_for_sqlalchemy_models(
    sqlalchemy_models=[Customer, Opportunity, ProposalDocument],
    sample_sizes={'customers': 5, 'opportunities': 8, 'proposal_documents': 3},
    output_dir="output"
)
```
The example above demonstrates:
1. Regular SQLAlchemy models for structured data (Customer, Opportunity)
2. A template model (ProposalDocument)
3. Foreign key relationships between the template and structured models
4. Generating everything together with `generate_for_sqlalchemy_models`
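To make the role of the `__template__` marker concrete, the sketch below separates template models from structured ones by inspecting that attribute. It is only an illustration: plain classes stand in for the SQLAlchemy models so the snippet runs without a database, and `split_models` is a hypothetical helper, not how Syda is known to implement this.

```python
# Plain classes stand in for SQLAlchemy models so the sketch runs without a database.
class Customer:
    __tablename__ = "customers"


class Opportunity:
    __tablename__ = "opportunities"


class ProposalDocument:
    __tablename__ = "proposal_documents"
    __template__ = True
    __depends_on__ = ["Opportunity"]


def split_models(models):
    """Separate structured models from template models by the __template__ flag."""
    structured, templates = [], []
    for model in models:
        target = templates if getattr(model, "__template__", False) else structured
        target.append(model.__tablename__)
    return structured, templates


structured, templates = split_models([Customer, Opportunity, ProposalDocument])
print(structured, templates)
```

The names returned are the `__tablename__` values, which are also the keys used in `sample_sizes` above.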
## Model Selection and Configuration
Syda currently supports two AI providers: OpenAI and Anthropic (Claude).
### Basic Configuration
Configure provider, model, temperature, tokens, and proxy settings using the `ModelConfig` class:
```python
from syda.generate import SyntheticDataGenerator
from syda.schemas import ModelConfig, ProxyConfig
# Create a model configuration
config = ModelConfig(
    provider='openai',          # Choose from: 'openai' or 'anthropic'
    model_name='gpt-4-turbo',   # Model name for the selected provider
    temperature=0.7,            # Controls randomness (0.0-1.0)
    seed=42,                    # For reproducible outputs (provider-specific)
    max_tokens=4000,            # Maximum response length (default: 4000)
    proxy=ProxyConfig(          # Optional proxy configuration
        base_url='https://ai-proxy.company.com/v1',
        headers={'X-Company-Auth': 'internal-token'},
        params={'team': 'data-science'}
    )
)
# Initialize generator with the configuration
generator = SyntheticDataGenerator(model_config=config)
```
### Using Different Model Providers
You can switch between OpenAI and Anthropic (Claude) models by changing the `ModelConfig`; the generator interface stays the same for both providers.
#### OpenAI Models
```python
# Default configuration - the library's default model is used when no model_config is provided
default_generator = SyntheticDataGenerator()
# Explicitly configure for GPT-3.5 Turbo (faster and more cost-effective)
openai_config = ModelConfig(
    provider='openai',
    model_name='gpt-3.5-turbo',  # You can also use 'gpt-3.5-turbo-1106' for better JSON handling
    temperature=0.7,
    response_format={"type": "json_object"}  # Forces JSON response format (GPT models)
)
gpt35_generator = SyntheticDataGenerator(model_config=openai_config)
# Generate data with specific model configuration
data = gpt35_generator.generate_data(
    schema={'product_id': 'number', 'product_name': 'text', 'price': 'number'},
    prompt="Generate electronic product data with prices between $500-$2000",
    sample_size=10
)
```
#### Anthropic Claude Models
```python
# Configure for Claude (requires ANTHROPIC_API_KEY environment variable)
claude_config = ModelConfig(
    provider='anthropic',
    model_name='claude-3-sonnet-20240229',  # Available models: claude-3-opus, claude-3-sonnet, claude-3-haiku
    temperature=0.7,
    max_tokens=2000  # Below the 4000-token default; increase it if structured output is truncated
)
claude_generator = SyntheticDataGenerator(model_config=claude_config)
# Generate data with Claude
data = claude_generator.generate_data(
    schema={'product_id': 'number', 'product_name': 'text', 'price': 'number', 'description': 'text'},
    prompt="Generate luxury product data with realistic prices over $1000",
    sample_size=5
)
```
#### Maximum Tokens Parameter
The library now uses a default of 4000 tokens for `max_tokens` to ensure complete responses with all expected columns. This helps prevent incomplete data generation issues.
```python
# Override the default max_tokens setting
config = ModelConfig(
    provider="openai",
    model_name="gpt-4",
    max_tokens=8000,  # Increase for very complex schemas or large sample sizes
    temperature=0.7
)
```
When generating complex data or data with many columns, consider increasing this value if you notice missing columns in your generated data.
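A small check can make missing columns visible instead of leaving them to be spotted by eye. The sketch below is a hypothetical helper (`missing_columns` is not part of Syda) that works on a list of row dictionaries; if your results are pandas DataFrames, `df.to_dict("records")` produces that shape.

```python
def missing_columns(rows, expected_columns):
    """Map each row index to the expected columns that are absent or None."""
    problems = {}
    for index, row in enumerate(rows):
        absent = [col for col in expected_columns if row.get(col) is None]
        if absent:
            problems[index] = absent
    return problems


expected = ["product_id", "product_name", "price", "description"]
rows = [
    {"product_id": 1, "product_name": "Watch", "price": 1500.0, "description": "Steel case"},
    {"product_id": 2, "product_name": "Bag", "price": 2100.0},  # truncated record
]

print(missing_columns(rows, expected))
```

If the result is non-empty, regenerating with a larger `max_tokens` or a smaller sample size is a reasonable first step.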
#### Provider-Specific Optimizations
Each AI provider has different strengths and parameter requirements. The library automatically handles most of the differences, but you can optimize for specific providers:
```python
# OpenAI-specific optimization
openai_optimized = ModelConfig(
    provider='openai',
    model_name='gpt-4-turbo',
    temperature=0.7,
    response_format={"type": "json_object"},  # Only works with OpenAI
    seed=42  # For reproducible outputs
)
# Anthropic-specific optimization
anthropic_optimized = ModelConfig(
    provider='anthropic',
    model_name='claude-3-opus-20240229',
    temperature=0.7,
    system="You are a synthetic data generator that creates realistic, high-quality datasets based on the provided schema."  # System prompt works best with Anthropic
)
```
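If the provider is chosen at runtime, the provider-specific settings can be collected in one place so that an option meant for one provider is never passed to the other. This is a sketch under assumptions: `provider_options` is a hypothetical helper, and it simply mirrors the options used in the two configurations above.

```python
def provider_options(provider, *, seed=None, system_prompt=None):
    """Build only the optional settings that apply to the chosen provider."""
    if provider == "openai":
        options = {"response_format": {"type": "json_object"}}
        if seed is not None:
            options["seed"] = seed
    elif provider == "anthropic":
        options = {}
        if system_prompt:
            options["system"] = system_prompt
    else:
        raise ValueError(f"Unsupported provider: {provider!r}")
    return options


print(provider_options("openai", seed=42))
print(provider_options("anthropic", system_prompt="You are a synthetic data generator."))
```

The resulting dictionary could then be unpacked into the configuration, for example `ModelConfig(provider=..., model_name=..., **options)`.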
### Advanced: Direct Access to LLM Client
For advanced use cases, you can access the underlying LLM client directly for additional control:
```python
from typing import List
from pydantic import BaseModel
from syda.llm import create_llm_client
from syda.schemas import ModelConfig
# Create a standalone LLM client
llm_client = create_llm_client(
    model_config=ModelConfig(
        provider='anthropic',
        model_name='claude-3-opus-20240229'
    ),
    # API key is optional if set in environment variables
    anthropic_api_key="your_api_key"
)
# Define a Pydantic model for structured output
class Book(BaseModel):
    title: str
    author: str
    year: int
    genre: str
    pages: int

class BookCollection(BaseModel):
    books: List[Book]

# Use the client for structured responses
books = llm_client.client.chat.completions.create(
    model="claude-3-opus-20240229",
    response_model=BookCollection,  # Automatically parses the response to this model
    messages=[{"role": "user", "content": "Generate 5 fictional sci-fi books."}]
)
# Access the structured data directly
for book in books.books:
    print(f"{book.title} by {book.author} ({book.year}) - {book.pages} pages")
```
This approach gives you direct control over the client while still providing structured data extraction capabilities.
## Output Options
Syda returns generated data in memory and can also write it to disk as CSV or JSON files:
### Multiple Schema Generation
When generating data for multiple schemas using `generate_for_schemas` or `generate_for_sqlalchemy_models`, you can specify an output directory and format:
```python
# Generate and save data to CSV files (default)
results = generator.generate_for_schemas(
    schemas=schemas,
    output_dir="output_directory",
    output_format="csv"  # Default format
)
# Generate and save data to JSON files
results = generator.generate_for_schemas(
    schemas=schemas,
    output_dir="output_directory",
    output_format="json"
)
```
Each schema will be saved to a separate file with the schema name as the filename. For example:
* CSV format: `output_directory/customer.csv`, `output_directory/order.csv`, etc.
* JSON format: `output_directory/customer.json`, `output_directory/order.json`, etc.
The `results` dictionary will still contain all generated DataFrames, so you can both save to files and work with the data directly in your code.
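The naming scheme above (one file per schema, named after the schema, with the extension given by the format) can be reproduced with the standard library when you want to re-save post-processed data yourself. This is a sketch, not Syda's own writer: `save_rows` is a hypothetical helper and assumes a non-empty list of row dictionaries that all share the same keys.

```python
import csv
import json
import tempfile
from pathlib import Path


def save_rows(rows, schema_name, output_dir, output_format="csv"):
    """Write a list of row dicts to <output_dir>/<schema_name>.<format>."""
    out_dir = Path(output_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"{schema_name}.{output_format}"
    if output_format == "csv":
        with path.open("w", newline="", encoding="utf-8") as handle:
            writer = csv.DictWriter(handle, fieldnames=list(rows[0]))
            writer.writeheader()
            writer.writerows(rows)
    elif output_format == "json":
        path.write_text(json.dumps(rows, indent=2), encoding="utf-8")
    else:
        raise ValueError(f"Unsupported output format: {output_format!r}")
    return path


rows = [{"id": 1, "name": "Acme"}, {"id": 2, "name": "Globex"}]
# A temporary directory keeps the example free of leftover files.
with tempfile.TemporaryDirectory() as tmp:
    csv_path = save_rows(rows, "customer", tmp, "csv")
    json_path = save_rows(rows, "customer", tmp, "json")
    csv_name, json_name = csv_path.name, json_path.name
    reloaded = json.loads(json_path.read_text(encoding="utf-8"))
print(csv_name, json_name)
```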
## Configuration and Error Handling
### API Keys Management
Provide the API key for the provider you are using. There are two ways to supply it:
#### 1. Environment Variables (Recommended)
Set API keys via environment variables:
```bash
# For OpenAI models
export OPENAI_API_KEY=your_openai_key
# For Anthropic models
export ANTHROPIC_API_KEY=your_anthropic_key
# For other providers, set the appropriate environment variables
```
You can also use a `.env` file in your project root and load it with:
```python
from dotenv import load_dotenv
load_dotenv() # This loads API keys from .env file
```
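Failing early with a clear message is usually easier to debug than an authentication error in the middle of a generation run. The sketch below checks for the environment variable that matches the chosen provider; `require_api_key` is a hypothetical helper, not part of Syda, and the variable names are the two shown above.

```python
import os

# Environment variable expected for each provider named in this README.
API_KEY_ENV_VARS = {
    "openai": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
}


def require_api_key(provider, environ=os.environ):
    """Return the API key for `provider`, or raise with a clear message."""
    try:
        var_name = API_KEY_ENV_VARS[provider]
    except KeyError:
        raise ValueError(f"Unknown provider: {provider!r}") from None
    key = environ.get(var_name)
    if not key:
        raise RuntimeError(f"{var_name} is not set; export it or add it to .env")
    return key


# A plain dict stands in for os.environ so the example has no side effects.
print(require_api_key("openai", {"OPENAI_API_KEY": "sk-example"}))
```

In real use, call `require_api_key(provider)` after `load_dotenv()` so that keys from the `.env` file are visible.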
#### 2. Direct Initialization
Provide API keys when initializing the generator:
```python
# With explicit model configuration
generator = SyntheticDataGenerator(
    model_config=ModelConfig(provider='openai', model_name='gpt-4'),
    openai_api_key="your_openai_key",       # Only needed for OpenAI models
    anthropic_api_key="your_anthropic_key"  # Only needed for Anthropic models
)
```
### Error Handling
Syda reports generation failures explicitly so that they are not mistaken for valid output. The library:
1. **Raises explicit exceptions** when data generation fails, rather than returning random data.
2. **Provides detailed error messages** that explain what went wrong and suggest potential fixes.
3. **Validates output structure** to ensure generated data matches the expected schema.
Example error handling:
```python
try:
    data = generator.generate_data(
        schema=YourModel,
        prompt="Generate synthetic data...",
        sample_size=10
    )
    # Process the data...
except ValueError as e:
    print(f"Data generation failed: {e}")
    # Implement fallback strategy or retry with different parameters
```
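One possible retry strategy is a small wrapper that re-invokes the generation call a limited number of times. The sketch below is illustrative: `generate_with_retry` is a hypothetical helper, it assumes failures surface as `ValueError` as in the example above, and `flaky_generate` is a stand-in for a real call such as `lambda: generator.generate_data(...)`.

```python
import time


def generate_with_retry(generate, attempts=3, base_delay=0.0):
    """Call `generate()` until it succeeds or `attempts` calls have failed."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return generate()
        except ValueError as error:
            last_error = error
            if attempt < attempts:
                time.sleep(base_delay * 2 ** (attempt - 1))  # exponential backoff
    raise RuntimeError(f"Generation failed after {attempts} attempts") from last_error


# Stand-in for a generator call that fails twice before succeeding.
calls = {"count": 0}


def flaky_generate():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ValueError("incomplete response")
    return [{"id": 1}]


result = generate_with_retry(flaky_generate)
print(result, calls["count"])
```

`base_delay` defaults to zero so the example runs instantly; with a real API, a delay of a second or more between attempts is a more sensible setting.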
## Contributing
1. Fork the repository.
2. Create a feature branch.
3. Commit your changes.
4. Push to your branch.
5. Open a Pull Request.
## License
See [LICENSE](LICENSE) for details.
total_amount = Column(Float)\n customer = relationship(\"Customer\", back_populates=\"orders\")\n order_items = relationship(\"OrderItem\", back_populates=\"order\")\n\n# OrderItem model with foreign keys to Order and Product\nclass OrderItem(Base):\n __tablename__ = 'order_items'\n \n id = Column(Integer, primary_key=True)\n order_id = Column(Integer, ForeignKey('orders.id'), nullable=False)\n product_id = Column(Integer, ForeignKey('products.id'), nullable=False)\n quantity = Column(Integer, nullable=False)\n price = Column(Float, nullable=False)\n order = relationship(\"Order\", back_populates=\"order_items\")\n product = relationship(\"Product\", back_populates=\"order_items\")\n\n# Initialize generator\ngenerator = SyntheticDataGenerator()\n\n# Generate data for all models in one call\nresults = generator.generate_for_sqlalchemy_models(\n sqlalchemy_models=[Customer, Contact, Product, Order, OrderItem],\n prompts={\n \"customers\": \"Generate diverse customer organizations for a B2B SaaS company.\",\n \"products\": \"Generate cloud software products and services.\"\n },\n sample_sizes={\n \"customers\": 10,\n \"contacts\": 25,\n \"products\": 15,\n \"orders\": 30,\n \"order_items\": 60\n },\n custom_generators={\n \"customers\": {\n # Restrict customer statuses to a fixed set of values, chosen uniformly at random\n \"status\": lambda row, col: random.choice([\"Active\", \"Inactive\", \"Prospect\"]),\n },\n \"products\": {\n # Ensure product categories match your specific business domains\n \"category\": lambda row, col: random.choice([\n \"Cloud Infrastructure\", \"Business Intelligence\", \"Security Services\",\n \"Data Analytics\", \"Custom Development\", \"Support Package\", \"API Services\"\n ])\n },\n }\n)\n```\n\n#### Using YAML Schema Files\n\nThe same relationship management is available with YAML schemas:\n\n```yaml\n# customer.yaml\n__table_name__: customers\n__description__: Customer organizations\n\nid:\n type: integer\n constraints:\n primary_key: true\n 
not_null: true\n\nname:\n type: string\n constraints:\n not_null: true\n max_length: 100\n\nindustry:\n type: string\n constraints:\n max_length: 50\n\nstatus:\n type: string\n constraints:\n max_length: 20\n```\n\n```yaml\n# contact.yaml\n__table_name__: contacts\n__description__: Customer contacts\n__foreign_keys__:\n customer_id: [customers, id]\n\nid:\n type: integer\n constraints:\n primary_key: true\n not_null: true\n\ncustomer_id:\n type: integer\n constraints:\n not_null: true\n\nname:\n type: string\n constraints:\n not_null: true\n max_length: 100\n\nemail:\n type: string\n constraints:\n not_null: true\n max_length: 120\n\nphone:\n type: string\n constraints:\n max_length: 20\n```\n\n```yaml\n# order.yaml\n__table_name__: orders\n__description__: Customer orders\n__foreign_keys__:\n customer_id: [customers, id]\n\nid:\n type: integer\n constraints:\n primary_key: true\n not_null: true\n\ncustomer_id:\n type: integer\n constraints:\n not_null: true\n\norder_date:\n type: string\n format: date\n constraints:\n not_null: true\n\ntotal_amount:\n type: number\n format: float\n```\n\n```python\n# Generate data for multiple related tables with YAML schemas\nresults = generator.generate_for_schemas(\n schemas={\n 'Customer': 'schemas/customer.yaml',\n 'Contact': 'schemas/contact.yaml',\n 'Product': 'schemas/product.yaml',\n 'Order': 'schemas/order.yaml',\n 'OrderItem': 'schemas/order_item.yaml'\n },\n prompts={\n \"Customer\": \"Generate diverse customer organizations for a B2B SaaS company.\",\n \"Product\": \"Generate cloud software products and services.\"\n },\n sample_sizes={\n \"Customer\": 10,\n \"Contact\": 20,\n \"Product\": 15,\n \"Order\": 30,\n \"OrderItem\": 60\n }\n)\n```\n\n#### Using JSON Schema Files\n\nJSON schema files offer the same capabilities:\n\n```json\n// customer.json\n{\n \"__table_name__\": \"customers\",\n \"__description__\": \"Customer organizations\",\n \"id\": {\n \"type\": \"integer\",\n \"constraints\": {\n \"primary_key\": 
true,\n \"not_null\": true\n }\n },\n \"name\": {\n \"type\": \"string\",\n \"constraints\": {\n \"not_null\": true,\n \"max_length\": 100\n }\n },\n \"industry\": {\n \"type\": \"string\",\n \"constraints\": {\n \"max_length\": 50\n }\n },\n \"status\": {\n \"type\": \"string\",\n \"constraints\": {\n \"max_length\": 20\n }\n }\n}\n```\n\n```json\n// contact.json\n{\n \"__table_name__\": \"contacts\",\n \"__description__\": \"Customer contacts\",\n \"__foreign_keys__\": {\n \"customer_id\": [\"customers\", \"id\"]\n },\n \"id\": {\n \"type\": \"integer\",\n \"constraints\": {\n \"primary_key\": true,\n \"not_null\": true\n }\n },\n \"customer_id\": {\n \"type\": \"integer\",\n \"constraints\": {\n \"not_null\": true\n }\n },\n \"name\": {\n \"type\": \"string\",\n \"constraints\": {\n \"not_null\": true,\n \"max_length\": 100\n }\n },\n \"email\": {\n \"type\": \"string\",\n \"constraints\": {\n \"not_null\": true,\n \"max_length\": 120\n }\n },\n \"phone\": {\n \"type\": \"string\",\n \"constraints\": {\n \"max_length\": 20\n }\n }\n}\n```\n\n```json\n// order.json\n{\n \"__table_name__\": \"orders\",\n \"__description__\": \"Customer orders\",\n \"__foreign_keys__\": {\n \"customer_id\": [\"customers\", \"id\"]\n },\n \"id\": {\n \"type\": \"integer\",\n \"constraints\": {\n \"primary_key\": true,\n \"not_null\": true\n }\n },\n \"customer_id\": {\n \"type\": \"integer\",\n \"constraints\": {\n \"not_null\": true\n }\n },\n \"order_date\": {\n \"type\": \"string\",\n \"format\": \"date\",\n \"constraints\": {\n \"not_null\": true\n }\n },\n \"total_amount\": {\n \"type\": \"number\",\n \"format\": \"float\"\n }\n}\n```\n\n```python\n# Generate data for multiple related tables with JSON schemas\nresults = generator.generate_for_schemas(\n schemas={\n 'Customer': 'schemas/customer.json',\n 'Contact': 'schemas/contact.json',\n 'Product': 'schemas/product.json',\n 'Order': 'schemas/order.json',\n 'OrderItem': 'schemas/order_item.json'\n },\n prompts={\n 
\"Customer\": \"Generate diverse customer organizations for a B2B SaaS company.\",\n \"Product\": \"Generate cloud software products and services.\"\n },\n sample_sizes={\n \"Customer\": 10,\n \"Contact\": 20,\n \"Product\": 15,\n \"Order\": 30,\n \"OrderItem\": 60\n }\n)\n```\n\n#### Using Dictionary-Based Schemas\n\nSimilar relationship management works with dictionary schemas:\n\n```python\n# Define schemas as Python dictionaries\nschemas = {\n 'Customer': {\n '__table_name__': 'customers',\n '__description__': 'Customer organizations',\n 'id': {\n 'type': 'integer',\n 'constraints': {'primary_key': True, 'not_null': True}\n },\n 'name': {\n 'type': 'string',\n 'constraints': {'not_null': True, 'max_length': 100}\n },\n 'industry': {\n 'type': 'string',\n 'constraints': {'max_length': 50}\n },\n 'status': {\n 'type': 'string',\n 'constraints': {'max_length': 20}\n }\n },\n 'Contact': {\n '__table_name__': 'contacts',\n '__description__': 'Customer contacts',\n '__foreign_keys__': {\n 'customer_id': ['customers', 'id']\n },\n 'id': {\n 'type': 'integer',\n 'constraints': {'primary_key': True, 'not_null': True}\n },\n 'customer_id': {\n 'type': 'integer',\n 'constraints': {'not_null': True}\n },\n 'name': {\n 'type': 'string',\n 'constraints': {'not_null': True, 'max_length': 100}\n },\n 'email': {\n 'type': 'string',\n 'constraints': {'not_null': True, 'max_length': 120}\n },\n 'phone': {\n 'type': 'string',\n 'constraints': {'max_length': 20}\n }\n },\n 'Order': {\n '__table_name__': 'orders',\n '__description__': 'Customer orders',\n '__foreign_keys__': {\n 'customer_id': ['customers', 'id']\n },\n 'id': {\n 'type': 'integer',\n 'constraints': {'primary_key': True, 'not_null': True}\n },\n 'customer_id': {\n 'type': 'integer',\n 'constraints': {'not_null': True}\n },\n 'order_date': {\n 'type': 'string',\n 'format': 'date',\n 'constraints': {'not_null': True}\n },\n 'total_amount': {\n 'type': 'number',\n 'format': 'float'\n }\n }\n}\n\n# Generate data for 
dictionary schemas\nresults = generator.generate_for_schemas(\n schemas=schemas,\n prompts={\n 'Customer': 'Generate diverse customer organizations for a B2B SaaS company.'\n },\n sample_sizes={\n 'Customer': 10,\n 'Contact': 20,\n 'Order': 30\n }\n)\n```\n\nIn all cases, the generator will:\n1. Analyze relationships between models/schemas\n2. Determine the correct generation order using topological sorting\n3. Generate parent tables first\n4. Use existing primary keys when populating foreign keys in child tables\n5. Maintain referential integrity across the entire dataset\n\n\n### Complete CRM Example\n\nHere\u2019s a comprehensive example demonstrating `generate_for_sqlalchemy_models` across five interrelated models, including entity definitions, prompt setup, and data verification:\n\n```python\n#!/usr/bin/env python\nimport random\nimport datetime\nfrom sqlalchemy import Column, Integer, String, ForeignKey, Float, Date, Boolean, Text\nfrom sqlalchemy.orm import declarative_base, relationship\nfrom syda.structured import SyntheticDataGenerator\n\nBase = declarative_base()\n\nclass Customer(Base):\n __tablename__ = 'customers'\n id = Column(Integer, primary_key=True)\n name = Column(String(100), unique=True, comment=\"Customer organization name\")\n industry = Column(String(50), comment=\"Customer's primary industry\")\n website = Column(String(100), comment=\"Customer's website URL\")\n status = Column(String(20), comment=\"Active, Inactive, Prospect\")\n created_at = Column(Date, default=datetime.date.today, comment=\"Date when added to CRM\")\n contacts = relationship(\"Contact\", back_populates=\"customer\")\n orders = relationship(\"Order\", back_populates=\"customer\")\n\nclass Contact(Base):\n __tablename__ = 'contacts'\n id = Column(Integer, primary_key=True)\n customer_id = Column(Integer, ForeignKey('customers.id'), comment=\"Customer this contact belongs to\")\n first_name = Column(String(50), comment=\"Contact's first name\")\n last_name = 
Column(String(50), comment=\"Contact's last name\")\n email = Column(String(100), unique=True, comment=\"Contact's email address\")\n phone = Column(String(20), comment=\"Contact's phone number\")\n position = Column(String(100), comment=\"Job title or position\")\n is_primary = Column(Boolean, default=False, comment=\"Primary contact flag\")\n customer = relationship(\"Customer\", back_populates=\"contacts\")\n\nclass Product(Base):\n __tablename__ = 'products'\n id = Column(Integer, primary_key=True)\n name = Column(String(100), unique=True, comment=\"Product name\")\n category = Column(String(50), comment=\"Product category\")\n price = Column(Float, comment=\"Product price in USD\")\n description = Column(Text, comment=\"Product description\")\n order_items = relationship(\"OrderItem\", back_populates=\"product\")\n\nclass Order(Base):\n __tablename__ = 'orders'\n id = Column(Integer, primary_key=True)\n customer_id = Column(Integer, ForeignKey('customers.id'), comment=\"Customer who placed the order\")\n order_date = Column(Date, comment=\"Date when order was placed\")\n status = Column(String(20), comment=\"Order status: New, Processing, Shipped, Delivered, Cancelled\")\n total_amount = Column(Float, comment=\"Total amount in USD\")\n customer = relationship(\"Customer\", back_populates=\"orders\")\n items = relationship(\"OrderItem\", back_populates=\"order\")\n\nclass OrderItem(Base):\n __tablename__ = 'order_items'\n id = Column(Integer, primary_key=True)\n order_id = Column(Integer, ForeignKey('orders.id'), comment=\"Order this item belongs to\")\n product_id = Column(Integer, ForeignKey('products.id'), comment=\"Product in the order\")\n quantity = Column(Integer, comment=\"Quantity ordered\")\n unit_price = Column(Float, comment=\"Unit price at order time\")\n order = relationship(\"Order\", back_populates=\"items\")\n product = relationship(\"Product\", back_populates=\"order_items\")\n\n\ndef main():\n generator = 
SyntheticDataGenerator(model='gpt-4')\n output_dir = 'crm_data'\n prompts = {\n \"customers\": \"Generate diverse customer organizations for a B2B SaaS company.\",\n \"products\": \"Generate products for a cloud software company.\",\n \"orders\": \"Generate realistic orders with appropriate dates and statuses.\"\n }\n sample_sizes = {\"customers\": 10, \"contacts\": 25, \"products\": 15, \"orders\": 30, \"order_items\": 60}\n\n results = generator.generate_for_sqlalchemy_models(\n sqlalchemy_models=[Customer, Contact, Product, Order, OrderItem],\n prompts=prompts,\n sample_sizes=sample_sizes,\n output_dir=output_dir\n )\n\n # Referential integrity checks\n print(\"\\n\ud83d\udd0d Verifying referential integrity:\")\n if set(results['Contact']['customer_id']).issubset(set(results['Customer']['id'])):\n print(\" \u2705 All Contact.customer_id values are valid.\")\n if set(results['OrderItem']['product_id']).issubset(set(results['Product']['id'])):\n print(\" \u2705 All OrderItem.product_id values are valid.\")\n\n\nif __name__ == '__main__':\n main()\n```\n\n## Metadata Enhancement Benefits with SQLAlchemy Models\n\n* **Richer Context**: Leverages docstrings, comments, and column constraints to enrich prompts.\n* **Simpler Prompts**: Less manual specification; the model infers details.\n* **Constraint Awareness**: Respects `nullable`, `unique`, and length constraints.\n* **Custom Generators**: Column-level functions for fine-tuned data.\n* **Automatic Docstring Utilization**: Embeds business context from model definitions.\n\n\n## Unstructured Document Generation\n\nSYDA can generate realistic unstructured documents such as PDF reports, letters, and forms based on templates. 
This is useful for applications that require document generation with synthetic data.\n\nFor complete examples, see the [examples/unstructured_only](examples/unstructured_only) directory, which includes healthcare document generation samples.\n\n### Template-Based Document Generation\n\nCreate template-based document schemas by specifying template fields in your schema:\n\n```python\nfrom syda.generate import SyntheticDataGenerator\nfrom syda.schemas import ModelConfig\n\n# Initialize generator\nconfig = ModelConfig(provider=\"anthropic\", model_name=\"claude-3-5-haiku-20241022\")\ngenerator = SyntheticDataGenerator(model_config=config)\n\n# Define template-based schemas\nschemas = {\n 'MedicalReport': 'schemas/medical_report.yml',\n 'LabResult': 'schemas/lab_result.yml'\n}\n```\n\nHere's an example of a medical report template schema:\n\n```yaml\n# Medical report template schema (medical_report.yml)\n__template__: true\n__description__: Medical report template for patient visits\n__name__: MedicalReport\n__foreign_keys__: {}\n__template_source__: templates/medical_report_template.html\n__input_file_type__: html\n__output_file_type__: pdf\n\n# Patient information\npatient_id:\n type: string\n format: uuid\n\npatient_name:\n type: string\n\ndate_of_birth:\n type: string\n format: date\n\nvisit_date:\n type: string\n format: date-time\n\nchief_complaint:\n type: string\n\nmedical_history:\n type: string\n\n# Vital signs\nblood_pressure:\n type: string\n\nheart_rate:\n type: integer\n\nrespiratory_rate:\n type: integer\n\ntemperature:\n type: number\n\noxygen_saturation:\n type: integer\n\n# Clinical information\nassessment:\n type: string\n```\n\nThen generate the data and PDF documents:\n\n```python\n# Generate data and PDF documents\nresults = generator.generate_for_schemas(\n schemas=schemas,\n sample_sizes={\n 'MedicalReport': 5,\n 'LabResult': 5\n },\n prompts={\n 'MedicalReport': 'Generate synthetic medical reports for patients',\n 'LabResult': 'Generate synthetic laboratory test results for patients'\n },\n 
output_dir=\"output\"\n)\n```\n\n### Template Schema Requirements\n\nTemplate-based schemas must include these special fields:\n\n```yaml\n__template__: true\n__template_source__: /path/to/template.html\n__input_file_type__: html\n__output_file_type__: pdf\n```\n\nThe template file (like HTML) includes variable placeholders that get replaced with generated data. Here's an example of a Jinja2 HTML template for medical reports corresponding to the schema above:\n\n```html\n<!DOCTYPE html>\n<html>\n<head>\n <meta charset=\"UTF-8\">\n <title>Medical Report</title>\n <style>\n body {\n font-family: Arial, sans-serif;\n margin: 40px;\n line-height: 1.6;\n }\n .header {\n text-align: center;\n border-bottom: 2px solid #333;\n padding-bottom: 10px;\n margin-bottom: 20px;\n }\n .section {\n margin-bottom: 20px;\n }\n .section-title {\n font-weight: bold;\n margin-bottom: 5px;\n }\n </style>\n</head>\n<body>\n <div class=\"header\">\n <h1>MEDICAL REPORT</h1>\n </div>\n \n <div class=\"section\">\n <div class=\"section-title\">PATIENT INFORMATION</div>\n <p>\n <strong>Patient ID:</strong> {{ patient_id }}<br>\n <strong>Name:</strong> {{ patient_name }}<br>\n <strong>Date of Birth:</strong> {{ date_of_birth }}\n </p>\n </div>\n \n <div class=\"section\">\n <div class=\"section-title\">VISIT INFORMATION</div>\n <p>\n <strong>Visit Date:</strong> {{ visit_date }}<br>\n <strong>Chief Complaint:</strong> {{ chief_complaint }}\n </p>\n </div>\n \n <div class=\"section\">\n <div class=\"section-title\">MEDICAL HISTORY</div>\n <p>{{ medical_history }}</p>\n </div>\n \n <div class=\"section\">\n <div class=\"section-title\">VITAL SIGNS</div>\n <p>\n <strong>Blood Pressure:</strong> {{ blood_pressure }}<br>\n <strong>Heart Rate:</strong> {{ heart_rate }} bpm<br>\n <strong>Respiratory Rate:</strong> {{ respiratory_rate }} breaths/min<br>\n <strong>Temperature:</strong> {{ temperature }}\u00b0F<br>\n <strong>Oxygen Saturation:</strong> {{ oxygen_saturation }}%\n </p>\n </div>\n \n <div 
class=\"section\">\n <div class=\"section-title\">ASSESSMENT</div>\n <p>{{ assessment }}</p>\n </div>\n</body>\n</html>\n```\n\nThe template uses Jinja2's `{{ variable_name }}` syntax to insert the values of the generated schema fields into the HTML document.\n\n### Supported Template Types\n\n- HTML \u2192 PDF: Best supported with complete styling control\n- HTML \u2192 HTML: Simple text formatting\n\nMore template formats will be supported in future versions.\n\n## Combined Structured and Unstructured Data\n\nSYDA excels at generating both structured data (tables/databases) and unstructured content (documents) in a coordinated way.\n\nFor working examples, see the [examples/structured_and_unstructured](examples/structured_and_unstructured) directory, which contains retail receipt generation and CRM document examples.\n\n\n### Connecting Documents to Structured Data\n\nYou can create relationships between document schemas and structured data schemas:\n\n```python\nfrom syda.generate import SyntheticDataGenerator\n\ngenerator = SyntheticDataGenerator()\n\n# Define both structured and template-based schemas\nschemas = {\n 'Customer': 'schemas/customer.yml', # Structured data\n 'Product': 'schemas/product.yml', # Structured data\n 'Transaction': 'schemas/transaction.yml', # Structured data\n 'Receipt': 'schemas/receipt.yml' # Template-based document\n}\n```\n\nHere's what a structured data schema for a `Customer` might look like:\n\n```yaml\n# Customer schema (customer.yml)\n__table_name__: Customer\n__description__: Retail customers\n\nid:\n type: integer\n description: Unique customer ID\n constraints:\n primary_key: true\n not_null: true\n min: 1\n\nfirst_name:\n type: string\n description: Customer's first name\n constraints:\n not_null: true\n length: 50\n\nlast_name:\n type: string\n description: Customer's last name\n constraints:\n not_null: true\n length: 50\n \nemail:\n type: email\n description: Customer's email address\n constraints:\n 
not_null: true\n unique: true\n length: 100\n```\n\nAnd here's a template-based document schema for a `Receipt` that references the structured data:\n\n```yaml\n# Receipt template schema (receipt.yml)\n__template__: true\n__description__: Retail receipt template\n__name__: Receipt\n__depends_on__: [Product, Transaction, Customer]\n__foreign_keys__:\n customer_name: [Customer, first_name]\n \n__template_source__: templates/receipt.html\n__input_file_type__: html\n__output_file_type__: pdf\n\n# Receipt header\nstore_name:\n type: string\n length: 50\n description: Name of the retail store\n\nstore_address:\n type: address\n length: 150\n description: Full address of the store\n\n# Receipt details\nreceipt_number:\n type: string\n pattern: '^RCP-\\d{8}$'\n length: 12\n description: Unique receipt identifier\n\n# Product purchase details\nitems:\n type: array\n description: \"List of purchased items with product details\"\n```\n\nWith the schemas in place, a single call generates both the structured data and the documents:\n\n```python\n# Generate everything - maintains relationships between structured and document data\nresults = generator.generate_for_schemas(\n schemas=schemas,\n output_dir=\"output\"\n)\n\n# Results include both DataFrames and generated documents\ncustomers_df = results['Customer']\nreceipts_df = results['Receipt'] # Contains metadata about generated documents\n```\n\n### Schema Dependencies for Documents\n\nTemplate schemas can specify dependencies on structured schemas:\n\n```yaml\n# Receipt template schema (receipt.yml)\n__template__: true\n__name__: Receipt\n__depends_on__: [Product, Transaction, Customer]\n__foreign_keys__:\n customer_id: [Customer, id]\n__template_source__: templates/receipt.html\n__input_file_type__: html\n__output_file_type__: pdf\n```\n\nThis ensures that the structured data a template depends on is generated first, so that the related documents can reference it.\n\nHere's an example of a receipt HTML template that uses data from both the receipt schema and the related structured data:\n\n```html\n<!DOCTYPE html>\n<html lang=\"en\">\n<head>\n <meta 
charset=\"UTF-8\">\n <title>Receipt</title>\n <style>\n body {\n font-family: 'Courier New', Courier, monospace;\n font-size: 12px;\n line-height: 1.3;\n max-width: 380px;\n margin: 0 auto;\n padding: 10px;\n }\n .header, .footer {\n text-align: center;\n margin-bottom: 10px;\n }\n .items-table {\n width: 100%;\n margin-bottom: 10px;\n }\n .totals {\n width: 100%;\n margin-bottom: 10px;\n }\n </style>\n</head>\n<body>\n <div class=\"header\">\n <div class=\"store-name\">{{ store_name }}</div>\n <div>{{ store_address }}</div>\n <div>Tel: {{ store_phone }}</div>\n </div>\n\n <div class=\"receipt-details\">\n <div>\n <div>Receipt #: {{ receipt_number }}</div>\n <div>Date: {{ transaction_date }}</div>\n <div>Time: {{ transaction_time }}</div>\n </div>\n </div>\n\n <div class=\"customer-info\">\n <div>Customer: {{ customer_name }}</div>\n <div>Cust ID: {{ customer_id }}</div>\n </div>\n\n <!-- This iterates through items array generated by the custom generator -->\n <table class=\"items-table\">\n <thead>\n <tr>\n <th>Item</th>\n <th>Qty</th>\n <th>Price</th>\n <th>Total</th>\n </tr>\n </thead>\n <tbody>\n {% for item in items %}\n <tr>\n <td>{{ item.product_name }}<br><small>SKU: {{ item.sku }}</small></td>\n <td>{{ item.quantity }}</td>\n <td>${{ \"%.2f\"|format(item.unit_price) }}</td>\n <td>${{ \"%.2f\"|format(item.item_total) }}</td>\n </tr>\n {% endfor %}\n </tbody>\n </table>\n\n <table class=\"totals\">\n <tr>\n <td>Subtotal:</td>\n <td>${{ \"%.2f\"|format(subtotal) }}</td>\n </tr>\n <tr>\n <td>Tax ({{ \"%.2f\"|format(tax_rate) }}%):</td>\n <td>${{ \"%.2f\"|format(tax_amount) }}</td>\n </tr>\n <tr>\n <td>TOTAL:</td>\n <td>${{ \"%.2f\"|format(total) }}</td>\n </tr>\n </table>\n\n <div class=\"payment-info\">\n <div>Payment Method: {{ payment_method }}</div>\n </div>\n\n <div class=\"thank-you\">\n Thank you for shopping with us!\n </div>\n</body>\n</html>\n```\n\nNote the use of Jinja2's `{% for item in items %}...{% endfor %}` loop to iterate through the array 
of items that was generated with our custom generator.\n\n### Custom Generators for Document Data\n\nFor advanced use cases, you can define custom generators to map structured data into document fields:\n\n```python\ndef generate_receipt_items(row, col_name=None, parent_dfs=None):\n \"\"\"Generate receipt line items based on transaction and product data.\"\"\"\n items = []\n if parent_dfs and 'Product' in parent_dfs and 'Transaction' in parent_dfs:\n products_df = parent_dfs['Product']\n transactions_df = parent_dfs['Transaction']\n \n # Find transactions for this customer\n customer_transactions = transactions_df[transactions_df['customer_id'] == row['customer_id']]\n \n # Add products from transactions to receipt\n for _, tx in customer_transactions.iterrows():\n product = products_df[products_df['id'] == tx['product_id']].iloc[0]\n items.append({\n \"product_name\": product['name'],\n \"quantity\": tx['quantity'],\n \"unit_price\": product['price'],\n \"item_total\": tx['quantity'] * product['price']\n })\n return items\n\n# Register the custom generator\ngenerator.register_generator('array', generate_receipt_items, column_name='items')\n```\n\nThe `parent_dfs` parameter gives access to all previously generated structured data, allowing you to create rich, interconnected documents.\n\n\n## SQLAlchemy Models with Templates\n\nYou can also use SQLAlchemy models to define both your structured data schema and template-based documents. 
This approach is great for applications that already use SQLAlchemy ORM:\n\n```python\nfrom sqlalchemy import Column, Integer, String, Float, ForeignKey, Text\nfrom sqlalchemy.ext.declarative import declarative_base\nfrom sqlalchemy.orm import relationship\nfrom syda.templates import SydaTemplate\n\nBase = declarative_base()\n\n# Regular structured SQLAlchemy model\nclass Customer(Base):\n __tablename__ = 'customers'\n \n id = Column(Integer, primary_key=True)\n name = Column(String(100), nullable=False)\n industry = Column(String(50))\n annual_revenue = Column(Float)\n website = Column(String(100))\n \n # Relationships\n opportunities = relationship(\"Opportunity\", back_populates=\"customer\")\n\n# Another structured model\nclass Opportunity(Base):\n __tablename__ = 'opportunities'\n \n id = Column(Integer, primary_key=True)\n customer_id = Column(Integer, ForeignKey('customers.id'), nullable=False)\n name = Column(String(100), nullable=False)\n value = Column(Float, nullable=False)\n description = Column(Text)\n \n # Relationships\n customer = relationship(\"Customer\", back_populates=\"opportunities\")\n\n# Template model\nclass ProposalDocument(Base):\n __tablename__ = 'proposal_documents'\n \n # Special template attributes\n __template__ = True\n __depends_on__ = ['Opportunity'] # This template depends on the Opportunity model\n \n # Template source configuration\n __template_source__ = 'templates/proposal.html'\n __input_file_type__ = 'html'\n __output_file_type__ = 'pdf'\n \n # Fields needed for the template (these become columns in the generated data)\n id = Column(Integer, primary_key=True)\n opportunity_id = Column(Integer, ForeignKey('opportunities.id'), nullable=False)\n title = Column(String(200))\n customer_name = Column(String(100), ForeignKey('customers.name'))\n opportunity_value = Column(Float, ForeignKey('opportunities.value'))\n proposed_solutions = Column(Text)\n```\n\nThen generate all data in one call:\n\n```python\nfrom syda.generate import 
SyntheticDataGenerator\nfrom syda.schemas import ModelConfig\n\n# Initialize generator\nconfig = ModelConfig(provider=\"anthropic\", model_name=\"claude-3-5-haiku-20241022\")\ngenerator = SyntheticDataGenerator(model_config=config)\n\n# Generate all data at once\nresults = generator.generate_for_sqlalchemy_models(\n sqlalchemy_models=[Customer, Opportunity, ProposalDocument],\n sample_sizes={'customers': 5, 'opportunities': 8, 'proposal_documents': 3},\n output_dir=\"output\"\n)\n```\n\nThe example above demonstrates:\n1. Regular SQLAlchemy models for structured data (Customer, Opportunity)\n2. A template model (ProposalDocument)\n3. Foreign key relationships between the template and structured models\n4. Generating everything together with `generate_for_sqlalchemy_models`\n\n\n## Model Selection and Configuration\n\nSyda currently supports two AI providers: OpenAI and Anthropic (Claude).\n\n\n\n### Basic Configuration\n\nConfigure provider, model, temperature, tokens, and proxy settings using the `ModelConfig` class:\n\n```python\nfrom syda.schemas import ModelConfig, ProxyConfig\n\n# Create a model configuration\nconfig = ModelConfig(\n provider='openai', # Choose from: 'openai', 'anthropic', etc.\n model_name='gpt-4-turbo', # Model name for the selected provider\n temperature=0.7, # Controls randomness (0.0-1.0)\n seed=42, # For reproducible outputs (provider-specific)\n max_tokens=4000, # Maximum response length (default: 4000)\n proxy=ProxyConfig( # Optional proxy configuration\n base_url='https://ai-proxy.company.com/v1',\n headers={'X-Company-Auth':'internal-token'},\n params={'team':'data-science'}\n )\n)\n\n# Initialize generator with the configuration\ngenerator = SyntheticDataGenerator(model_config=config)\n```\n\n### Using Different Model Providers\n\nThe library currently supports OpenAI and Anthropic (Claude) models and allows you to easily switch between these providers while maintaining a consistent interface.\n\n#### OpenAI Models\n\n```python\n# 
Default configuration - falls back to OpenAI's GPT-4 when no model_config is provided\ndefault_generator = SyntheticDataGenerator()\n\n# Explicitly configure for GPT-3.5 Turbo (faster and more cost-effective)\nopenai_config = ModelConfig(\n provider='openai',\n model_name='gpt-3.5-turbo', # You can also use 'gpt-3.5-turbo-1106' for better JSON handling\n temperature=0.7,\n response_format={\"type\": \"json_object\"} # Forces JSON response format (GPT models)\n)\ngpt35_generator = SyntheticDataGenerator(model_config=openai_config)\n\n# Generate data with specific model configuration\ndata = gpt35_generator.generate_data(\n schema={'product_id': 'number', 'product_name': 'text', 'price': 'number'},\n prompt=\"Generate electronic product data with prices between $500 and $2000\",\n sample_size=10\n)\n```\n\n#### Anthropic Claude Models\n\n```python\n# Configure for Claude (requires ANTHROPIC_API_KEY environment variable)\nclaude_config = ModelConfig(\n provider='anthropic',\n model_name='claude-3-sonnet-20240229', # Available models: claude-3-opus, claude-3-sonnet, claude-3-haiku\n temperature=0.7,\n max_tokens=2000 # Below the 4000 default; raise it if structured output comes back truncated\n)\nclaude_generator = SyntheticDataGenerator(model_config=claude_config)\n\n# Generate data with Claude\ndata = claude_generator.generate_data(\n schema={'product_id': 'number', 'product_name': 'text', 'price': 'number', 'description': 'text'},\n prompt=\"Generate luxury product data with realistic prices over $1000\",\n sample_size=5\n)\n```\n\n#### Maximum Tokens Parameter\n\nThe library defaults `max_tokens` to 4000 so that responses are complete and include all expected columns. 
This helps prevent incomplete data generation issues.\n\n```python\n# Override the default max_tokens setting\nconfig = ModelConfig(\n provider=\"openai\",\n model_name=\"gpt-4\",\n max_tokens=8000, # Increase for very complex schemas or large sample sizes\n temperature=0.7\n)\n```\n\nWhen generating complex data or data with many columns, consider increasing this value if you notice missing columns in your generated data.\n\n#### Provider-Specific Optimizations\n\nEach AI provider has different strengths and parameter requirements. The library automatically handles most of the differences, but you can optimize for specific providers:\n\n```python\n# OpenAI-specific optimization\nopenai_optimized = ModelConfig(\n provider='openai',\n model_name='gpt-4-turbo',\n temperature=0.7,\n response_format={\"type\": \"json_object\"}, # Only works with OpenAI\n seed=42 # For reproducible outputs\n)\n\n# Anthropic-specific optimization\nanthropic_optimized = ModelConfig(\n provider='anthropic',\n model_name='claude-3-opus-20240229',\n temperature=0.7,\n system=\"You are a synthetic data generator that creates realistic, high-quality datasets based on the provided schema.\" # System prompt works best with Anthropic\n)\n```\n\n### Advanced: Direct Access to LLM Client\n\nFor advanced use cases, you can access the underlying LLM client directly for additional control:\n\n```python\nfrom syda.llm import create_llm_client\n\n# Create a standalone LLM client\nllm_client = create_llm_client(\n model_config=ModelConfig(\n provider='anthropic', \n model_name='claude-3-opus-20240229'\n ),\n # API key is optional if set in environment variables\n anthropic_api_key=\"your_api_key\" \n)\n\n# Define a Pydantic model for structured output\nfrom pydantic import BaseModel\nfrom typing import List\n\nclass Book(BaseModel):\n title: str\n author: str\n year: int\n genre: str\n pages: int\n\nclass BookCollection(BaseModel):\n books: List[Book]\n\n# Use the client for structured responses\nbooks = 
llm_client.client.chat.completions.create(\n model=\"claude-3-opus-20240229\",\n response_model=BookCollection, # Automatically parses the response to this model\n messages=[{\"role\": \"user\", \"content\": \"Generate 5 fictional sci-fi books.\"}]\n)\n\n# Access the structured data directly\nfor book in books.books:\n print(f\"{book.title} by {book.author} ({book.year}) - {book.pages} pages\")\n```\n\nThis approach gives you direct control over the client while still providing structured data extraction capabilities.\n\n## Output Options\n\nSyda offers flexible output options to suit different use cases:\n\n### Multiple Schema Generation\n\nWhen generating data for multiple schemas using `generate_for_schemas` or `generate_for_sqlalchemy_models`, you can specify an output directory and format:\n\n```python\n# Generate and save data to CSV files (default)\nresults = generator.generate_for_schemas(\n schemas=schemas,\n output_dir=\"output_directory\",\n output_format=\"csv\" # Default format\n)\n\n# Generate and save data to JSON files\nresults = generator.generate_for_schemas(\n schemas=schemas,\n output_dir=\"output_directory\",\n output_format=\"json\"\n)\n```\n\nEach schema will be saved to a separate file with the schema name as the filename. For example:\n\n* CSV format: `output_directory/customer.csv`, `output_directory/order.csv`, etc.\n* JSON format: `output_directory/customer.json`, `output_directory/order.json`, etc.\n\nThe `results` dictionary will still contain all generated DataFrames, so you can both save to files and work with the data directly in your code.\n\n\n## Configuration and Error Handling\n\n### API Keys Management\n\nYou can provide appropriate API keys based on the provider you're using. There are two recommended ways to manage API keys:\n\n#### 1. 
Environment Variables (Recommended)\n\nSet API keys via environment variables:\n\n```bash\n# For OpenAI models\nexport OPENAI_API_KEY=your_openai_key\n\n# For Anthropic models\nexport ANTHROPIC_API_KEY=your_anthropic_key\n\n# For other providers, set the appropriate environment variables\n```\n\nYou can also use a `.env` file in your project root and load it with:\n\n```python\nfrom dotenv import load_dotenv\nload_dotenv() # Loads API keys from the .env file\n```\n\n#### 2. Direct Initialization\n\nProvide API keys when initializing the generator:\n\n```python\n# With explicit model configuration\ngenerator = SyntheticDataGenerator(\n model_config=ModelConfig(provider='openai', model_name='gpt-4'),\n openai_api_key=\"your_openai_key\", # Only needed for OpenAI models\n anthropic_api_key=\"your_anthropic_key\" # Only needed for Anthropic models\n)\n```\n\n### Error Handling\n\nWhen data generation fails, Syda reports the failure instead of returning unusable data. The library:\n\n1. **Raises Explicit Exceptions**: A failed generation raises an exception rather than returning random data\n2. **Provides Detailed Error Messages**: Messages explain what went wrong and suggest potential fixes\n3. **Validates Output Structure**: Generated data is checked against the expected schema\n\nExample error handling:\n\n```python\ntry:\n data = generator.generate_data(\n schema=YourModel,\n prompt=\"Generate synthetic data...\",\n sample_size=10\n )\n # Process the data...\nexcept ValueError as e:\n print(f\"Data generation failed: {e}\")\n # Implement fallback strategy or retry with different parameters\n```\n\n## Contributing\n\n1. Fork the repository.\n2. Create a feature branch.\n3. Commit your changes.\n4. Push to your branch.\n5. Open a Pull Request.\n\n## License\n\nSee [LICENSE](LICENSE) for details.\n",
"bugtrack_url": null,
"license": "LGPL-3.0-or-later",
"summary": "A Python library for AI-powered synthetic data generation with referential integrity",
"version": "0.0.1",
"project_urls": {
"Changelog": "https://github.com/syda-ai/syda/blob/main/CHANGELOG.md",
"Documentation": "https://python.syda.ai",
"Homepage": "https://github.com/syda-ai/syda",
"Issues": "https://github.com/syda-ai/syda/issues",
"Repository": "https://github.com/syda-ai/syda.git"
},
"split_keywords": [
"synthetic data",
" ai",
" machine learning",
" data generation",
" testing",
" privacy",
" sqlalchemy",
" openai",
" anthropic",
" claude",
" gpt"
],
"urls": [
{
"comment_text": null,
"digests": {
"blake2b_256": "a5928cbaa2017cac233decd154541c40501e4e5e01026d5d91dfb3a14ce701d7",
"md5": "9052607ff4a482147598dcf06dc88bb2",
"sha256": "891af8ad05869b175bc4d1f08d317d089c88759ebcc3368f5f83d921119ee05b"
},
"downloads": -1,
"filename": "syda-0.0.1-py3-none-any.whl",
"has_sig": false,
"md5_digest": "9052607ff4a482147598dcf06dc88bb2",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.8",
"size": 52451,
"upload_time": "2025-08-12T02:24:07",
"upload_time_iso_8601": "2025-08-12T02:24:07.941308Z",
"url": "https://files.pythonhosted.org/packages/a5/92/8cbaa2017cac233decd154541c40501e4e5e01026d5d91dfb3a14ce701d7/syda-0.0.1-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "aa10a19b5a3a3a14dfd08702eabb7bdefba3f8bdd438640da0da09bbe35c2895",
"md5": "0d40ef3154013a29e3c1c4c28956ec98",
"sha256": "1442422c4820e9fdd298976537cb040589512b53ae61324db30187ff6395806e"
},
"downloads": -1,
"filename": "syda-0.0.1.tar.gz",
"has_sig": false,
"md5_digest": "0d40ef3154013a29e3c1c4c28956ec98",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.8",
"size": 167825,
"upload_time": "2025-08-12T02:24:09",
"upload_time_iso_8601": "2025-08-12T02:24:09.433416Z",
"url": "https://files.pythonhosted.org/packages/aa/10/a19b5a3a3a14dfd08702eabb7bdefba3f8bdd438640da0da09bbe35c2895/syda-0.0.1.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2025-08-12 02:24:09",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "syda-ai",
"github_project": "syda",
"travis_ci": false,
"coveralls": false,
"github_actions": false,
"requirements": [
{
"name": "pydantic",
"specs": [
[
">=",
"2.4.2"
]
]
},
{
"name": "python-dotenv",
"specs": [
[
">=",
"1.0.0"
]
]
},
{
"name": "sqlalchemy",
"specs": [
[
">=",
"2.0.23"
]
]
},
{
"name": "pandas",
"specs": [
[
">=",
"2.0.3"
]
]
},
{
"name": "numpy",
"specs": [
[
">=",
"1.24.3"
]
]
},
{
"name": "networkx",
"specs": [
[
">=",
"3.1"
]
]
},
{
"name": "openai",
"specs": [
[
">=",
"1.0.0"
]
]
},
{
"name": "anthropic",
"specs": [
[
">=",
"0.7.0"
]
]
},
{
"name": "instructor",
"specs": [
[
">=",
"1.0.0"
]
]
},
{
"name": "python-magic",
"specs": [
[
">=",
"0.4.27"
]
]
},
{
"name": "python-docx",
"specs": [
[
">=",
"1.0.0"
]
]
},
{
"name": "openpyxl",
"specs": [
[
">=",
"3.1.2"
]
]
},
{
"name": "weasyprint",
"specs": [
[
">=",
"65.1"
]
]
},
{
"name": "pyyaml",
"specs": [
[
">=",
"6.0.1"
]
]
},
{
"name": "pytest",
"specs": [
[
">=",
"7.4.0"
]
]
},
{
"name": "boto3",
"specs": [
[
">=",
"1.28.0"
]
]
},
{
"name": "azure-storage-blob",
"specs": [
[
">=",
"12.19.0"
]
]
},
{
"name": "pdfplumber",
"specs": [
[
">=",
"0.10.3"
]
]
},
{
"name": "pillow",
"specs": [
[
">=",
"10.0.1"
]
]
},
{
"name": "pytesseract",
"specs": [
[
">=",
"0.3.10"
]
]
},
{
"name": "sqlalchemy-utils",
"specs": [
[
">=",
"0.41.1"
]
]
},
{
"name": "mkdocs-material",
"specs": [
[
">=",
"9.6.15"
]
]
},
{
"name": "mkdocs",
"specs": [
[
">=",
"1.6.1"
]
]
},
{
"name": "mkdocs-macros-plugin",
"specs": [
[
">=",
"1.3.7"
]
]
}
],
"lcname": "syda"
}