data4ai

Name: data4ai
Version: 0.2.2
Summary: Production-ready AI-powered dataset generation for instruction tuning and model fine-tuning
Upload time: 2025-08-18 14:34:27
License: MIT
Requires Python: >=3.9
Keywords: ai, dataset, instruction-tuning, llm, machine-learning, training-data
# Data4AI 🚀

> **AI-powered dataset generation for instruction tuning and model fine-tuning**

[![PyPI version](https://badge.fury.io/py/data4ai.svg)](https://pypi.org/project/data4ai/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![Python 3.9+](https://img.shields.io/badge/python-3.9+-blue.svg)](https://www.python.org/downloads/)

Generate high-quality synthetic datasets using state-of-the-art language models through the OpenRouter API. Perfect for creating training data for LLM fine-tuning.

## ✨ Key Features

- 🤖 **100+ AI Models** - Access to GPT-4, Claude, Llama, and more via OpenRouter
- 📊 **Multiple Formats** - Support for ChatML (default) and Alpaca schemas
- 🔮 **DSPy Integration** - Dynamic prompt optimization for better quality
- 📄 **Document Support** - Generate datasets from PDFs, Word docs, Markdown, and text files
- 🎯 **Quality Features** - Optional Bloom's taxonomy, provenance tracking, and quality verification
- 🤖 **Smart Generation** - Both prompt-based and document-based dataset creation
- ☁️ **HuggingFace Hub** - Direct dataset publishing
- ⚡ **Production Ready** - Rate limiting, checkpointing, deduplication

## 🚀 Quick Start

### Installation

```bash
pip install data4ai              # Standard installation
pip install "data4ai[all]"       # With all optional extras
```

### Set Up Environment Variables

Data4AI reads its configuration from environment variables, set either in your shell or via a `.env` file:

#### Option 1: Quick Setup (Current Session)
```bash
# Get your API key from https://openrouter.ai/keys
export OPENROUTER_API_KEY="your_key_here"

# Optional: Set a specific model (default: openai/gpt-4o-mini)
export OPENROUTER_MODEL="anthropic/claude-3-5-sonnet"  # Or another model

# Optional: Set default dataset schema (default: chatml)
export DEFAULT_SCHEMA="chatml"  # Options: chatml, alpaca

# Optional: For publishing to HuggingFace
export HF_TOKEN="your_huggingface_token"
```

#### Option 2: Using .env File
```bash
# Create a .env file in your project directory
echo 'OPENROUTER_API_KEY=your_key_here' > .env
# The tool will automatically load from .env
```

#### Option 3: Permanent Setup
```bash
# Add to your shell config (~/.bashrc, ~/.zshrc, or ~/.profile)
echo 'export OPENROUTER_API_KEY="your_key_here"' >> ~/.bashrc
source ~/.bashrc
```

#### Check Your Setup
```bash
# Verify environment variables are set
echo "OPENROUTER_API_KEY: ${OPENROUTER_API_KEY:0:10}..." # Shows first 10 chars
```

### Generate Your First Dataset

```bash
# Generate from description
data4ai prompt \
  --repo my-dataset \
  --description "Create 10 Python programming questions with answers" \
  --count 10

# View results
cat my-dataset/data.jsonl
```
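
Each line of `data.jsonl` is a single JSON object in the selected schema. With the default ChatML schema, a generated line might look like this (illustrative content; see Supported Schemas below for the exact structure):

```json
{"messages": [{"role": "user", "content": "What does len() return for a Python list?"}, {"role": "assistant", "content": "It returns the number of elements in the list."}]}
```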

## 📚 Common Use Cases

### 1. Generate from Natural Language

```bash
data4ai prompt \
  --repo customer-support \
  --description "Create customer support Q&A for a SaaS product" \
  --count 100
```

### 2. Generate from Documents

```bash
# From single PDF document
data4ai doc research-paper.pdf \
  --repo paper-qa \
  --type qa \
  --count 100

# From entire folder of documents
data4ai doc /path/to/docs/folder \
  --repo multi-doc-dataset \
  --type qa \
  --count 500 \
  --recursive

# Process only specific file types in folder
data4ai doc /path/to/docs \
  --repo pdf-only-dataset \
  --file-types pdf \
  --count 200

# From Word document with summaries
data4ai doc manual.docx \
  --repo manual-summaries \
  --type summary \
  --count 50

# From Markdown with advanced extraction
data4ai doc README.md \
  --repo docs-dataset \
  --type instruction \
  --advanced

# Generate with optional quality features:
#   --taxonomy balanced  -> use Bloom's taxonomy for diverse questions
#   --provenance         -> include source references
#   --verify             -> verify quality (2x API calls)
#   --long-context       -> merge chunks for better coherence
data4ai doc document.pdf \
  --repo high-quality-dataset \
  --count 200 \
  --taxonomy balanced \
  --provenance \
  --verify \
  --long-context
```

### 3. High-Quality Generation

```bash
# Basic generation (simple and fast)
data4ai doc document.pdf --repo basic-dataset --count 100

# With cognitive diversity using Bloom's Taxonomy
data4ai doc document.pdf \
  --repo taxonomy-dataset \
  --count 100 \
  --taxonomy balanced  # Creates questions at all cognitive levels

# With source tracking for verifiable datasets
data4ai doc research-papers/ \
  --repo cited-dataset \
  --count 500 \
  --provenance  # Includes character offsets for each answer

# Full quality mode for production datasets:
#   --chunk-tokens 250  -> token-based chunking
#   --taxonomy balanced -> cognitive diversity
#   --provenance        -> source tracking
#   --verify            -> quality verification
#   --long-context      -> optimized context usage
data4ai doc documents/ \
  --repo production-dataset \
  --count 1000 \
  --chunk-tokens 250 \
  --taxonomy balanced \
  --provenance \
  --verify \
  --long-context
```

### 4. Publish to HuggingFace

```bash
# Generate and publish
data4ai prompt \
  --repo my-public-dataset \
  --description "Educational content about machine learning" \
  --count 200 \
  --huggingface
```

## 📚 Available Commands

### `data4ai prompt`
Generate a dataset from a natural language description using AI.

```bash
data4ai prompt --repo <name> --description <text> [options]
```

### `data4ai doc`
Generate a dataset from one or more documents; supports PDF, DOCX, MD, and TXT files.

```bash
data4ai doc <file_or_folder> --repo <name> [options]
```

### `data4ai push`
Upload an existing dataset to the HuggingFace Hub.

```bash
data4ai push --repo <name> [options]
```

## 🐍 Python API

```python
from data4ai import generate_from_description, generate_from_document

# Generate from description (uses ChatML by default)
result = generate_from_description(
    description="Create Python interview questions",
    repo="python-interviews",
    count=50,
    schema="chatml"  # Optional, ChatML is default
)

# Generate from document with quality features
result = generate_from_document(
    document_path="research-paper.pdf",
    repo="paper-qa",
    extraction_type="qa",
    count=100,
    taxonomy="balanced",      # Optional: Bloom's taxonomy
    include_provenance=True,   # Optional: Source tracking
    verify_quality=True        # Optional: Quality verification
)

print(f"Generated {result['row_count']} examples")
```
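
Generated rows are written to `<repo>/data.jsonl`, the same file the CLI quick start inspects with `cat`. A minimal sketch for spot-checking the output with the standard library, assuming that default layout:

```python
import json
from pathlib import Path

# Assumes the default output layout: one JSON object per line in
# <repo>/data.jsonl (the repo name here is illustrative).
dataset_path = Path("python-interviews") / "data.jsonl"

examples = []
with dataset_path.open(encoding="utf-8") as f:
    for line in f:
        line = line.strip()
        if line:
            examples.append(json.loads(line))

print(f"Loaded {len(examples)} examples")
if examples:
    # In the default ChatML schema each example carries a "messages" list
    print(json.dumps(examples[0], ensure_ascii=False, indent=2))
```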

## 📋 Supported Schemas

**ChatML** (Default - OpenAI format)
```json
{
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is machine learning?"},
    {"role": "assistant", "content": "Machine learning is..."}
  ]
}
```

**Alpaca** (Instruction tuning)
```json
{
  "instruction": "What is machine learning?",
  "input": "Explain in simple terms",
  "output": "Machine learning is..."
}
```
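
Both schemas carry the same information: an Alpaca record maps onto ChatML by folding `instruction` and `input` into the user turn and `output` into the assistant turn. A small illustrative conversion (not part of the data4ai API):

```python
def alpaca_to_chatml(record: dict, system_prompt: str = "You are a helpful assistant.") -> dict:
    """Map one Alpaca-style record onto the ChatML message format.

    Illustrative helper only; data4ai can generate either schema directly.
    """
    user_content = record["instruction"]
    if record.get("input"):
        user_content += "\n\n" + record["input"]
    return {
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_content},
            {"role": "assistant", "content": record["output"]},
        ]
    }


print(alpaca_to_chatml({
    "instruction": "What is machine learning?",
    "input": "Explain in simple terms",
    "output": "Machine learning is...",
}))
```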


## 🎯 Quality Features (Optional)

All quality features are **optional** - use them when you need higher-quality datasets:

| Feature | Flag | Description | Performance Impact |
|---------|------|-------------|-------------------|
| **Token Chunking** | `--chunk-tokens N` | Use token count instead of characters | Minimal |
| **Bloom's Taxonomy** | `--taxonomy balanced` | Create cognitively diverse questions | None |
| **Provenance** | `--provenance` | Include source references | Minimal |
| **Quality Verification** | `--verify` | Verify and improve examples | 2x API calls |
| **Long Context** | `--long-context` | Merge chunks for coherence | May reduce API calls |

### When to Use Quality Features

- **Quick Prototyping**: No features needed - fast and simple
- **Production Datasets**: Use `--taxonomy` and `--verify`
- **Academic/Research**: Use all features for maximum quality
- **Citation Required**: Always use `--provenance`

## ⚙️ Configuration

Create `.env` file:
```bash
OPENROUTER_API_KEY=your_key_here
OPENROUTER_MODEL=openai/gpt-4o-mini  # Optional (this is the default)
DEFAULT_SCHEMA=chatml                # Optional (this is the default)
HF_TOKEN=your_huggingface_token      # For publishing
```
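
If you call the Python API directly rather than the CLI, it can help to fail fast when the required variable is missing. A small sketch, assuming the variable names and defaults documented above:

```python
import os

# OPENROUTER_API_KEY is required; the other variables have documented defaults.
if not os.environ.get("OPENROUTER_API_KEY"):
    raise SystemExit("OPENROUTER_API_KEY is not set")

print("Model: ", os.environ.get("OPENROUTER_MODEL", "openai/gpt-4o-mini"))
print("Schema:", os.environ.get("DEFAULT_SCHEMA", "chatml"))
```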


## 📖 Documentation

- [Detailed Usage Guide](docs/DETAILED_USAGE.md) - Complete CLI reference
- [Examples](docs/EXAMPLES.md) - Code examples and recipes
- [API Documentation](docs/API.md) - Python API reference
- [Publishing Guide](docs/PUBLISHING.md) - PyPI publishing instructions
- [All Documentation](docs/README.md) - Complete documentation index

## 🛠️ Development

```bash
# Clone repository
git clone https://github.com/zysec-ai/data4ai.git
cd data4ai

# Install for development
pip install -e ".[dev]"

# Run tests
pytest

# Check code quality
ruff check .
black --check .
```

## 🤝 Contributing

Contributions welcome! Please check our [Contributing Guide](CONTRIBUTING.md).

## 📄 License

MIT License - see [LICENSE](LICENSE) file.

## 🔗 Links

- [PyPI Package](https://pypi.org/project/data4ai/)
- [GitHub Repository](https://github.com/zysec-ai/data4ai)
- [Documentation](https://github.com/zysec-ai/data4ai/tree/main/docs)
- [Issue Tracker](https://github.com/zysec-ai/data4ai/issues)

---

**Made with ❤️ by [ZySec AI](https://zysec.ai)**
            
