# Data4AI 🚀
> **AI-powered dataset generation for instruction tuning and model fine-tuning**
[PyPI](https://pypi.org/project/data4ai/) | [MIT License](https://opensource.org/licenses/MIT) | [Python 3.9+](https://www.python.org/downloads/)
Generate high-quality synthetic datasets using state-of-the-art language models via the OpenRouter API. Perfect for creating training data for LLM fine-tuning.
## ✨ Key Features
- 🤖 **100+ AI Models** - Access to GPT-4, Claude, Llama, and more via OpenRouter
- 📊 **Multiple Formats** - Support for ChatML (default) and Alpaca schemas
- 🔮 **DSPy Integration** - Dynamic prompt optimization for better quality
- 📄 **Document Support** - Generate datasets from PDFs, Word docs, Markdown, and text files
- 🎯 **Quality Features** - Optional Bloom's taxonomy, provenance tracking, and quality verification
- 🤖 **Smart Generation** - Both prompt-based and document-based dataset creation
- ☁️ **HuggingFace Hub** - Direct dataset publishing
- ⚡ **Production Ready** - Rate limiting, checkpointing, deduplication
## 🚀 Quick Start
### Installation
```bash
pip install data4ai            # Standard installation
pip install "data4ai[all]"     # With all optional extras
```
### Set Up Environment Variables
Data4AI is configured through environment variables; only `OPENROUTER_API_KEY` is required:
#### Option 1: Quick Setup (Current Session)
```bash
# Get your API key from https://openrouter.ai/keys
export OPENROUTER_API_KEY="your_key_here"
# Optional: Set a specific model (default: openai/gpt-4o-mini)
export OPENROUTER_MODEL="anthropic/claude-3-5-sonnet" # Or another model
# Optional: Set default dataset schema (default: chatml)
export DEFAULT_SCHEMA="chatml" # Options: chatml, alpaca
# Optional: For publishing to HuggingFace
export HF_TOKEN="your_huggingface_token"
```
#### Option 2: Using .env File
```bash
# Create a .env file in your project directory
echo 'OPENROUTER_API_KEY=your_key_here' > .env
# The tool will automatically load from .env
```
#### Option 3: Permanent Setup
```bash
# Add to your shell config (~/.bashrc, ~/.zshrc, or ~/.profile)
echo 'export OPENROUTER_API_KEY="your_key_here"' >> ~/.bashrc
source ~/.bashrc
```
#### Check Your Setup
```bash
# Verify environment variables are set
echo "OPENROUTER_API_KEY: ${OPENROUTER_API_KEY:0:10}..." # Shows first 10 chars
```
### Generate Your First Dataset
```bash
# Generate from description
data4ai prompt \
    --repo my-dataset \
    --description "Create 10 Python programming questions with answers" \
    --count 10
# View results
cat my-dataset/data.jsonl
```
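
The output is JSON Lines, one example per line. A quick way to inspect it from Python (a minimal sketch, assuming the default ChatML schema and the `my-dataset/data.jsonl` path used above):

```python
import json
from pathlib import Path

# Each line of the output file is one JSON object (JSON Lines)
path = Path("my-dataset/data.jsonl")
rows = [json.loads(line) for line in path.read_text().splitlines() if line.strip()]
print(f"Loaded {len(rows)} examples")

# With the default ChatML schema, every row carries a "messages" list
for message in rows[0]["messages"]:
    print(f"{message['role']}: {message['content'][:80]}")
```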
## 📚 Common Use Cases
### 1. Generate from Natural Language
```bash
data4ai prompt \
    --repo customer-support \
    --description "Create customer support Q&A for a SaaS product" \
    --count 100
```
### 2. Generate from Documents
```bash
# From a single PDF document
data4ai doc research-paper.pdf \
    --repo paper-qa \
    --type qa \
    --count 100

# From an entire folder of documents
data4ai doc /path/to/docs/folder \
    --repo multi-doc-dataset \
    --type qa \
    --count 500 \
    --recursive

# Process only specific file types in a folder
data4ai doc /path/to/docs \
    --repo pdf-only-dataset \
    --file-types pdf \
    --count 200

# From a Word document with summaries
data4ai doc manual.docx \
    --repo manual-summaries \
    --type summary \
    --count 50

# From Markdown with advanced extraction
data4ai doc README.md \
    --repo docs-dataset \
    --type instruction \
    --advanced

# Generate with optional quality features:
#   --taxonomy balanced  use Bloom's taxonomy for diverse questions
#   --provenance         include source references
#   --verify             verify quality (2x API calls)
#   --long-context       merge chunks for better coherence
data4ai doc document.pdf \
    --repo high-quality-dataset \
    --count 200 \
    --taxonomy balanced \
    --provenance \
    --verify \
    --long-context
```
### 3. High-Quality Generation
```bash
# Basic generation (simple and fast)
data4ai doc document.pdf --repo basic-dataset --count 100

# With cognitive diversity using Bloom's taxonomy
data4ai doc document.pdf \
    --repo taxonomy-dataset \
    --count 100 \
    --taxonomy balanced  # Creates questions at all cognitive levels

# With source tracking for verifiable datasets
data4ai doc research-papers/ \
    --repo cited-dataset \
    --count 500 \
    --provenance  # Includes character offsets for each answer

# Full quality mode for production datasets:
#   --chunk-tokens 250   token-based chunking
#   --taxonomy balanced  cognitive diversity
#   --provenance         source tracking
#   --verify             quality verification
#   --long-context       optimized context usage
data4ai doc documents/ \
    --repo production-dataset \
    --count 1000 \
    --chunk-tokens 250 \
    --taxonomy balanced \
    --provenance \
    --verify \
    --long-context
```
### 4. Publish to HuggingFace
```bash
# Generate and publish
data4ai prompt \
    --repo my-public-dataset \
    --description "Educational content about machine learning" \
    --count 200 \
    --huggingface
```
## 📚 Available Commands
### `data4ai prompt`
Generate a dataset from a natural-language description.
```bash
data4ai prompt --repo <name> --description <text> [options]
```
### `data4ai doc`
Generate a dataset from one or more documents (PDF, DOCX, Markdown, and plain-text files).
```bash
data4ai doc <file_or_folder> --repo <name> [options]
```
### `data4ai push`
Upload an existing dataset to the HuggingFace Hub.
```bash
data4ai push --repo <name> [options]
```
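
For example, to upload the dataset generated in the Quick Start (a minimal sketch assuming `HF_TOKEN` is set as described in the setup section; see the detailed usage guide for the full list of options):

```bash
# Requires HF_TOKEN in your environment (see setup above)
export HF_TOKEN="your_huggingface_token"
data4ai push --repo my-dataset
```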
## 🐍 Python API
```python
from data4ai import generate_from_description, generate_from_document

# Generate from description (uses ChatML by default)
result = generate_from_description(
    description="Create Python interview questions",
    repo="python-interviews",
    count=50,
    schema="chatml",  # Optional; ChatML is the default
)

# Generate from document with quality features
result = generate_from_document(
    document_path="research-paper.pdf",
    repo="paper-qa",
    extraction_type="qa",
    count=100,
    taxonomy="balanced",      # Optional: Bloom's taxonomy
    include_provenance=True,  # Optional: source tracking
    verify_quality=True,      # Optional: quality verification
)

print(f"Generated {result['row_count']} examples")
```
## 📋 Supported Schemas
**ChatML** (Default - OpenAI format)
```json
{
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is machine learning?"},
    {"role": "assistant", "content": "Machine learning is..."}
  ]
}
```
**Alpaca** (Instruction tuning)
```json
{
  "instruction": "What is machine learning?",
  "input": "Explain in simple terms",
  "output": "Machine learning is..."
}
```
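
Both schemas carry the same information in different shapes. As a rough illustration (not part of the data4ai API; the system prompt below is an assumption), an Alpaca record maps onto the ChatML structure like this:

```python
def alpaca_to_chatml(record: dict, system_prompt: str = "You are a helpful assistant.") -> dict:
    """Map an Alpaca-style record onto the ChatML message structure."""
    # Fold the optional "input" field into the user turn
    user_content = record["instruction"]
    if record.get("input"):
        user_content += f"\n\n{record['input']}"
    return {
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_content},
            {"role": "assistant", "content": record["output"]},
        ]
    }

example = {
    "instruction": "What is machine learning?",
    "input": "Explain in simple terms",
    "output": "Machine learning is...",
}
print(alpaca_to_chatml(example))
```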
## 🎯 Quality Features (Optional)
All quality features are **optional**; enable them when you need higher-quality datasets:
| Feature | Flag | Description | Performance Impact |
|---------|------|-------------|-------------------|
| **Token Chunking** | `--chunk-tokens N` | Use token count instead of characters | Minimal |
| **Bloom's Taxonomy** | `--taxonomy balanced` | Create cognitively diverse questions | None |
| **Provenance** | `--provenance` | Include source references | Minimal |
| **Quality Verification** | `--verify` | Verify and improve examples | 2x API calls |
| **Long Context** | `--long-context` | Merge chunks for coherence | May reduce API calls |
### When to Use Quality Features
- **Quick Prototyping**: No features needed; fast and simple
- **Production Datasets**: Use `--taxonomy` and `--verify`
- **Academic/Research**: Use all features for maximum quality
- **Citation Required**: Always use `--provenance`
## ⚙️ Configuration
Create a `.env` file:
```bash
OPENROUTER_API_KEY=your_key_here
OPENROUTER_MODEL=openai/gpt-4o-mini # Optional (this is the default)
DEFAULT_SCHEMA=chatml # Optional (this is the default)
HF_TOKEN=your_huggingface_token # For publishing
```
## 📖 Documentation
- [Detailed Usage Guide](docs/DETAILED_USAGE.md) - Complete CLI reference
- [Examples](docs/EXAMPLES.md) - Code examples and recipes
- [API Documentation](docs/API.md) - Python API reference
- [Publishing Guide](docs/PUBLISHING.md) - PyPI publishing instructions
- [All Documentation](docs/README.md) - Complete documentation index
## 🛠️ Development
```bash
# Clone repository
git clone https://github.com/zysec/data4ai.git
cd data4ai
# Install for development
pip install -e ".[dev]"
# Run tests
pytest
# Check code quality
ruff check .
black --check .
```
## 🤝 Contributing
Contributions welcome! Please check our [Contributing Guide](CONTRIBUTING.md).
## 📄 License
MIT License - see [LICENSE](LICENSE) file.
## 🔗 Links
- [PyPI Package](https://pypi.org/project/data4ai/)
- [GitHub Repository](https://github.com/zysec/data4ai)
- [Documentation](https://github.com/zysec/data4ai/tree/main/docs)
- [Issue Tracker](https://github.com/zysec/data4ai/issues)
---
**Made with ❤️ by [ZySec AI](https://zysec.ai)**