data4ai


| Field | Value |
| --- | --- |
| Name | data4ai |
| Version | 0.3.0 |
| Summary | Production-ready AI-powered dataset generation for instruction tuning and model fine-tuning |
| Upload time | 2025-08-22 00:52:23 |
| Requires Python | >=3.9 |
| License | MIT |
| Keywords | ai, dataset, instruction-tuning, llm, machine-learning, training-data |
| Requirements | No requirements were recorded. |

# Data4AI 🤖

[![PyPI version](https://badge.fury.io/py/data4ai.svg)](https://pypi.org/project/data4ai/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![Python 3.9+](https://img.shields.io/badge/python-3.9+-blue.svg)](https://www.python.org/downloads/)
[![GitHub Stars](https://img.shields.io/github/stars/zysec-ai/data4ai.svg)](https://github.com/zysec-ai/data4ai/stargazers)

> **Generate high-quality AI training datasets from simple descriptions or documents**

Data4AI makes it easy to create instruction-tuning datasets for training and fine-tuning language models. Whether you're building domain-specific models or need quality training data, Data4AI has you covered.

## ✨ Features

- 🎯 **Simple Commands** - Generate datasets from descriptions or documents  
- 📚 **Multiple Formats** - Support for ChatML, Alpaca, and custom schemas
- 🔄 **Smart Processing** - Automatic chunking, deduplication, and quality validation
- 🏷️ **Cognitive Taxonomy** - Built-in Bloom's taxonomy for balanced learning
- ☁️ **Direct Upload** - Push datasets directly to HuggingFace Hub
- 🌐 **100+ Models** - Access to GPT, Claude, Llama, and more via OpenRouter

## 🚀 Quick Start

### Install
```bash
pip install data4ai
```

### Get API Key
Get your free API key from [OpenRouter](https://openrouter.ai/keys):
```bash
export OPENROUTER_API_KEY="your_key_here"
```

### Generate Your First Dataset

**From a description:**
```bash
data4ai prompt \
  --repo my-dataset \
  --description "Python programming questions for beginners" \
  --count 100
```

**From documents:**
```bash
data4ai doc document.pdf \
  --repo doc-dataset \
  --count 100
```

**From YouTube videos:**
```bash
data4ai youtube @3Blue1Brown \
  --repo math-videos \
  --count 100
```

**Upload to HuggingFace:**
```bash
data4ai push --repo my-dataset
```

That's it! Your dataset is ready at `outputs/datasets/my-dataset/data.jsonl` 🎉
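
To inspect the generated file from Python, a minimal sketch follows. It assumes the file is standard JSON Lines (one JSON object per line); `load_jsonl` is a hypothetical helper written for this example and is not part of the Data4AI API.

```python
import json
from pathlib import Path


def load_jsonl(path):
    """Read a JSON Lines file into a list of dicts, skipping blank lines.

    Hypothetical helper for illustration; not a Data4AI function.
    """
    records = []
    with Path(path).open(encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines between records
            try:
                records.append(json.loads(line))
            except json.JSONDecodeError as exc:
                # Report which line is malformed instead of failing silently.
                raise ValueError(
                    f"{path}:{line_number}: invalid JSON ({exc.msg})"
                ) from exc
    return records


if __name__ == "__main__":
    # Path taken from the Quick Start above; adjust if OUTPUT_DIR differs.
    dataset_path = Path("outputs/datasets/my-dataset/data.jsonl")
    if dataset_path.exists():
        rows = load_jsonl(dataset_path)
        print(f"Loaded {len(rows)} records")
```

Reading line by line keeps error messages precise, which helps when a single record is malformed.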

## 📖 Documentation

- **[Examples](docs/EXAMPLES.md)** - Real-world usage examples
- **[Commands](docs/COMMANDS.md)** - Complete CLI reference  
- **[Features](docs/FEATURES.md)** - Advanced features and options
- **[YouTube Integration](docs/YOUTUBE.md)** - Extract datasets from YouTube videos
- **[Troubleshooting](docs/TROUBLESHOOTING.md)** - Common issues and solutions
- **[Runnable Examples](examples/)** - Ready-to-run example scripts

## 🤝 Community

### Contributing
We welcome contributions! See our [Contributing Guide](CONTRIBUTING.md) for:
- Development setup
- Code style guidelines  
- Testing requirements
- Pull request process

### Getting Help
- 🐛 **Bug reports**: [GitHub Issues](https://github.com/zysec-ai/data4ai/issues)
- 💬 **Questions**: [GitHub Discussions](https://github.com/zysec-ai/data4ai/discussions)
- 📧 **Contact**: [research@zysec.ai](mailto:research@zysec.ai)

### Project Structure
```
data4ai/
├── data4ai/           # Core library code
├── docs/             # User documentation
├── tests/            # Test suite
├── README.md         # You are here
├── CONTRIBUTING.md   # How to contribute
└── CHANGELOG.md      # Release history
```

## 🎯 Use Cases

**🏥 Medical Training Data**
```bash
data4ai prompt --repo medical-qa \
  --description "Medical diagnosis Q&A for common symptoms" \
  --count 500
```

**⚖️ Legal Assistant Data**
```bash
data4ai doc legal-docs/ --repo legal-assistant --count 1000
```

**💻 Code Training Data**
```bash
data4ai prompt --repo code-qa \
  --description "Python debugging and best practices" \
  --count 300
```

**📺 Educational Video Content**
```bash
# Programming tutorials
data4ai youtube --search "python tutorial,programming" --repo python-course --count 200

# Educational channels  
data4ai youtube @3Blue1Brown --repo math-education --count 150

# Conference talks
data4ai youtube @pycon --repo conference-talks --count 100
```

## 🛠️ Advanced Usage

### Quality Control
```bash
data4ai doc document.pdf \
  --repo high-quality \
  --verify \
  --taxonomy advanced \
  --dedup-strategy content
```
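
This README does not describe what `--dedup-strategy content` does internally. As a rough illustration of content-based deduplication in general, the sketch below hashes the normalized message text of each record and keeps the first occurrence. The helper names and the normalization rules (lowercasing, collapsing whitespace) are assumptions made for this example, not Data4AI behavior.

```python
import hashlib


def content_key(record):
    """Build a stable hash from the normalized text of all messages.

    Assumption: records follow the ChatML-style layout from Sample Output,
    with a "messages" list of {"role", "content"} objects.
    """
    texts = [m.get("content", "") for m in record.get("messages", [])]
    # Lowercase and collapse whitespace so trivial variations compare equal.
    normalized = "\n".join(" ".join(t.lower().split()) for t in texts)
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def deduplicate(records):
    """Keep the first record for each distinct content key, preserving order."""
    seen = set()
    unique = []
    for record in records:
        key = content_key(record)
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique
```

Exact-hash matching only catches near-verbatim repeats; paraphrased duplicates would need a similarity measure instead.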

### Batch Processing
```bash
data4ai doc documents/ \
  --repo batch-dataset \
  --count 1000 \
  --batch-size 20 \
  --recursive
```
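
To make the relationship between `--count` and `--batch-size` concrete, here is a small sketch of how a target count could be split into batches. It assumes `--batch-size` is the number of examples requested per batch, which this README does not state explicitly; `plan_batches` is a hypothetical helper.

```python
def plan_batches(count, batch_size):
    """Split a target example count into a list of per-batch sizes.

    Hypothetical helper; assumes batch_size is the number of examples
    requested per batch.
    """
    if count < 0:
        raise ValueError("count must be non-negative")
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    full_batches, remainder = divmod(count, batch_size)
    sizes = [batch_size] * full_batches
    if remainder:
        sizes.append(remainder)  # final, smaller batch
    return sizes
```

Under that assumption, the command above (`--count 1000 --batch-size 20`) corresponds to `plan_batches(1000, 20)`, which yields 50 batches of 20 examples each.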

### Custom Models
```bash
export OPENROUTER_MODEL="anthropic/claude-3-5-sonnet"
data4ai prompt --repo custom-model --description "..." --count 100
```

## 🏗️ Architecture

Data4AI is built with:
- **Async Processing** - Fast concurrent generation
- **DSPy Integration** - Advanced prompt optimization  
- **Quality Validation** - Automatic content verification
- **Atomic Writes** - Safe file operations
- **Schema Validation** - Ensures data consistency

## 📊 Sample Output

```json
{
  "messages": [
    {
      "role": "user", 
      "content": "How do I handle exceptions in Python?"
    },
    {
      "role": "assistant",
      "content": "In Python, use try-except blocks to handle exceptions: ..."
    }
  ],
  "taxonomy_level": "understand"
}
```
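
A record in this shape can be sanity-checked with a few lines of Python. The sketch below is an assumption-laden illustration: the set of allowed roles and the helper names are invented for this example, and only the `messages`, `role`, `content`, and `taxonomy_level` keys come from the sample above.

```python
from collections import Counter

# Assumption: these roles are typical for ChatML-style data;
# the README only shows "user" and "assistant".
ALLOWED_ROLES = {"system", "user", "assistant"}


def validate_record(record):
    """Return a list of problems found in one record (empty if it looks fine)."""
    messages = record.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["'messages' must be a non-empty list"]
    problems = []
    for index, message in enumerate(messages):
        if not isinstance(message, dict):
            problems.append(f"message {index} is not an object")
            continue
        if message.get("role") not in ALLOWED_ROLES:
            problems.append(
                f"message {index} has unexpected role {message.get('role')!r}"
            )
        content = message.get("content")
        if not isinstance(content, str) or not content.strip():
            problems.append(f"message {index} has empty content")
    return problems


def taxonomy_distribution(records):
    """Count records per taxonomy_level to check how balanced a dataset is."""
    return Counter(r.get("taxonomy_level", "unlabelled") for r in records)
```

Running `taxonomy_distribution` over a loaded dataset gives a quick view of whether the taxonomy levels are spread evenly or concentrated in one level.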

## 🔧 Configuration

### Environment Variables
```bash
# Required
export OPENROUTER_API_KEY="your_key"

# Optional  
export OPENROUTER_MODEL="openai/gpt-4o-mini"  # Default model
export HF_TOKEN="your_hf_token"               # For HuggingFace uploads
export OUTPUT_DIR="./outputs/datasets"       # Default output directory
```
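
For scripts that wrap the CLI, the same variables can be read from Python. This is a minimal sketch: the `Settings` class and `load_settings` function are hypothetical, while the variable names and default values are taken from the block above.

```python
import os
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Settings:
    """Hypothetical container for the environment variables listed above."""

    api_key: str
    model: str
    hf_token: Optional[str]
    output_dir: str


def load_settings(environ=None):
    """Read settings from a mapping (defaults to os.environ)."""
    env = os.environ if environ is None else environ
    api_key = env.get("OPENROUTER_API_KEY")
    if not api_key:
        # The README marks this variable as required.
        raise RuntimeError("OPENROUTER_API_KEY is not set")
    return Settings(
        api_key=api_key,
        model=env.get("OPENROUTER_MODEL", "openai/gpt-4o-mini"),
        hf_token=env.get("HF_TOKEN"),  # only needed for HuggingFace uploads
        output_dir=env.get("OUTPUT_DIR", "./outputs/datasets"),
    )
```

Accepting a mapping argument makes the function easy to test without touching the real environment.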

### Config File
Create `.data4ai.yaml` in your project:
```yaml
default_model: "anthropic/claude-3-5-sonnet"
default_schema: "chatml" 
default_count: 100
quality_check: true
```

## 🚀 Roadmap

- [ ] **Custom Schema Support** - Define your own data formats
- [ ] **Local Model Support** - Use local LLMs (Ollama, vLLM)
- [ ] **Multi-language Datasets** - Generate data in multiple languages  
- [ ] **Dataset Analytics** - Advanced quality metrics and visualization
- [ ] **API Service** - RESTful API for dataset generation

## 📈 Performance

- **Speed**: Generate 100 examples in ~2 minutes
- **Quality**: Built-in validation and deduplication
- **Scale**: Tested with datasets up to 100K examples
- **Memory**: Efficient streaming for large documents

## ⭐ Show Your Support

If Data4AI helps you, please:
- ⭐ Star this repository
- 🐦 Share on social media
- 🤝 Contribute improvements
- 💝 Sponsor the project

## 📄 License

MIT License - see [LICENSE](LICENSE) file for details.

## 🏒 About ZySec AI

ZySec AI empowers enterprises to confidently adopt AI where data sovereignty, privacy, and security are non-negotiableβ€”helping them move beyond fragmented, siloed systems into a new era of intelligence, from data to agentic AI, on a single platform. Data4AI is developed by [ZySec AI](https://zysec.ai).

---

<div align="center">
  <b>Made with ❤️ by ZySec AI for the open source community</b>
</div>
            

Raw data

            {
    "_id": null,
    "home_page": null,
    "name": "data4ai",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.9",
    "maintainer_email": null,
    "keywords": "ai, dataset, instruction-tuning, llm, machine-learning, training-data",
    "author": null,
    "author_email": "ZySec AI <support@zysec.ai>",
    "download_url": "https://files.pythonhosted.org/packages/77/60/c89d30c6c2674412cf6e50d0ea697f7a2134d584189261c0d66a786379cc/data4ai-0.3.0.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "Production-ready AI-powered dataset generation for instruction tuning and model fine-tuning",
    "version": "0.3.0",
    "project_urls": {
        "Documentation": "https://github.com/zysec-ai/data4ai/blob/main/README.md",
        "Homepage": "https://github.com/zysec-ai/data4ai",
        "Issues": "https://github.com/zysec-ai/data4ai/issues",
        "Repository": "https://github.com/zysec-ai/data4ai"
    },
    "split_keywords": [
        "ai",
        " dataset",
        " instruction-tuning",
        " llm",
        " machine-learning",
        " training-data"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "150dd9b63e5182c466398642727cbea230b5c8a1d7d9451a4426eab28230d736",
                "md5": "e9a96fa3b821ed13b44833a9ef355dfa",
                "sha256": "f92be143b0c7d22970484016274860573b25f157c1a5a4c58c1de2edee3e57db"
            },
            "downloads": -1,
            "filename": "data4ai-0.3.0-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "e9a96fa3b821ed13b44833a9ef355dfa",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.9",
            "size": 103716,
            "upload_time": "2025-08-22T00:52:21",
            "upload_time_iso_8601": "2025-08-22T00:52:21.392507Z",
            "url": "https://files.pythonhosted.org/packages/15/0d/d9b63e5182c466398642727cbea230b5c8a1d7d9451a4426eab28230d736/data4ai-0.3.0-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "7760c89d30c6c2674412cf6e50d0ea697f7a2134d584189261c0d66a786379cc",
                "md5": "f24382f7bf89e02a92b460e0eebeaebc",
                "sha256": "391bab837e64c39571e71c1d18c07150916462482539393211f8739ca4eaf3a3"
            },
            "downloads": -1,
            "filename": "data4ai-0.3.0.tar.gz",
            "has_sig": false,
            "md5_digest": "f24382f7bf89e02a92b460e0eebeaebc",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.9",
            "size": 130104,
            "upload_time": "2025-08-22T00:52:23",
            "upload_time_iso_8601": "2025-08-22T00:52:23.050949Z",
            "url": "https://files.pythonhosted.org/packages/77/60/c89d30c6c2674412cf6e50d0ea697f7a2134d584189261c0d66a786379cc/data4ai-0.3.0.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-08-22 00:52:23",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "zysec-ai",
    "github_project": "data4ai",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "lcname": "data4ai"
}
        