# Data4AI 🤖
> **Generate high-quality AI training datasets from simple descriptions or documents**
Data4AI makes it easy to create instruction-tuning datasets for training and fine-tuning language models. Whether you're building domain-specific models or need quality training data, Data4AI has you covered.
## ✨ Features
- 🎯 **Simple Commands** - Generate datasets from descriptions or documents
- 📚 **Multiple Formats** - Support for ChatML, Alpaca, and custom schemas
- 🔄 **Smart Processing** - Automatic chunking, deduplication, and quality validation
- 🏷️ **Cognitive Taxonomy** - Built-in Bloom's taxonomy for balanced learning
- ☁️ **Direct Upload** - Push datasets directly to HuggingFace Hub
- 🌐 **100+ Models** - Access to GPT, Claude, Llama, and more via OpenRouter
## 🚀 Quick Start
### Install
```bash
pip install data4ai
```
### Get API Key
Get your free API key from [OpenRouter](https://openrouter.ai/keys):
```bash
export OPENROUTER_API_KEY="your_key_here"
```
### Generate Your First Dataset
**From a description:**
```bash
data4ai prompt \
--repo my-dataset \
--description "Python programming questions for beginners" \
--count 100
```
**From documents:**
```bash
data4ai doc document.pdf \
--repo doc-dataset \
--count 100
```
**From YouTube videos:**
```bash
data4ai youtube @3Blue1Brown \
--repo math-videos \
--count 100
```
**Upload to HuggingFace:**
```bash
data4ai push --repo my-dataset
```
That's it! Your dataset is ready at `outputs/datasets/my-dataset/data.jsonl` 🎉
## 📖 Documentation
- **[Examples](docs/EXAMPLES.md)** - Real-world usage examples
- **[Commands](docs/COMMANDS.md)** - Complete CLI reference
- **[Features](docs/FEATURES.md)** - Advanced features and options
- **[YouTube Integration](docs/YOUTUBE.md)** - Extract datasets from YouTube videos
- **[Troubleshooting](docs/TROUBLESHOOTING.md)** - Common issues and solutions
- **[Runnable Examples](examples/)** - Ready-to-run example scripts
## 🤝 Community
### Contributing
We welcome contributions! See our [Contributing Guide](CONTRIBUTING.md) for:
- Development setup
- Code style guidelines
- Testing requirements
- Pull request process
### Getting Help
- 🐛 **Bug reports**: [GitHub Issues](https://github.com/zysec-ai/data4ai/issues)
- 💬 **Questions**: [GitHub Discussions](https://github.com/zysec-ai/data4ai/discussions)
- 📧 **Contact**: [research@zysec.ai](mailto:research@zysec.ai)
### Project Structure
```
data4ai/
├── data4ai/          # Core library code
├── docs/             # User documentation
├── tests/            # Test suite
├── README.md         # You are here
├── CONTRIBUTING.md   # How to contribute
└── CHANGELOG.md      # Release history
```
## 🎯 Use Cases
**🏥 Medical Training Data**
```bash
data4ai prompt --repo medical-qa \
--description "Medical diagnosis Q&A for common symptoms" \
--count 500
```
**⚖️ Legal Assistant Data**
```bash
data4ai doc legal-docs/ --repo legal-assistant --count 1000
```
**💻 Code Training Data**
```bash
data4ai prompt --repo code-qa \
--description "Python debugging and best practices" \
--count 300
```
**📺 Educational Video Content**
```bash
# Programming tutorials
data4ai youtube --search "python tutorial,programming" --repo python-course --count 200
# Educational channels
data4ai youtube @3Blue1Brown --repo math-education --count 150
# Conference talks
data4ai youtube @pycon --repo conference-talks --count 100
```
## 🛠️ Advanced Usage
### Quality Control
```bash
data4ai doc document.pdf \
--repo high-quality \
--verify \
--taxonomy advanced \
--dedup-strategy content
```
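The `--dedup-strategy content` flag suggests that duplicates are detected by comparing the text of each example. This README does not describe how Data4AI implements that, so the following is only a minimal sketch of one common approach: hash a normalized form of each record's messages and keep the first occurrence. The names `content_key` and `dedup_by_content` are hypothetical and are not part of Data4AI's API.

```python
import hashlib


def content_key(record: dict) -> str:
    """Hash the record's message text, ignoring case and extra whitespace."""
    parts = []
    for message in record.get("messages", []):
        # Collapse whitespace and lowercase so trivial variations hash the same.
        text = " ".join(message.get("content", "").split()).lower()
        parts.append(f"{message.get('role', '')}:{text}")
    return hashlib.sha256("\n".join(parts).encode("utf-8")).hexdigest()


def dedup_by_content(records: list[dict]) -> list[dict]:
    """Keep the first record for each distinct content hash, preserving order."""
    seen: set[str] = set()
    unique = []
    for record in records:
        key = content_key(record)
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


records = [
    {"messages": [{"role": "user", "content": "What is a list?"}]},
    # Same text as the first record once case and spacing are normalized.
    {"messages": [{"role": "user", "content": "what is  a LIST?"}]},
    {"messages": [{"role": "user", "content": "What is a tuple?"}]},
]
unique_records = dedup_by_content(records)
print(len(unique_records))
```

Exact hashing only catches near-identical text; paraphrased duplicates would need a fuzzier comparison.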
### Batch Processing
```bash
data4ai doc documents/ \
--repo batch-dataset \
--count 1000 \
--batch-size 20 \
--recursive
```
### Custom Models
```bash
export OPENROUTER_MODEL="anthropic/claude-3-5-sonnet"
data4ai prompt --repo custom-model --description "..." --count 100
```
## 🏗️ Architecture
Data4AI is built with:
- **Async Processing** - Fast concurrent generation
- **DSPy Integration** - Advanced prompt optimization
- **Quality Validation** - Automatic content verification
- **Atomic Writes** - Safe file operations
- **Schema Validation** - Ensures data consistency
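To illustrate what "atomic writes" usually means in practice, here is a minimal sketch, assuming the common pattern of writing to a temporary file and then renaming it into place. It is not taken from Data4AI's source; the function name `write_jsonl_atomically` is hypothetical.

```python
import json
import os
import tempfile


def write_jsonl_atomically(records: list[dict], path: str) -> None:
    """Write records as JSONL so readers never see a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    os.makedirs(directory, exist_ok=True)
    # Create the temporary file in the target directory so the final
    # rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            for record in records:
                handle.write(json.dumps(record, ensure_ascii=False) + "\n")
            handle.flush()
            os.fsync(handle.fileno())
        # os.replace swaps the finished file into place in one step.
        os.replace(tmp_path, path)
    except BaseException:
        # Remove the temporary file if anything went wrong, then re-raise.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


out_dir = tempfile.mkdtemp()
out_path = os.path.join(out_dir, "data.jsonl")
write_jsonl_atomically([{"messages": []}], out_path)
```

If the process is interrupted mid-write, the previous `data.jsonl` (if any) is left untouched, because only a complete temporary file is ever renamed over it.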
## 📊 Sample Output
```json
{
  "messages": [
    {
      "role": "user",
      "content": "How do I handle exceptions in Python?"
    },
    {
      "role": "assistant",
      "content": "In Python, use try-except blocks to handle exceptions: ..."
    }
  ],
  "taxonomy_level": "understand"
}
```
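Records in this shape can be sanity-checked with a few lines of Python before training. The sketch below is independent of Data4AI's own validation. It assumes one JSON object per line (JSONL), the `messages` and `taxonomy_level` fields shown above, and lowercase labels from the revised Bloom's taxonomy; only `"understand"` is confirmed by the sample. The helper names are hypothetical.

```python
import json
from collections import Counter

# Revised Bloom's taxonomy levels. That Data4AI emits exactly these lowercase
# labels is an assumption based on the "understand" value in the sample.
BLOOM_LEVELS = ("remember", "understand", "apply", "analyze", "evaluate", "create")


def check_record(record: dict) -> list[str]:
    """Return a list of problems found in one ChatML-style record."""
    messages = record.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["'messages' must be a non-empty list"]
    problems = []
    for index, message in enumerate(messages):
        if not isinstance(message, dict):
            problems.append(f"message {index}: not an object")
            continue
        if message.get("role") not in ("system", "user", "assistant"):
            problems.append(f"message {index}: unexpected role {message.get('role')!r}")
        content = message.get("content")
        if not isinstance(content, str) or not content.strip():
            problems.append(f"message {index}: empty content")
    level = record.get("taxonomy_level")
    if level is not None and level not in BLOOM_LEVELS:
        problems.append(f"unknown taxonomy_level {level!r}")
    return problems


def summarize(lines):
    """Count taxonomy levels of valid records and collect problems by line number."""
    counts = Counter()
    errors = {}
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue
        record = json.loads(line)
        problems = check_record(record)
        if problems:
            errors[number] = problems
        else:
            counts[record.get("taxonomy_level", "unlabeled")] += 1
    return counts, errors


sample = [
    '{"messages": [{"role": "user", "content": "How do I handle exceptions in Python?"}, '
    '{"role": "assistant", "content": "Use try-except blocks."}], "taxonomy_level": "understand"}',
    '{"messages": [], "taxonomy_level": "apply"}',
]
counts, errors = summarize(sample)
print(dict(counts), errors)
```

To check a generated dataset, pass an open file instead of the `sample` list, for example `summarize(open("outputs/datasets/my-dataset/data.jsonl", encoding="utf-8"))`.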
## 🔧 Configuration
### Environment Variables
```bash
# Required
export OPENROUTER_API_KEY="your_key"
# Optional
export OPENROUTER_MODEL="openai/gpt-4o-mini" # Default model
export HF_TOKEN="your_hf_token" # For HuggingFace uploads
export OUTPUT_DIR="./outputs/datasets" # Default output directory
```
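When driving Data4AI from a script, it can help to fail early if the required key is missing. The following is a minimal sketch, assuming the variable names and defaults listed above; `resolve_settings` is a hypothetical helper, not a Data4AI function.

```python
REQUIRED = ("OPENROUTER_API_KEY",)
# Defaults taken from the comments in the environment variable listing above.
OPTIONAL_DEFAULTS = {
    "OPENROUTER_MODEL": "openai/gpt-4o-mini",
    "OUTPUT_DIR": "./outputs/datasets",
}


def resolve_settings(env) -> dict:
    """Validate required variables and fill in defaults for optional ones."""
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise RuntimeError("Missing required environment variables: " + ", ".join(missing))
    settings = {name: env[name] for name in REQUIRED}
    for name, default in OPTIONAL_DEFAULTS.items():
        settings[name] = env.get(name) or default
    # HF_TOKEN is only needed for uploads, so it may be absent.
    settings["HF_TOKEN"] = env.get("HF_TOKEN")
    return settings


settings = resolve_settings({"OPENROUTER_API_KEY": "your_key"})
print(settings["OPENROUTER_MODEL"])
```

In a real script, pass `os.environ` instead of the literal dictionary.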
### Config File
Create `.data4ai.yaml` in your project:
```yaml
default_model: "anthropic/claude-3-5-sonnet"
default_schema: "chatml"
default_count: 100
quality_check: true
```
## 🚀 Roadmap
- [ ] **Custom Schema Support** - Define your own data formats
- [ ] **Local Model Support** - Use local LLMs (Ollama, vLLM)
- [ ] **Multi-language Datasets** - Generate data in multiple languages
- [ ] **Dataset Analytics** - Advanced quality metrics and visualization
- [ ] **API Service** - RESTful API for dataset generation
## 📈 Performance
- **Speed**: Generate 100 examples in ~2 minutes
- **Quality**: Built-in validation and deduplication
- **Scale**: Tested with datasets up to 100K examples
- **Memory**: Efficient streaming for large documents
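As an illustration of how streaming keeps memory use flat for large documents, here is a minimal sketch of a chunker that reads a text stream piece by piece and yields overlapping chunks. It assumes simple character-based chunking. Data4AI's actual chunking strategy is not described in this README, and `stream_chunks` with its parameters is hypothetical.

```python
import io
from typing import Iterator, TextIO


def stream_chunks(source: TextIO, chunk_size: int = 1000, overlap: int = 100) -> Iterator[str]:
    """Yield overlapping text chunks without loading the whole document."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    carry = ""  # tail of the previous chunk, repeated at the start of the next
    while True:
        fresh = source.read(chunk_size - len(carry))
        if not fresh:
            return  # end of input: nothing new to emit
        chunk = carry + fresh
        yield chunk
        carry = chunk[-overlap:] if overlap else ""


chunks = list(stream_chunks(io.StringIO("abcdefghij"), chunk_size=4, overlap=1))
print(chunks)
```

Only one chunk is held in memory at a time, so the same generator works on an open file of any size.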
## ⭐ Show Your Support
If Data4AI helps you, please:
- ⭐ Star this repository
- 🐦 Share on social media
- 🤝 Contribute improvements
- 💝 Sponsor the project
## 📄 License
MIT License - see [LICENSE](LICENSE) file for details.
## 🏢 About ZySec AI
ZySec AI empowers enterprises to confidently adopt AI where data sovereignty, privacy, and security are non-negotiable, helping them move beyond fragmented, siloed systems into a new era of intelligence, from data to agentic AI, on a single platform. Data4AI is developed by [ZySec AI](https://zysec.ai).
---
<div align="center">
<b>Made with ❤️ by ZySec AI for the open source community</b>
</div>