# Chatan
Create diverse synthetic datasets, either from scratch or by augmenting an existing dataset. Define your dataset schema as a set of generators, typically LLMs paired with a prompt that describes the kind of examples you want.
## Installation
Basic installation (requires Python 3.8 or later; includes OpenAI, Anthropic, and core functionality):
```bash
pip install chatan
```
With optional features:
```bash
# For local model support (transformers + PyTorch)
# (the quotes keep shells such as zsh from expanding the square brackets)
pip install "chatan[local]"
# For advanced evaluation features (semantic similarity, BLEU score)
pip install "chatan[eval]"
# For all optional features
pip install "chatan[all]"
```
## Getting Started
```python
import chatan
# Create a generator
gen = chatan.generator("openai", "YOUR_API_KEY")
# Define a dataset schema
ds = chatan.dataset({
    "topic": chatan.sample.choice(["Python", "JavaScript", "Rust"]),
    "prompt": gen("write a programming question about {topic}"),
    "response": gen("answer this question: {prompt}")
})
# Generate the data with a progress bar
df = ds.generate(n=10)
```
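The schema above lets one column reference another through `{column}` placeholders. The following is a minimal sketch of how such references could be resolved in dependency order. It is not chatan's implementation: `placeholders`, `fake_llm`, and `generate_row` are hypothetical names, and `fake_llm` only echoes its prompt so the sketch runs without an API key.

```python
import random
import string


def placeholders(template):
    """Return the field names referenced as {name} in a template string."""
    return [name for _, name, _, _ in string.Formatter().parse(template) if name]


def fake_llm(prompt):
    """Hypothetical stand-in for an LLM call; it just echoes the prompt."""
    return f"<completion for: {prompt}>"


def generate_row(schema, rng):
    """Fill one row, resolving each column after the columns it references."""
    row = {}
    pending = dict(schema)
    while pending:
        progressed = False
        for name, spec in list(pending.items()):
            if callable(spec):
                # Sampler column: no dependencies, so call it directly.
                row[name] = spec(rng)
            elif all(dep in row for dep in placeholders(spec)):
                # Prompt column: every referenced column is already filled.
                row[name] = fake_llm(spec.format(**row))
            else:
                continue
            del pending[name]
            progressed = True
        if not progressed:
            # Nothing could be resolved: a cycle or an unknown column name.
            raise ValueError(f"unresolvable columns: {sorted(pending)}")
    return row


# Columns are listed out of order on purpose to show the resolution step.
schema = {
    "response": "answer this question: {prompt}",
    "prompt": "write a programming question about {topic}",
    "topic": lambda rng: rng.choice(["Python", "JavaScript", "Rust"]),
}

for seed in range(3):
    row = generate_row(schema, random.Random(seed))
    print(row["topic"], "->", row["response"])
```

Resolving by dependency rather than by declaration order means the schema can list columns in any order, and a circular reference fails with an error instead of looping forever.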
## Generator Options
### API-based Generators (included in base install)
```python
# OpenAI
gen = chatan.generator("openai", "YOUR_OPENAI_API_KEY")
# Anthropic
gen = chatan.generator("anthropic", "YOUR_ANTHROPIC_API_KEY")
```
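Hard-coding keys in source files makes them easy to leak. A minimal sketch of reading the key from an environment variable instead is shown below. The helper `load_api_key` and the variable name `OPENAI_API_KEY` are assumptions for illustration and are not part of chatan.

```python
import os


def load_api_key(env_var):
    """Return the API key stored in the given environment variable."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable first.")
    return key


# Usage with the generator call shown above (left commented out so the
# sketch runs without chatan installed or network access):
# gen = chatan.generator("openai", load_api_key("OPENAI_API_KEY"))
```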
### Local Model Support (requires `pip install chatan[local]`)
```python
# HuggingFace Transformers
gen = chatan.generator("transformers", model="microsoft/DialoGPT-medium")
```
## Examples
### Create data mixes
```python
from chatan import dataset, generator, sample
gen = generator("openai", "YOUR_API_KEY")
# Topics to mix; each row samples one of them
mix = [
    "san antonio, tx",
    "marfa, tx",
    "paris, fr"
]
ds = dataset({
    "id": sample.uuid(),
    "topic": sample.choice(mix),
    "prompt": gen("write an example question about the history of {topic}"),
    "response": gen("respond to: {prompt}"),
})
```
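If some topics should appear more often than others, one option is to repeat them in the list before sampling. The sketch below uses only the Python standard library. `expand_mix` is a hypothetical helper, and it assumes the sampler draws uniformly from the list it is given, which the source does not state explicitly.

```python
import random
from collections import Counter


def expand_mix(weights):
    """Turn {item: integer_weight} into a flat list with repeated items."""
    items = []
    for item, weight in weights.items():
        if weight < 0:
            raise ValueError("weights must be non-negative")
        items.extend([item] * weight)
    return items


# "san antonio, tx" should be drawn about twice as often as the others.
mix = expand_mix({"san antonio, tx": 2, "marfa, tx": 1, "paris, fr": 1})

# Simulate uniform draws from the expanded list to inspect the proportions.
rng = random.Random(0)
draws = Counter(rng.choice(mix) for _ in range(1000))
print(draws)
```

The expanded `mix` list could then be passed to `sample.choice(mix)` as in the example above.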
### Augment datasets
```python
from chatan import generator, dataset, sample
from datasets import load_dataset
gen = generator("openai", "YOUR_API_KEY")
hf_data = load_dataset("some/dataset")
ds = dataset({
    "original_prompt": sample.from_dataset(hf_data, "prompt"),
    "variation": gen("rewrite this prompt: {original_prompt}"),
    "response": gen("respond to: {variation}")
})
```
## Evaluation
Evaluate rows inline or compute aggregate metrics:
```python
from chatan import dataset, eval, sample
ds = dataset({
    "col1": sample.choice(["a", "a", "b"]),
    "col2": "b",
    "score": eval.exact_match("col1", "col2")
})
df = ds.generate()
aggregate = ds.evaluate({
    "exact_match": ds.eval.exact_match("col1", "col2")
})
```
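To make the difference between inline and aggregate evaluation concrete, here is a minimal plain-Python sketch. It assumes a row-level exact match scores 1.0 or 0.0 and that the aggregate is the mean over rows. The function names are hypothetical and chatan's actual return types may differ.

```python
rows = [
    {"col1": "a", "col2": "b"},
    {"col1": "b", "col2": "b"},
    {"col1": "a", "col2": "b"},
    {"col1": "b", "col2": "b"},
]


def exact_match(row, left, right):
    """Row-level score: 1.0 if the two columns are equal, else 0.0."""
    return 1.0 if row[left] == row[right] else 0.0


def mean_exact_match(rows, left, right):
    """Aggregate score: fraction of rows where the two columns match."""
    if not rows:
        return 0.0
    return sum(exact_match(r, left, right) for r in rows) / len(rows)


scores = [exact_match(r, "col1", "col2") for r in rows]
print(scores)                                  # [0.0, 1.0, 0.0, 1.0]
print(mean_exact_match(rows, "col1", "col2"))  # 0.5
```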
### Advanced Evaluation (requires `pip install chatan[eval]`)
```python
# Both calls reuse the dataset `ds` from the Evaluation example above.
# Semantic similarity using sentence transformers
aggregate = ds.evaluate({
    "semantic_sim": ds.eval.semantic_similarity("col1", "col2")
})
# BLEU score evaluation
aggregate = ds.evaluate({
    "bleu": ds.eval.bleu_score("col1", "col2")
})
```
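For intuition about what these two metrics measure, the sketch below implements simplified stand-ins with the standard library only. `bleu1` is unigram-only BLEU (clipped precision times the brevity penalty), not the full n-gram metric. `cosine_bow` compares bag-of-words counts rather than sentence-transformer embeddings. Neither is chatan's implementation, and both names are hypothetical.

```python
import math
from collections import Counter


def bleu1(candidate, reference):
    """Clipped unigram precision times the BLEU brevity penalty."""
    cand = candidate.split()
    ref = reference.split()
    if not cand:
        return 0.0
    cand_counts = Counter(cand)
    ref_counts = Counter(ref)
    # Clip each candidate token count by its count in the reference.
    overlap = sum(min(n, ref_counts[tok]) for tok, n in cand_counts.items())
    precision = overlap / len(cand)
    # Penalize candidates that are shorter than the reference.
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * precision


def cosine_bow(a, b):
    """Cosine similarity between bag-of-words count vectors."""
    va, vb = Counter(a.split()), Counter(b.split())
    dot = sum(va[t] * vb[t] for t in va)
    norm_a = math.sqrt(sum(v * v for v in va.values()))
    norm_b = math.sqrt(sum(v * v for v in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


print(bleu1("the cat sat", "the cat sat on the mat"))
print(cosine_bow("the cat sat", "the cat sat on the mat"))
```

Word-overlap measures like these reward shared tokens, while embedding-based similarity can also credit paraphrases that share no words, which is why the two are worth reporting separately.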
## Installation Options Summary
| Feature | Install Command | What's Included |
|---------|----------------|-----------------|
| **Basic** | `pip install chatan` | OpenAI, Anthropic, core sampling, basic evaluation |
| **Local Models** | `pip install chatan[local]` | + HuggingFace Transformers, PyTorch |
| **Advanced Eval** | `pip install chatan[eval]` | + Semantic similarity, BLEU scores, NLTK |
| **Everything** | `pip install chatan[all]` | All features above |
## Citation
If you use this code in your research, please cite:
```bibtex
@software{reetz2025chatan,
  author = {Reetz, Christian},
  title = {chatan: Create synthetic datasets with LLM generators.},
  url = {https://github.com/cdreetz/chatan},
  year = {2025}
}
```
## Contributing
Community contributions are more than welcome: bug reports, bug fixes, feature requests, and feature additions. Please refer to the Issues tab.