chatan


| Field | Value |
|-------|-------|
| Name | chatan |
| Version | 0.2.1 |
| home_page | None |
| Summary | Create synthetic datasets with LLM generators and samplers |
| upload_time | 2025-10-22 02:28:47 |
| maintainer | None |
| docs_url | None |
| author | None |
| requires_python | >=3.8 |
| license | None |
| keywords | dataset generation, llm, machine learning, synthetic data |
| requirements | No requirements were recorded. |
| Travis-CI | No Travis. |
| coveralls test coverage | No coveralls. |

# Chatan

Create diverse synthetic datasets. Start from scratch or augment an existing dataset. Define your dataset schema as a set of generators, typically LLMs with a prompt describing the kind of examples you want.

## Installation

Basic installation (includes OpenAI, Anthropic, and core functionality):
```bash
pip install chatan
```

With optional features:
```bash
# For local model support (transformers + PyTorch)
pip install chatan[local]

# For advanced evaluation features (semantic similarity, BLEU score)
pip install chatan[eval]

# For all optional features
pip install chatan[all]
```
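
Before using the optional generators or metrics, it can help to confirm that the extra dependencies are importable. The following is a minimal sketch using only the standard library. The module names (`transformers`, `torch`, `sentence_transformers`, `nltk`) are assumptions about what the extras pull in, based on the descriptions above, and `missing_modules` is a hypothetical helper that is not part of chatan:

```python
from importlib.util import find_spec


def missing_modules(names):
    """Return the names from `names` that cannot be imported in this environment."""
    return [name for name in names if find_spec(name) is None]


# Assumed top-level module names for each extra; adjust to the real dependencies.
EXTRAS = {
    "local": ["transformers", "torch"],
    "eval": ["sentence_transformers", "nltk"],
}

for extra, modules in EXTRAS.items():
    missing = missing_modules(modules)
    status = "ok" if not missing else f"missing {', '.join(missing)}"
    print(f"chatan[{extra}]: {status}")
```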

## Getting Started

```python
import chatan

# Create a generator
gen = chatan.generator("openai", "YOUR_API_KEY")

# Define a dataset schema
ds = chatan.dataset({
    "topic": chatan.sample.choice(["Python", "JavaScript", "Rust"]),
    "prompt": gen("write a programming question about {topic}"),
    "response": gen("answer this question: {prompt}")
})

# Generate the data with a progress bar
df = ds.generate(n=10)
```
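
The `{topic}` and `{prompt}` placeholders refer to other columns in the same row, so each row has to be filled in dependency order. The sketch below illustrates that idea in plain Python, with a stub in place of the LLM call. `fake_llm`, `build_row`, and the callable-based schema are hypothetical and are not chatan's internals:

```python
import random


def fake_llm(prompt):
    """Stand-in for an LLM call; echoes the prompt so the example runs offline."""
    return f"<completion for: {prompt}>"


def build_row(schema, rng):
    """Fill columns in order, formatting each template with the columns built so far."""
    row = {}
    for name, spec in schema.items():
        if callable(spec):
            # Sampler column: draw a value directly.
            row[name] = spec(rng)
        else:
            # Template column: substitute earlier columns, then "generate".
            row[name] = fake_llm(spec.format(**row))
    return row


schema = {
    "topic": lambda rng: rng.choice(["Python", "JavaScript", "Rust"]),
    "prompt": "write a programming question about {topic}",
    "response": "answer this question: {prompt}",
}

rows = [build_row(schema, random.Random(seed)) for seed in range(3)]
for row in rows:
    print(row["topic"], "->", row["response"])
```

Because the schema is an ordinary dict, insertion order doubles as evaluation order in this sketch: `topic` must be listed before `prompt`, and `prompt` before `response`.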

## Generator Options

### API-based Generators (included in base install)
```python
# OpenAI
gen = chatan.generator("openai", "YOUR_OPENAI_API_KEY")

# Anthropic
gen = chatan.generator("anthropic", "YOUR_ANTHROPIC_API_KEY")
```

### Local Model Support (requires `pip install chatan[local]`)
```python
# HuggingFace Transformers
gen = chatan.generator("transformers", model="microsoft/DialoGPT-medium")
```

## Examples

### Create Data Mixes

```python
from chatan import dataset, generator, sample

gen = generator("openai", "YOUR_API_KEY")

mix = [
    "san antonio, tx",
    "marfa, tx",
    "paris, fr"
]

ds = dataset({
    "id": sample.uuid(),
    "topic": sample.choice(mix),
    "prompt": gen("write an example question about the history of {topic}"),
    "response": gen("respond to: {prompt}"),
})
```
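
If `sample.choice` draws uniformly from the list (an assumption; the distribution is not stated here), the proportions of a mix can be skewed by repeating entries. The standard-library sketch below shows the effect with `random.choice`, independent of chatan:

```python
import random
from collections import Counter

# "san antonio, tx" appears twice, so a uniform draw picks it about half the time.
mix = ["san antonio, tx", "san antonio, tx", "marfa, tx", "paris, fr"]

rng = random.Random(0)  # fixed seed so the run is repeatable
draws = 10_000
counts = Counter(rng.choice(mix) for _ in range(draws))

for topic, count in counts.most_common():
    print(f"{topic}: {count / draws:.2%}")
```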

### Augment Datasets

```python
from chatan import generator, dataset, sample
from datasets import load_dataset

gen = generator("openai", "YOUR_API_KEY")
hf_data = load_dataset("some/dataset")

ds = dataset({
    "original_prompt": sample.from_dataset(hf_data, "prompt"),
    "variation": gen("rewrite this prompt: {original_prompt}"),
    "response": gen("respond to: {variation}")
})
```
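
Conceptually, augmenting a dataset means drawing a value from an existing column and feeding it into new prompt templates. The plain-Python sketch below shows that flow, with a list of dicts standing in for the Hugging Face dataset and a stub in place of the LLM. `sample_column` and `stub_llm` are hypothetical names, not chatan functions:

```python
import random

# Stand-in for an existing dataset with a "prompt" column.
existing = [
    {"prompt": "Explain recursion."},
    {"prompt": "What is a hash table?"},
]


def sample_column(records, column, rng):
    """Pick the value of `column` from one randomly chosen record."""
    return rng.choice(records)[column]


def stub_llm(text):
    """Placeholder for a model call so the example runs offline."""
    return f"[model output for: {text}]"


rng = random.Random(0)
original = sample_column(existing, "prompt", rng)
variation = stub_llm(f"rewrite this prompt: {original}")
response = stub_llm(f"respond to: {variation}")
print({"original_prompt": original, "variation": variation, "response": response})
```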

## Evaluation

Evaluate rows inline or compute aggregate metrics:

```python
from chatan import dataset, eval, sample

ds = dataset({
    "col1": sample.choice(["a", "a", "b"]),
    "col2": "b",
    "score": eval.exact_match("col1", "col2")
})

df = ds.generate()
aggregate = ds.evaluate({
    "exact_match": ds.eval.exact_match("col1", "col2")
})
```
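
For intuition, an exact-match aggregate can be read as the share of rows in which the two columns hold identical values. The sketch below computes that share in plain Python. Treating the metric as a simple mean of per-row matches is an assumption about chatan's behavior, and `exact_match_rate` is a hypothetical helper:

```python
def exact_match_rate(rows, col_a, col_b):
    """Fraction of rows where row[col_a] == row[col_b]; 0.0 when there are no rows."""
    if not rows:
        return 0.0
    matches = sum(1 for row in rows if row[col_a] == row[col_b])
    return matches / len(rows)


rows = [
    {"col1": "a", "col2": "b"},
    {"col1": "a", "col2": "b"},
    {"col1": "b", "col2": "b"},
]

# One of the three rows matches.
print(exact_match_rate(rows, "col1", "col2"))
```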

### Advanced Evaluation (requires `pip install chatan[eval]`)
```python
# Assumes `ds` is the dataset defined in the Evaluation example above
# Semantic similarity using sentence transformers
aggregate = ds.evaluate({
    "semantic_sim": ds.eval.semantic_similarity("col1", "col2")
})

# BLEU score evaluation
aggregate = ds.evaluate({
    "bleu": ds.eval.bleu_score("col1", "col2")
})
```
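
Semantic similarity metrics typically embed each text as a vector and compare the vectors with cosine similarity. The sketch below shows only that comparison step on hand-written toy vectors. It does not call sentence transformers or chatan, and `cosine_similarity` is a hypothetical helper:

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors; 0.0 if either is all zeros."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy "embeddings"; real ones would come from an embedding model.
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
print(same_direction, orthogonal)
```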

## Installation Options Summary

| Feature | Install Command | What's Included |
|---------|----------------|-----------------|
| **Basic** | `pip install chatan` | OpenAI, Anthropic, core sampling, basic evaluation |
| **Local Models** | `pip install chatan[local]` | + HuggingFace Transformers, PyTorch |
| **Advanced Eval** | `pip install chatan[eval]` | + Semantic similarity, BLEU scores, NLTK |
| **Everything** | `pip install chatan[all]` | All features above |

## Citation

If you use this code in your research, please cite:

```bibtex
@software{reetz2025chatan,
  author = {Reetz, Christian},
  title = {chatan: Create synthetic datasets with LLM generators.},
  url = {https://github.com/cdreetz/chatan},
  year = {2025}
}
```

## Contributing

Community contributions are more than welcome: bug reports, bug fixes, feature requests, and feature additions. Please refer to the Issues tab.

            

Raw data

            {
    "_id": null,
    "home_page": null,
    "name": "chatan",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.8",
    "maintainer_email": null,
    "keywords": "dataset generation, llm, machine learning, synthetic data",
    "author": null,
    "author_email": "Christian Reetz <cdreetz@gmail.com>",
    "download_url": "https://files.pythonhosted.org/packages/09/47/0b266e939144371173665fcbca29f78f0a56bfb39826a03038d839b90508/chatan-0.2.1.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": null,
    "summary": "Create synthetic datasets with LLM generators and samplers",
    "version": "0.2.1",
    "project_urls": {
        "Documentation": "https://github.com/cdreetz/chatan#readme",
        "Issues": "https://github.com/cdreetz/chatan/issues",
        "Source": "https://github.com/cdreetz/chatan"
    },
    "split_keywords": [
        "dataset generation",
        " llm",
        " machine learning",
        " synthetic data"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "3715dc50c672d887168f310d8548f27d3fa57bcc2d3128cf9820a87c26c103ec",
                "md5": "3104b946ee8167598cc801e8b916a8c7",
                "sha256": "a783e43370a8317a3855b4ef5df7b09c059443fc091cbbf40c8cdc66a9446964"
            },
            "downloads": -1,
            "filename": "chatan-0.2.1-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "3104b946ee8167598cc801e8b916a8c7",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.8",
            "size": 21278,
            "upload_time": "2025-10-22T02:28:46",
            "upload_time_iso_8601": "2025-10-22T02:28:46.447951Z",
            "url": "https://files.pythonhosted.org/packages/37/15/dc50c672d887168f310d8548f27d3fa57bcc2d3128cf9820a87c26c103ec/chatan-0.2.1-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "09470b266e939144371173665fcbca29f78f0a56bfb39826a03038d839b90508",
                "md5": "c196927a4a5fad2aa7146daa3a8ccff3",
                "sha256": "be41ad947a2eacf7f1b11b9ce42c5739c9c40df32e9d00528e4c61f8c92fb48d"
            },
            "downloads": -1,
            "filename": "chatan-0.2.1.tar.gz",
            "has_sig": false,
            "md5_digest": "c196927a4a5fad2aa7146daa3a8ccff3",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.8",
            "size": 37280,
            "upload_time": "2025-10-22T02:28:47",
            "upload_time_iso_8601": "2025-10-22T02:28:47.987468Z",
            "url": "https://files.pythonhosted.org/packages/09/47/0b266e939144371173665fcbca29f78f0a56bfb39826a03038d839b90508/chatan-0.2.1.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-10-22 02:28:47",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "cdreetz",
    "github_project": "chatan#readme",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "lcname": "chatan"
}
        