<p align="center">
<a href="https://bespokelabs.ai/" target="_blank">
<picture>
<source media="(prefers-color-scheme: light)" width="100px" srcset="docs/Bespoke-Labs-Logomark-Red-crop.png">
<img alt="Bespoke Labs Logo" width="100px" src="docs/Bespoke-Labs-Logomark-Red-crop.png">
</picture>
</a>
</p>
<h1 align="center">Bespoke Curator</h1>
<h3 align="center" style="font-size: 20px; margin-bottom: 4px">Data Curation for Post-Training & Structured Data Extraction</h3>
<br/>
<p align="center">
<a href="https://docs.bespokelabs.ai/bespoke-curator/getting-started">
<img alt="Static Badge" src="https://img.shields.io/badge/Docs-docs.bespokelabs.ai-blue?style=flat&link=https%3A%2F%2Fdocs.bespokelabs.ai">
</a>
<a href="https://bespokelabs.ai/">
<img alt="Site" src="https://img.shields.io/badge/Site-bespokelabs.ai-blue?link=https%3A%2F%2Fbespokelabs.ai"/>
</a>
<img alt="PyPI - Version" src="https://img.shields.io/pypi/v/bespokelabs-curator">
<a href="https://twitter.com/bespokelabsai">
<img src="https://img.shields.io/twitter/follow/bespokelabsai" alt="Follow on X" />
</a>
<a href="https://discord.gg/KqpXvpzVBS">
<img alt="Discord" src="https://img.shields.io/discord/1230990265867698186">
</a>
</p>
<div align="center">
[ English | <a href="README_zh.md">中文</a> ]
</div>
## Overview
Bespoke Curator makes it easy to create synthetic data pipelines. Whether you are training a model or extracting structure, Curator will prepare high-quality data quickly and robustly.
* Rich Python-based library for generating and curating synthetic data.
* Interactive viewer to monitor data while it is being generated.
* First-class support for structured outputs.
* Built-in performance optimizations for asynchronous operations, caching, and fault recovery at every scale.
* Support for a wide range of inference options via LiteLLM, vLLM, and popular batch APIs.
![CLI in action](docs/curator-cli.gif)
Check out our full documentation for [getting started](https://docs.bespokelabs.ai/bespoke-curator/getting-started), [tutorials](https://docs.bespokelabs.ai/bespoke-curator/tutorials), [guides](https://docs.bespokelabs.ai/bespoke-curator/how-to-guides) and detailed [reference](https://docs.bespokelabs.ai/bespoke-curator/api-reference/llm-api-documentation).
## Installation
```bash
pip install bespokelabs-curator
```
## Quickstart
### Using `curator.LLM`
```python
from bespokelabs import curator
llm = curator.LLM(model_name="gpt-4o-mini")
poem = llm("Write a poem about the importance of data in AI.")
print(poem.to_pandas())
```
> [!NOTE]
> Retries and caching are enabled by default to help you rapidly iterate your data pipelines.
> If you run the same prompt again, you will get the cached response almost instantly.
> You can delete the cache at `~/.cache/curator` or disable it with `export CURATOR_DISABLE_CACHE=true`.
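The two cache controls from the note above can be exercised from the shell; this sketch simply restates them as commands (no Curator-specific flags beyond the documented environment variable):

```bash
# Force fresh generations by clearing the response cache
rm -rf ~/.cache/curator

# Or bypass caching entirely for the current shell session
export CURATOR_DISABLE_CACHE=true
```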
### Calling other models
You can also call other [LiteLLM](https://docs.litellm.ai/docs/) supported models by
changing the `model_name` argument.
```python
llm = curator.LLM(model_name="claude-3-5-sonnet-20240620")
```
In addition to a wide range of API providers, local web servers (hosted by [vLLM](https://docs.bespokelabs.ai/bespoke-curator/how-to-guides/using-vllm-with-curator#online-mode-server) or [Ollama](https://docs.bespokelabs.ai/bespoke-curator/how-to-guides/using-ollama-with-curator)) are supported via LiteLLM. For completely offline inference directly through vLLM, see the [documentation](https://docs.bespokelabs.ai/bespoke-curator/how-to-guides/using-vllm-with-curator#offline-mode-local).
> [!IMPORTANT]
> Make sure to set your API keys as environment variables for the model you are calling. For example, running `export OPENAI_API_KEY=sk-...` and `export ANTHROPIC_API_KEY=ant-...` will allow you to run the previous two examples. A full list of supported models and their associated environment variable names can be found [in the litellm docs](https://docs.litellm.ai/docs/providers).
> [!TIP]
> If you are generating large datasets, you may want to use [batch mode](https://docs.bespokelabs.ai/bespoke-curator/tutorials/save-usdusdusd-with-batch-mode) to save costs. Currently, batch APIs from [OpenAI](https://platform.openai.com/docs/guides/batch) and [Anthropic](https://docs.anthropic.com/en/docs/build-with-claude/message-batches) are supported. With Curator, this is as simple as setting `batch=True` in the `LLM` class.
### Using structured outputs
Let's use structured outputs to generate multiple poems in a single LLM call. We can define a class to encapsulate a list of poems,
and then pass it to the `LLM` class.
```python
from typing import List
from pydantic import BaseModel, Field
from bespokelabs import curator


class Poem(BaseModel):
    poem: str = Field(description="A poem.")


class Poems(BaseModel):
    poems_list: List[Poem] = Field(description="A list of poems.")


llm = curator.LLM(model_name="gpt-4o-mini", response_format=Poems)
poems = llm(["Write two poems about the importance of data in AI.",
             "Write three haikus about the importance of data in AI."])
print(poems.to_pandas())
```
Note how each `Poems` object occupies a single row in the dataset.
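Because `response_format` is an ordinary Pydantic model, the nested structure can be checked locally before any API call. This sketch only exercises the models defined above, with hand-written poem text standing in for model output (no Curator or network access involved):

```python
from typing import List
from pydantic import BaseModel, Field


class Poem(BaseModel):
    poem: str = Field(description="A poem.")


class Poems(BaseModel):
    poems_list: List[Poem] = Field(description="A list of poems.")


# One structured response is a single Poems object holding several Poem
# entries, which is why it occupies a single dataset row.
response = Poems(poems_list=[Poem(poem="Data flows like rivers."),
                             Poem(poem="Signal in the noise.")])
print(len(response.poems_list))  # 2
```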
For more advanced use cases, you may need custom parsing and prompting logic. For example, you might want to preserve the mapping between each topic and the poem generated from it. In this case, you can define a `Poet` class that inherits from `LLM` and implements its own prompting and parsing logic:
```python
from typing import Dict, List
from datasets import Dataset
from pydantic import BaseModel, Field
from bespokelabs import curator


class Poem(BaseModel):
    poem: str = Field(description="A poem.")


class Poems(BaseModel):
    poems: List[Poem] = Field(description="A list of poems.")


class Poet(curator.LLM):
    response_format = Poems

    def prompt(self, input: Dict) -> str:
        return f"Write two poems about {input['topic']}."

    def parse(self, input: Dict, response: Poems) -> List[Dict]:
        return [{"topic": input["topic"], "poem": p.poem} for p in response.poems]


poet = Poet(model_name="gpt-4o-mini")

topics = Dataset.from_dict({"topic": ["Urban loneliness in a bustling city", "Beauty of Bespoke Labs's Curator library"]})
poem = poet(topics)
print(poem.to_pandas())
```
```
                                      topic                                               poem
0       Urban loneliness in a bustling city  In the city’s heart, where the lights never di...
1       Urban loneliness in a bustling city  Steps echo loudly, pavement slick with rain,\n...
2  Beauty of Bespoke Labs's Curator library   In the heart of Curation’s realm, \nWhere art...
3  Beauty of Bespoke Labs's Curator library   Step within the library’s embrace, \nA sanctu...
```
In the `Poet` class:
* `response_format` is the structured output class we defined above.
* `prompt` takes the input (`input`) and returns the prompt for the LLM.
* `parse` takes the input (`input`) and the structured output (`response`) and converts them into a list of dictionaries, so that the output can easily be converted to a HuggingFace Dataset object.
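The fan-out in `parse` can be seen without calling any model. This sketch replays the same list comprehension on a hand-built response, using `SimpleNamespace` as a stand-in for the structured `Poems` object (no Curator involved):

```python
from types import SimpleNamespace
from typing import Dict, List


def parse(input: Dict, response) -> List[Dict]:
    # Same logic as Poet.parse: one output row per poem, topic carried along.
    return [{"topic": input["topic"], "poem": p.poem} for p in response.poems]


# Stand-in for a structured Poems response (no API call needed).
fake_response = SimpleNamespace(poems=[SimpleNamespace(poem="First poem."),
                                       SimpleNamespace(poem="Second poem.")])
rows = parse({"topic": "data"}, fake_response)
print(rows)  # two rows, both tagged with the same topic
```

Because each input row can produce several output rows, two topics with two poems each become four rows in the final dataset, matching the output table above.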
Note that `topics` can be created with another `LLM` class as well,
and we can scale this up to create tens of thousands of diverse poems.
You can see a more detailed example in the [examples/poem-generation/poem.py](examples/poem-generation/poem.py) file,
and other examples in the [examples](examples) directory.
See the [docs](https://docs.bespokelabs.ai/) for more details as well as
for troubleshooting information.
## Bespoke Curator Viewer
![Viewer in action](docs/curator-viewer.gif)
To run the Bespoke dataset viewer:
```bash
curator-viewer
```
This opens a browser window with the viewer running on `127.0.0.1:3000` by default, unless you specify a different host and port.
The dataset viewer shows all the different runs you have made. Once a run is selected, you can see the dataset and the responses from the LLM.
Optional parameters to run the viewer on a different host and port:
```bash
curator-viewer -h
usage: curator-viewer [-h] [--host HOST] [--port PORT] [--verbose]
Curator Viewer
options:
-h, --help show this help message and exit
--host HOST Host to run the server on (default: localhost)
--port PORT Port to run the server on (default: 3000)
--verbose, -v Enables debug logging for more verbose output
```
The only requirement for running `curator-viewer` is Node.js. You can install it by following the instructions [here](https://nodejs.org/en/download/package-manager).
To check whether Node.js is installed, run:
```bash
node -v
```
If it is not installed, you can install the latest Node.js on macOS by running:
```bash
# installs nvm (Node Version Manager)
curl -o- https://raw.githubusercontent.com/nvm-sh/nvm/v0.40.0/install.sh | bash
# download and install Node.js (you may need to restart the terminal)
nvm install 22
# verifies the right Node.js version is in the environment
node -v # should print `v22.11.0`
# verifies the right npm version is in the environment
npm -v # should print `10.9.0`
```
## Contributing
Thank you to all the contributors for making this project possible!
Please follow [these instructions](CONTRIBUTING.md) on how to contribute.
## Citation
If you find Curator useful, please consider citing us!
```bibtex
@software{curator2025,
  author = {Marten, Ryan and Vu, Trung and Cheng-Jie Ji, Charlie and Sharma, Kartik and Dimakis, Alex and Sathiamoorthy, Mahesh},
  month = jan,
  title = {{Curator: A Tool for Synthetic Data Creation}},
  year = {2025}
}
```