<p align="center">
<a href="https://bespokelabs.ai/" target="_blank">
<picture>
<source media="(prefers-color-scheme: light)" width="80px" srcset="https://raw.githubusercontent.com/bespokelabsai/curator/main/docs/Bespoke-Labs-Logomark-Red.png">
<img alt="Bespoke Labs Logo" width="80px" src="https://raw.githubusercontent.com/bespokelabsai/curator/main/docs/Bespoke-Labs-Logomark-Red-on-Black.png">
</picture>
</a>
</p>
<h1 align="center">Bespoke Curator</h1>
<h3 align="center" style="font-size: 20px; margin-bottom: 4px">Data Curation for Post-Training & Structured Data Extraction</h3>
<br/>
<p align="center">
<a href="https://docs.bespokelabs.ai/">
<img alt="Static Badge" src="https://img.shields.io/badge/Docs-docs.bespokelabs.ai-blue?style=flat&link=https%3A%2F%2Fdocs.bespokelabs.ai">
</a>
<a href="https://bespokelabs.ai/">
<img alt="Site" src="https://img.shields.io/badge/Site-bespokelabs.ai-blue?link=https%3A%2F%2Fbespokelabs.ai"/>
</a>
<img alt="PyPI - Version" src="https://img.shields.io/pypi/v/bespokelabs-curator">
<a href="https://twitter.com/bespokelabsai">
<img src="https://img.shields.io/twitter/follow/bespokelabsai" alt="Follow on X" />
</a>
<a href="https://discord.gg/KqpXvpzVBS">
<img alt="Discord" src="https://img.shields.io/discord/1230990265867698186">
</a>
<a href="https://github.com/psf/black">
<img alt="Code style: black" src="https://img.shields.io/badge/Code%20style-black-000000.svg">
</a>
</p>
## Overview
Bespoke Curator makes it easy to create high-quality synthetic data at scale, which you can use to finetune models or for structured data extraction.

Bespoke Curator is an open-source project that includes:
* A rich Python-based library for generating and curating synthetic data.
* A Curator Viewer that makes it easy to inspect datasets, aiding dataset creation.
* High-quality datasets, which we will be releasing and which should move the needle on post-training.
## Key Features
1. **Programmability and Structured Outputs**: Synthetic data generation is a lot more than just using a single prompt: it involves calling LLMs multiple times and orchestrating control flow. Curator treats structured outputs as first-class citizens and helps you design complex pipelines.
2. **Built-in Performance Optimization**: We often see users calling LLMs in loops or implementing multi-threading inefficiently. Curator bakes in performance optimizations so that you don't need to worry about them.
3. **Intelligent Caching and Fault Recovery**: Given that LLM calls can add up in cost and time, failures are undesirable but sometimes unavoidable. Curator caches LLM requests and responses so that recovering from a failure is easy. Moreover, in a multi-stage pipeline, per-stage caching makes it easy to iterate.
4. **Native HuggingFace Dataset Integration**: Work directly on HuggingFace Dataset objects throughout your pipeline. Your synthetic data is immediately ready for fine-tuning!
5. **Interactive Curator Viewer**: Improve and iterate on your prompts using our built-in viewer. Inspect LLM requests and responses in real-time, allowing you to iterate and refine your data generation strategy with immediate feedback.
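The caching in feature 3 can be pictured as keying responses by a hash of the full request. The following is a simplified sketch of that idea, not Curator's actual cache implementation (the real cache is persisted on disk under `~/.cache/curator`); `cached_call` and `fake_llm` are hypothetical names for illustration:

```python
import hashlib
import json

_cache = {}

def cached_call(model, prompt, llm_fn):
    # Key the cache on a stable hash of the request so identical
    # (model, prompt) pairs never hit the LLM twice.
    key = hashlib.sha256(
        json.dumps({"model": model, "prompt": prompt}, sort_keys=True).encode()
    ).hexdigest()
    if key not in _cache:
        _cache[key] = llm_fn(prompt)
    return _cache[key]

calls = []

def fake_llm(prompt):
    calls.append(prompt)  # record that a "real" LLM call happened
    return f"response to: {prompt}"

first = cached_call("gpt-4o-mini", "hello", fake_llm)
second = cached_call("gpt-4o-mini", "hello", fake_llm)
print(len(calls))  # 1: the second call was served from the cache
```

The same keying idea extends naturally to per-stage caching: each stage's output is keyed by its inputs, so an unchanged stage is never recomputed.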
## Installation
```bash
pip install bespokelabs-curator
```
## Usage
To run the examples below, make sure to set your OpenAI API key in
the environment variable `OPENAI_API_KEY` by running `export OPENAI_API_KEY=sk-...` in your terminal.
### Hello World with `SimpleLLM`: A simple interface for calling LLMs
```python
from bespokelabs import curator
llm = curator.SimpleLLM(model_name="gpt-4o-mini")
poem = llm("Write a poem about the importance of data in AI.")
print(poem)
# Or you can pass a list of prompts to generate multiple responses.
poems = llm(["Write a poem about the importance of data in AI.",
             "Write a haiku about the importance of data in AI."])
print(poems)
```
Note that retries and caching are enabled by default,
so if you run the same prompt again, you will get the same response almost instantly.
You can clear the cache by deleting `~/.cache/curator`.
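For example, to inspect or clear the cache from the shell (the directory layout inside the cache is an implementation detail and may change):

```shell
CACHE_DIR="$HOME/.cache/curator"
# List cached runs, if any
ls "$CACHE_DIR" 2>/dev/null || echo "no cache yet"
# Delete the cache so the next run makes fresh LLM calls
rm -rf "$CACHE_DIR"
```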
#### Use LiteLLM backend for calling other models
You can use the [LiteLLM](https://docs.litellm.ai/docs/providers) backend for calling other models.
```python
from bespokelabs import curator
llm = curator.SimpleLLM(model_name="claude-3-5-sonnet-20240620", backend="litellm")
poem = llm("Write a poem about the importance of data in AI.")
print(poem)
```
### Visualize in Curator Viewer
Run `curator-viewer` on the command line to see the dataset in the viewer.
You can click on a run and then click on a specific row to see the LLM request and response.
![Curator Responses](docs/curator-responses.png)
More examples below.
### `LLM`: A more powerful interface for synthetic data generation
Let's use structured outputs to generate poems.
```python
from bespokelabs import curator
from datasets import Dataset
from pydantic import BaseModel, Field
from typing import List
topics = Dataset.from_dict({"topic": [
    "Urban loneliness in a bustling city",
    "Beauty of Bespoke Labs's Curator library",
]})
```
Define a class to encapsulate a list of poems.
```python
class Poem(BaseModel):
    poem: str = Field(description="A poem.")


class Poems(BaseModel):
    poems_list: List[Poem] = Field(description="A list of poems.")
```
We define an `LLM` object that generates poems, which we then apply to the topics dataset.
```python
poet = curator.LLM(
    prompt_func=lambda row: f"Write two poems about {row['topic']}.",
    model_name="gpt-4o-mini",
    response_format=Poems,
    parse_func=lambda row, poems: [
        {"topic": row["topic"], "poem": p.poem} for p in poems.poems_list
    ],
)
```
Here:
* `prompt_func` takes a row of the dataset as input and returns the prompt for the LLM.
* `response_format` is the structured output class we defined above.
* `parse_func` takes the input (`row`) and the structured output (`poems`) and converts them into a list of dictionaries, so that the output can easily be converted to a HuggingFace Dataset object.
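To see what `parse_func` does in isolation, here is a self-contained sketch that uses `SimpleNamespace` objects to stand in for the Pydantic instances Curator would pass in (the names mirror the example above):

```python
from types import SimpleNamespace

# Stand-ins for the row and the parsed structured output.
row = {"topic": "Urban loneliness in a bustling city"}
poems = SimpleNamespace(poems_list=[
    SimpleNamespace(poem="First poem..."),
    SimpleNamespace(poem="Second poem..."),
])

parse_func = lambda row, poems: [
    {"topic": row["topic"], "poem": p.poem} for p in poems.poems_list
]

rows = parse_func(row, poems)
print(len(rows))  # 2: each poem becomes its own dataset row
```

Because each input row can fan out into multiple output rows, the resulting dataset below has two rows per topic.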
Now we can apply the `LLM` object to the dataset, which reads very Pythonically.
```python
poem = poet(topics)
print(poem.to_pandas())
# Example output:
# topic poem
# 0 Urban loneliness in a bustling city In the city's heart, where the sirens wail,\nA...
# 1 Urban loneliness in a bustling city City streets hum with a bittersweet song,\nHor...
# 2 Beauty of Bespoke Labs's Curator library In whispers of design and crafted grace,\nBesp...
# 3 Beauty of Bespoke Labs's Curator library In the hushed breath of parchment and ink,\nBe...
```
Note that `topics` can be created with `curator.LLM` as well,
and we can scale this up to create tens of thousands of diverse poems.
You can see a more detailed example in the [examples/poem.py](https://github.com/bespokelabsai/curator/blob/mahesh/update_doc/examples/poem.py) file,
and other examples in the [examples](https://github.com/bespokelabsai/curator/blob/mahesh/update_doc/examples) directory.
See the [docs](https://docs.bespokelabs.ai/) for more details as well as
for troubleshooting information.
## Bespoke Curator Viewer
To run the bespoke dataset viewer:
```bash
curator-viewer
```
This will pop up a browser window with the viewer running on `127.0.0.1:3000` by default if you haven't specified a different host and port.
The dataset viewer shows all the different runs you have made.
![Curator Runs](docs/curator-runs.png)
You can also see the dataset and the responses from the LLM.
![Curator Dataset](docs/curator-dataset.png)
Optional parameters to run the viewer on a different host and port:
```bash
$ curator-viewer -h
usage: curator-viewer [-h] [--host HOST] [--port PORT] [--verbose]

Curator Viewer

options:
  -h, --help     show this help message and exit
  --host HOST    Host to run the server on (default: localhost)
  --port PORT    Port to run the server on (default: 3000)
  --verbose, -v  Enables debug logging for more verbose output
```
The only requirement for running `curator-viewer` is Node.js. You can install it by following the instructions [here](https://nodejs.org/en/download/package-manager).
To check whether you already have Node.js installed, run:
```bash
node -v
```
If it's not installed, you can install the latest Node.js on macOS by running:
```bash
# installs nvm (Node Version Manager)
curl -o- https://raw.githubusercontent.com/nvm-sh/nvm/v0.40.0/install.sh | bash
# download and install Node.js (you may need to restart the terminal)
nvm install 22
# verifies the right Node.js version is in the environment
node -v # should print `v22.11.0`
# verifies the right npm version is in the environment
npm -v # should print `10.9.0`
```
## Contributing
Contributions are welcome!