<div align="center">
<picture>
<source media="(prefers-color-scheme: dark)" srcset="https://github.com/argilla-io/distilabel/blob/main/docs/assets/distilabel-white.png?raw=true">
<img alt="Distilabel Logo" src="https://raw.githubusercontent.com/argilla-io/distilabel/main/docs/assets/distilabel-black.png">
</picture>
</div>
<h3 align="center">Synthesize data for AI and add feedback on the fly!</h3>
<p align="center">
<a href="https://pypi.org/project/distilabel/">
<img alt="PyPI version" src="https://img.shields.io/pypi/v/distilabel.svg?style=flat-round&logo=pypi&logoColor=white">
</a>
<a href="https://pepy.tech/project/distilabel">
<img alt="PyPI downloads per month" src="https://static.pepy.tech/personalized-badge/distilabel?period=month&units=international_system&left_color=grey&right_color=blue&left_text=pypi%20downloads/month">
</a>
</p>
<p align="center">
<a href="https://twitter.com/argilla_io">
<img src="https://img.shields.io/badge/twitter-black?logo=x"/>
</a>
<a href="https://www.linkedin.com/company/argilla-io">
<img src="https://img.shields.io/badge/linkedin-blue?logo=linkedin"/>
</a>
<a href="https://join.slack.com/t/rubrixworkspace/shared_invite/zt-whigkyjn-a3IUJLD7gDbTZ0rKlvcJ5g">
<img src="https://img.shields.io/badge/slack-purple?logo=slack"/>
</a>
</p>
Distilabel is the **framework for synthetic data and AI feedback for AI engineers** who require **high-quality outputs, full data ownership, and overall efficiency**.
If you just want to get started, we recommend you check the [documentation](http://distilabel.argilla.io/). Curious, and want to know more? Keep reading!
<!-- ![overview](https://github.com/argilla-io/distilabel/assets/36760800/360110da-809d-4e24-a29b-1a1a8bc4f9b7) -->
## Why use Distilabel?
Whether you are working on **a predictive model** that computes semantic similarity or the next **generative model** that is going to beat the LLM benchmarks, our framework ensures that the **hard data work pays off**. Distilabel is the missing piece that helps you **synthesize data** and provide **AI feedback**.
### Improve your AI output quality through data quality
Compute is expensive and output quality is important. We help you **focus on data quality**, which tackles the root cause of both of these problems at once. Distilabel helps you synthesize and judge data so you can spend your valuable time on **achieving and keeping high-quality standards for your data**.
### Take control of your data and models
**Ownership of data for fine-tuning your own LLMs** is not easy, but Distilabel can help you get started. We integrate **AI feedback from any LLM provider out there** using one unified API.
### Improve efficiency by quickly iterating on the right research and LLMs
Synthesize and judge data with the **latest research papers** while ensuring **flexibility, scalability and fault tolerance**, so you can focus on improving your data and training your models.
## 🏘️ Community
We are an open-source, community-driven project and we love to hear from you. Here are some ways to get involved:
- [Community Meetup](https://lu.ma/embed-checkout/evt-IQtRiSuXZCIW6FB): listen in or present during one of our bi-weekly events.
- [Slack](https://join.slack.com/t/rubrixworkspace/shared_invite/zt-whigkyjn-a3IUJLD7gDbTZ0rKlvcJ5g): get direct support from the community.
- [Roadmap](https://github.com/orgs/argilla-io/projects/10/views/1): plans change, but we love to discuss them with our community, so feel encouraged to participate.
## What do people build with Distilabel?
Distilabel is a tool that can be used to **synthesize data and provide AI feedback**. Our community uses Distilabel to create amazing [datasets](https://huggingface.co/datasets?other=distilabel) and [models](https://huggingface.co/models?other=distilabel), and **we love contributions to open-source** ourselves too.
- The [1M OpenHermesPreferences](https://huggingface.co/datasets/argilla/OpenHermesPreferences) is a dataset of ~1 million AI preferences derived from teknium/OpenHermes-2.5. It shows how we can use Distilabel to **synthesize data on an immense scale**.
- Our [distilabeled Intel Orca DPO dataset](https://huggingface.co/datasets/argilla/distilabel-intel-orca-dpo-pairs) and the [improved OpenHermes model](https://huggingface.co/argilla/distilabeled-OpenHermes-2.5-Mistral-7B) show how we **improve model performance by filtering out 50%** of the original dataset through **AI feedback**.
- The [haiku DPO data](https://github.com/davanstrien/haiku-dpo) outlines how anyone can create a **dataset for a specific task** and apply **the latest research papers** to improve the quality of the dataset.
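The filtering idea in the second item can be sketched in plain Python. This is a minimal illustration, not the pipeline used to build that dataset: the record fields (`chosen_rating`, `rejected_rating`) and the `min_margin` threshold are hypothetical, and the ratings are assumed to come from an LLM acting as a judge.

```python
from typing import Dict, List


def filter_preference_pairs(
    records: List[Dict], min_margin: float = 1.0
) -> List[Dict]:
    """Keep only pairs where the AI judge clearly prefers the chosen response.

    `chosen_rating` and `rejected_rating` are hypothetical field names for
    scores assigned by an LLM judge; ties and near-ties are dropped.
    """
    kept = []
    for record in records:
        margin = record["chosen_rating"] - record["rejected_rating"]
        if margin >= min_margin:
            kept.append(record)
    return kept


if __name__ == "__main__":
    pairs = [
        {"prompt": "a", "chosen_rating": 9.0, "rejected_rating": 4.0},
        {"prompt": "b", "chosen_rating": 7.0, "rejected_rating": 7.0},  # tie
        {"prompt": "c", "chosen_rating": 6.0, "rejected_rating": 8.0},  # judge disagrees
        {"prompt": "d", "chosen_rating": 8.0, "rejected_rating": 6.5},
    ]
    clean = filter_preference_pairs(pairs)
    print(f"kept {len(clean)} of {len(pairs)} pairs")
```

A stricter `min_margin` discards more pairs, trading dataset size for cleaner preference signal.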
## 👨🏽💻 Installation
```sh
pip install distilabel --upgrade
```
Requires Python 3.8+
In addition, the following extras are available:
- `anthropic`: for using models available in the [Anthropic API](https://www.anthropic.com/api) via the `AnthropicLLM` integration.
- `cohere`: for using models available in [Cohere](https://cohere.ai/) via the `CohereLLM` integration.
- `argilla`: for exporting the generated datasets to [Argilla](https://argilla.io/).
- `hf-inference-endpoints`: for using the [Hugging Face Inference Endpoints](https://huggingface.co/inference-endpoints) via the `InferenceEndpointsLLM` integration.
- `hf-transformers`: for using models available in the [transformers](https://github.com/huggingface/transformers) package via the `TransformersLLM` integration.
- `litellm`: for using [`LiteLLM`](https://github.com/BerriAI/litellm) to call any LLM using the OpenAI format via the `LiteLLM` integration.
- `llama-cpp`: for using the [llama-cpp-python](https://github.com/abetlen/llama-cpp-python) Python bindings for `llama.cpp` via the `LlamaCppLLM` integration.
- `mistralai`: for using models available in the [Mistral AI API](https://mistral.ai/news/la-plateforme/) via the `MistralAILLM` integration.
- `ollama`: for using [Ollama](https://ollama.com/) and its available models via the `OllamaLLM` integration.
- `openai`: for using [OpenAI API](https://openai.com/blog/openai-api) models via the `OpenAILLM` integration, or any of the other integrations that build on the OpenAI client, such as `AnyscaleLLM`, `AzureOpenAILLM`, and `TogetherLLM`.
- `vertexai`: for using [Google Vertex AI](https://cloud.google.com/vertex-ai) proprietary models via the `VertexAILLM` integration.
- `vllm`: for using the [vllm](https://github.com/vllm-project/vllm) serving engine via the `vLLM` integration.
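To check which integrations are usable in the current environment, one option is to probe for the underlying packages. The sketch below uses only the standard library. The mapping from extra name to importable module name is an assumption made for illustration (for example, that the `openai` extra provides a module named `openai`) and may not match what each extra actually installs.

```python
from importlib.util import find_spec
from typing import Dict

# Assumed mapping of extra name -> top-level module it is expected to provide.
# These module names are illustrative guesses, not read from distilabel's metadata.
ASSUMED_EXTRA_MODULES: Dict[str, str] = {
    "anthropic": "anthropic",
    "cohere": "cohere",
    "argilla": "argilla",
    "hf-transformers": "transformers",
    "litellm": "litellm",
    "llama-cpp": "llama_cpp",
    "mistralai": "mistralai",
    "ollama": "ollama",
    "openai": "openai",
    "vllm": "vllm",
}


def installed_extras(extra_modules: Dict[str, str]) -> Dict[str, bool]:
    """Return, for each extra, whether its assumed module can be imported."""
    return {
        extra: find_spec(module) is not None
        for extra, module in extra_modules.items()
    }


if __name__ == "__main__":
    for extra, available in installed_extras(ASSUMED_EXTRA_MODULES).items():
        status = "available" if available else "missing"
        print(f"{extra:<16} {status}")
```

Several extras can be requested in one install by separating them with commas inside the brackets, which is standard `pip` syntax.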
### Example
To run the following example you must install `distilabel` with the `openai` extra:
```sh
pip install "distilabel[openai]" --upgrade
```
Then run:
```python
from distilabel.llms import OpenAILLM
from distilabel.pipeline import Pipeline
from distilabel.steps import LoadHubDataset
from distilabel.steps.tasks import TextGeneration

# OpenAILLM needs an OpenAI API key, typically supplied through the
# OPENAI_API_KEY environment variable.
with Pipeline(
    name="simple-text-generation-pipeline",
    description="A simple text generation pipeline",
) as pipeline:
    # Load a dataset from the Hugging Face Hub and rename its `prompt`
    # column to `instruction`.
    load_dataset = LoadHubDataset(
        name="load_dataset",
        output_mappings={"prompt": "instruction"},
    )

    # Generate a completion for every instruction.
    generate_with_openai = TextGeneration(
        name="generate_with_gpt35", llm=OpenAILLM(model="gpt-3.5-turbo")
    )

    # Feed the loaded rows into the generation step.
    load_dataset.connect(generate_with_openai)

if __name__ == "__main__":
    # Runtime parameters are keyed by step name.
    distiset = pipeline.run(
        parameters={
            "load_dataset": {
                "repo_id": "distilabel-internal-testing/instruction-dataset-mini",
                "split": "test",
            },
            "generate_with_gpt35": {
                "llm": {
                    "generation_kwargs": {
                        "temperature": 0.7,
                        "max_new_tokens": 512,
                    }
                }
            },
        },
    )
```
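In the example above, `output_mappings={"prompt": "instruction"}` renames the dataset's `prompt` column to `instruction` before the rows reach the next step. The following plain-Python sketch mimics that renaming on a list of rows. `apply_output_mappings` is a hypothetical helper written for illustration; it is not part of the `distilabel` API and says nothing about how the library implements the mapping internally.

```python
from typing import Dict, List


def apply_output_mappings(
    rows: List[Dict[str, str]], mappings: Dict[str, str]
) -> List[Dict[str, str]]:
    """Rename keys in every row according to `mappings`; other keys are kept."""
    return [
        {mappings.get(key, key): value for key, value in row.items()}
        for row in rows
    ]


if __name__ == "__main__":
    rows = [{"prompt": "Write a haiku about data.", "source": "demo"}]
    renamed = apply_output_mappings(rows, {"prompt": "instruction"})
    print(renamed)
```

Keeping the rename in the loading step means the generation step can stay unchanged when a dataset uses a different column name.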
## Badges
If you build something cool with `distilabel`, consider adding one of these badges to your dataset or model card.

    [<img src="https://raw.githubusercontent.com/argilla-io/distilabel/main/docs/assets/distilabel-badge-light.png" alt="Built with Distilabel" width="200" height="32"/>](https://github.com/argilla-io/distilabel)

[<img src="https://raw.githubusercontent.com/argilla-io/distilabel/main/docs/assets/distilabel-badge-light.png" alt="Built with Distilabel" width="200" height="32"/>](https://github.com/argilla-io/distilabel)

    [<img src="https://raw.githubusercontent.com/argilla-io/distilabel/main/docs/assets/distilabel-badge-dark.png" alt="Built with Distilabel" width="200" height="32"/>](https://github.com/argilla-io/distilabel)

[<img src="https://raw.githubusercontent.com/argilla-io/distilabel/main/docs/assets/distilabel-badge-dark.png" alt="Built with Distilabel" width="200" height="32"/>](https://github.com/argilla-io/distilabel)
## Contribute
To contribute directly to `distilabel`, check our [good first issues](https://github.com/argilla-io/distilabel/issues?q=is%3Aissue+is%3Aopen+label%3A%22good+first+issue%22) or [open a new one](https://github.com/argilla-io/distilabel/issues/new/choose).
Raw data
{
"_id": null,
"home_page": null,
"name": "distilabel",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.8",
"maintainer_email": null,
"keywords": "alignment, annotation, data, llm, rlaif, synthetic",
"author": null,
"author_email": "Argilla <admin@argilla.io>",
"download_url": "https://files.pythonhosted.org/packages/5b/56/ad9718ecab84260d7e8747976b44c772f099e3553cc28c635ee08a97a5a0/distilabel-1.0.3.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": null,
"summary": "AI Feedback (AIF) framework",
"version": "1.0.3",
"project_urls": {
"Documentation": "https://distilabel.argilla.io/",
"Issues": "https://github.com/argilla/distilabel/issues",
"Source": "https://github.com/argilla/distilabel"
},
"split_keywords": [
"alignment",
" annotation",
" data",
" llm",
" rlaif",
" synthetic"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "78bcd2095cfe49a3d552814c58ca79d16b38f8c340ddf65c559e9e167bf08d0a",
"md5": "f307d71bad7b7212923a5daa0c5dfec1",
"sha256": "6a43380671fb60b26e7a68600409315794bc3f907d7da1374d63a192e1b8e233"
},
"downloads": -1,
"filename": "distilabel-1.0.3-py3-none-any.whl",
"has_sig": false,
"md5_digest": "f307d71bad7b7212923a5daa0c5dfec1",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.8",
"size": 185632,
"upload_time": "2024-04-25T12:48:58",
"upload_time_iso_8601": "2024-04-25T12:48:58.256554Z",
"url": "https://files.pythonhosted.org/packages/78/bc/d2095cfe49a3d552814c58ca79d16b38f8c340ddf65c559e9e167bf08d0a/distilabel-1.0.3-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "5b56ad9718ecab84260d7e8747976b44c772f099e3553cc28c635ee08a97a5a0",
"md5": "f2aea43ddbf6b82045247179a811a2e9",
"sha256": "01aecdc050f8e679f6a3635bd3d1e25bd906f5931c93efc250f9a483021c06ae"
},
"downloads": -1,
"filename": "distilabel-1.0.3.tar.gz",
"has_sig": false,
"md5_digest": "f2aea43ddbf6b82045247179a811a2e9",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.8",
"size": 4217863,
"upload_time": "2024-04-25T12:49:00",
"upload_time_iso_8601": "2024-04-25T12:49:00.022207Z",
"url": "https://files.pythonhosted.org/packages/5b/56/ad9718ecab84260d7e8747976b44c772f099e3553cc28c635ee08a97a5a0/distilabel-1.0.3.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-04-25 12:49:00",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "argilla",
"github_project": "distilabel",
"github_not_found": true,
"lcname": "distilabel"
}