distilabel


Namedistilabel JSON
Version 1.0.3 PyPI version JSON
download
home_pageNone
SummaryAI Feedback (AIF) framework
upload_time2024-04-25 12:49:00
maintainerNone
docs_urlNone
authorNone
requires_python>=3.8
licenseNone
keywords alignment annotation data llm rlaif synthetic
VCS
bugtrack_url
requirements No requirements were recorded.
Travis-CI No Travis.
coveralls test coverage No coveralls.
            <div align="center">
  <picture>
    <source media="(prefers-color-scheme: dark)" srcset="https://github.com/argilla-io/distilabel/blob/main/docs/assets/distilabel-white.png?raw=true">
    <img alt="Distilabel Logo" src="https://raw.githubusercontent.com/argilla-io/distilabel/main/docs/assets/distilabel-black.png">
  </picture>
</div>
<h3 align="center">Synthesize data for AI and add feedback on the fly!</h2>

<p align="center">
<a  href="https://pypi.org/project/distilabel/">
<img alt="CI" src="https://img.shields.io/pypi/v/distilabel.svg?style=flat-round&logo=pypi&logoColor=white">
</a>
<a href="https://pepy.tech/project/distilabel">
<img alt="CI" src="https://static.pepy.tech/personalized-badge/distilabel?period=month&units=international_system&left_color=grey&right_color=blue&left_text=pypi%20downloads/month">
</a>
</p>

<p align="center">
<a href="https://twitter.com/argilla_io">
<img src="https://img.shields.io/badge/twitter-black?logo=x"/>
</a>
<a href="https://www.linkedin.com/company/argilla-io">
<img src="https://img.shields.io/badge/linkedin-blue?logo=linkedin"/>
</a>
<a href="https://join.slack.com/t/rubrixworkspace/shared_invite/zt-whigkyjn-a3IUJLD7gDbTZ0rKlvcJ5g">
<img src="https://img.shields.io/badge/slack-purple?logo=slack"/>
</a>
</p>

Distilabel is the **framework for synthetic data and AI feedback for AI engineers** that require **high-quality outputs, full data ownership, and overall efficiency**.

If you just want to get started, we recommend you check the [documentation](http://distilabel.argilla.io/). Curious, and want to know more? Keep reading!
<!-- ![overview](https://github.com/argilla-io/distilabel/assets/36760800/360110da-809d-4e24-a29b-1a1a8bc4f9b7)  -->

## Why use Distilabel?

Whether you are working on **a predictive model** that computes semantic similarity or the next **generative model** that is going to beat the LLM benchmarks. Our framework ensures that the **hard data work pays off**. Distilabel is the missing piece that helps you **synthesize data** and provide **AI feedback**.

### Improve your AI output quality through data quality

Compute is expensive and output quality is important. We help you **focus on data quality**, which tackles the root cause of both of these problems at once. Distilabel helps you to synthesize and judge data to let you spend your valuable time on **achieveing and keeping high-quality standards for your data**.

### Take control of your data and models

**Ownership of data for fine-tuning your own LLMs** is not easy but Distilabel can help you to get started. We integrate **AI feedback from any LLM provider out there** using one unified API.

### Improve efficiency by quickly iterating on the right research and LLMs

Synthesize and judge data with **latest research papers** while ensuring **flexibility, scalability and fault tolerance**. So you can focus on improving your data and training your models.

## 🏘️ Community

We are an open-source community-driven project and we love to hear from you. Here are some ways to get involved:

- [Community Meetup](https://lu.ma/embed-checkout/evt-IQtRiSuXZCIW6FB): listen in or present during one of our bi-weekly events.

- [Slack](https://join.slack.com/t/rubrixworkspace/shared_invite/zt-whigkyjn-a3IUJLD7gDbTZ0rKlvcJ5g): get direct support from the community.

- [Roadmap](https://github.com/orgs/argilla-io/projects/10/views/1): plans change but we love to discuss those with our community so feel encouraged to participate.

## What do people build with Distilabel?

Distilabel is a tool that can be used to **synthesize data and provide AI feedback**. Our community uses Distilabel to create amazing [datasets](https://huggingface.co/datasets?other=distilabel) and [models](https://huggingface.co/models?other=distilabel), and **we love contributions to open-source** ourselves too.

- The [1M OpenHermesPreference](https://huggingface.co/datasets/argilla/OpenHermesPreferences) is a dataset of ~1 million AI preferences derived from teknium/OpenHermes-2.5. It shows how we can use Distilabel to **synthesize data on an immense scale**.
- Our [distilabeled Intel Orca DPO dataset](https://huggingface.co/datasets/argilla/distilabel-intel-orca-dpo-pairs) and the [improved OpenHermes model](https://huggingface.co/argilla/distilabeled-OpenHermes-2.5-Mistral-7B),, show how we **improve model performance by filtering out 50%** of the original dataset through **AI feedback**.
- The [haiku DPO data](https://github.com/davanstrien/haiku-dpo) outlines how anyone can create a **dataset for a specific task** and **the latest research papers** to improve the quality of the dataset.

## 👨🏽‍💻 Installation

```sh
pip install distilabel --upgrade
```

Requires Python 3.8+

In addition, the following extras are available:

- `anthropic`: for using models available in [Anthropic API](https://www.anthropic.com/api) via the `AnthropicLLM` integration.
- `cohere`: for using models available in [Cohere](https://cohere.ai/) via the `CohereLLM` integration.
- `argilla`: for exporting the generated datasets to [Argilla](https://argilla.io/).
- `hf-inference-endpoints`: for using the [Hugging Face Inference Endpoints](https://huggingface.co/inference-endpoints) via the `InferenceEndpointsLLM` integration.
- `hf-transformers`: for using models available in [transformers](https://github.com/huggingface/transformers) package via the `TransformersLLM` integration.
- `litellm`: for using [`LiteLLM`](https://github.com/BerriAI/litellm) to call any LLM using OpenAI format via the `LiteLLM` integration.
- `llama-cpp`: for using [llama-cpp-python](https://github.com/abetlen/llama-cpp-python) Python bindings for `llama.cpp` via the `LlamaCppLLM` integration.
- `mistralai`: for using models available in [Mistral AI API](https://mistral.ai/news/la-plateforme/) via the `MistralAILLM` integration.
- `ollama`: for using [Ollama](https://ollama.com/) and their available models via `OllamaLLM` integration.
- `openai`: for using [OpenAI API](https://openai.com/blog/openai-api) models via the `OpenAILLM` integration, or the rest of the integrations based on OpenAI and relying on its client as `AnyscaleLLM`, `AzureOpenAILLM`, and `TogetherLLM`.
- `vertexai`: for using [Google Vertex AI](https://cloud.google.com/vertex-ai) proprietary models via the `VertexAILLM` integration.
- `vllm`: for using [vllm](https://github.com/vllm-project/vllm) serving engine via the `vLLM` integration.

### Example

To run the following example you must install `distilabel` with both `openai` extra:

```sh
pip install "distilabel[openai]" --upgrade
```

Then run:

```python
from distilabel.llms import OpenAILLM
from distilabel.pipeline import Pipeline
from distilabel.steps import LoadHubDataset
from distilabel.steps.tasks import TextGeneration

with Pipeline(
    name="simple-text-generation-pipeline",
    description="A simple text generation pipeline",
) as pipeline:
    load_dataset = LoadHubDataset(
        name="load_dataset",
        output_mappings={"prompt": "instruction"},
    )

    generate_with_openai = TextGeneration(
        name="generate_with_gpt35", llm=OpenAILLM(model="gpt-3.5-turbo")
    )

    load_dataset.connect(generate_with_openai)

if __name__ == "__main__":
    distiset = pipeline.run(
        parameters={
            "load_dataset": {
                "repo_id": "distilabel-internal-testing/instruction-dataset-mini",
                "split": "test",
            },
            "generate_with_gpt35": {
                "llm": {
                    "generation_kwargs": {
                        "temperature": 0.7,
                        "max_new_tokens": 512,
                    }
                }
            },
        },
    )
```

## Badges

If you build something cool with `distilabel` consider adding one of these badges to your dataset or model card.

    [<img src="https://raw.githubusercontent.com/argilla-io/distilabel/main/docs/assets/distilabel-badge-light.png" alt="Built with Distilabel" width="200" height="32"/>](https://github.com/argilla-io/distilabel)

[<img src="https://raw.githubusercontent.com/argilla-io/distilabel/main/docs/assets/distilabel-badge-light.png" alt="Built with Distilabel" width="200" height="32"/>](https://github.com/argilla-io/distilabel)

    [<img src="https://raw.githubusercontent.com/argilla-io/distilabel/main/docs/assets/distilabel-badge-dark.png" alt="Built with Distilabel" width="200" height="32"/>](https://github.com/argilla-io/distilabel)

[<img src="https://raw.githubusercontent.com/argilla-io/distilabel/main/docs/assets/distilabel-badge-dark.png" alt="Built with Distilabel" width="200" height="32"/>](https://github.com/argilla-io/distilabel)

## Contribute

To directly contribute with `distilabel`, check our [good first issues](https://github.com/argilla-io/distilabel/issues?q=is%3Aissue+is%3Aopen+label%3A%22good+first+issue%22) or [open a new one](https://github.com/argilla-io/distilabel/issues/new/choose).


            

Raw data

            {
    "_id": null,
    "home_page": null,
    "name": "distilabel",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.8",
    "maintainer_email": null,
    "keywords": "alignment, annotation, data, llm, rlaif, synthetic",
    "author": null,
    "author_email": "Argilla <admin@argilla.io>",
    "download_url": "https://files.pythonhosted.org/packages/5b/56/ad9718ecab84260d7e8747976b44c772f099e3553cc28c635ee08a97a5a0/distilabel-1.0.3.tar.gz",
    "platform": null,
    "description": "<div align=\"center\">\n  <picture>\n    <source media=\"(prefers-color-scheme: dark)\" srcset=\"https://github.com/argilla-io/distilabel/blob/main/docs/assets/distilabel-white.png?raw=true\">\n    <img alt=\"Distilabel Logo\" src=\"https://raw.githubusercontent.com/argilla-io/distilabel/main/docs/assets/distilabel-black.png\">\n  </picture>\n</div>\n<h3 align=\"center\">Synthesize data for AI and add feedback on the fly!</h2>\n\n<p align=\"center\">\n<a  href=\"https://pypi.org/project/distilabel/\">\n<img alt=\"CI\" src=\"https://img.shields.io/pypi/v/distilabel.svg?style=flat-round&logo=pypi&logoColor=white\">\n</a>\n<a href=\"https://pepy.tech/project/distilabel\">\n<img alt=\"CI\" src=\"https://static.pepy.tech/personalized-badge/distilabel?period=month&units=international_system&left_color=grey&right_color=blue&left_text=pypi%20downloads/month\">\n</a>\n</p>\n\n<p align=\"center\">\n<a href=\"https://twitter.com/argilla_io\">\n<img src=\"https://img.shields.io/badge/twitter-black?logo=x\"/>\n</a>\n<a href=\"https://www.linkedin.com/company/argilla-io\">\n<img src=\"https://img.shields.io/badge/linkedin-blue?logo=linkedin\"/>\n</a>\n<a href=\"https://join.slack.com/t/rubrixworkspace/shared_invite/zt-whigkyjn-a3IUJLD7gDbTZ0rKlvcJ5g\">\n<img src=\"https://img.shields.io/badge/slack-purple?logo=slack\"/>\n</a>\n</p>\n\nDistilabel is the **framework for synthetic data and AI feedback for AI engineers** that require **high-quality outputs, full data ownership, and overall efficiency**.\n\nIf you just want to get started, we recommend you check the [documentation](http://distilabel.argilla.io/). Curious, and want to know more? Keep reading!\n<!-- ![overview](https://github.com/argilla-io/distilabel/assets/36760800/360110da-809d-4e24-a29b-1a1a8bc4f9b7)  -->\n\n## Why use Distilabel?\n\nWhether you are working on **a predictive model** that computes semantic similarity or the next **generative model** that is going to beat the LLM benchmarks. Our framework ensures that the **hard data work pays off**. Distilabel is the missing piece that helps you **synthesize data** and provide **AI feedback**.\n\n### Improve your AI output quality through data quality\n\nCompute is expensive and output quality is important. We help you **focus on data quality**, which tackles the root cause of both of these problems at once. Distilabel helps you to synthesize and judge data to let you spend your valuable time on **achieveing and keeping high-quality standards for your data**.\n\n### Take control of your data and models\n\n**Ownership of data for fine-tuning your own LLMs** is not easy but Distilabel can help you to get started. We integrate **AI feedback from any LLM provider out there** using one unified API.\n\n### Improve efficiency by quickly iterating on the right research and LLMs\n\nSynthesize and judge data with **latest research papers** while ensuring **flexibility, scalability and fault tolerance**. So you can focus on improving your data and training your models.\n\n## \ud83c\udfd8\ufe0f Community\n\nWe are an open-source community-driven project and we love to hear from you. Here are some ways to get involved:\n\n- [Community Meetup](https://lu.ma/embed-checkout/evt-IQtRiSuXZCIW6FB): listen in or present during one of our bi-weekly events.\n\n- [Slack](https://join.slack.com/t/rubrixworkspace/shared_invite/zt-whigkyjn-a3IUJLD7gDbTZ0rKlvcJ5g): get direct support from the community.\n\n- [Roadmap](https://github.com/orgs/argilla-io/projects/10/views/1): plans change but we love to discuss those with our community so feel encouraged to participate.\n\n## What do people build with Distilabel?\n\nDistilabel is a tool that can be used to **synthesize data and provide AI feedback**. Our community uses Distilabel to create amazing [datasets](https://huggingface.co/datasets?other=distilabel) and [models](https://huggingface.co/models?other=distilabel), and **we love contributions to open-source** ourselves too.\n\n- The [1M OpenHermesPreference](https://huggingface.co/datasets/argilla/OpenHermesPreferences) is a dataset of ~1 million AI preferences derived from teknium/OpenHermes-2.5. It shows how we can use Distilabel to **synthesize data on an immense scale**.\n- Our [distilabeled Intel Orca DPO dataset](https://huggingface.co/datasets/argilla/distilabel-intel-orca-dpo-pairs) and the [improved OpenHermes model](https://huggingface.co/argilla/distilabeled-OpenHermes-2.5-Mistral-7B),, show how we **improve model performance by filtering out 50%** of the original dataset through **AI feedback**.\n- The [haiku DPO data](https://github.com/davanstrien/haiku-dpo) outlines how anyone can create a **dataset for a specific task** and **the latest research papers** to improve the quality of the dataset.\n\n## \ud83d\udc68\ud83c\udffd\u200d\ud83d\udcbb Installation\n\n```sh\npip install distilabel --upgrade\n```\n\nRequires Python 3.8+\n\nIn addition, the following extras are available:\n\n- `anthropic`: for using models available in [Anthropic API](https://www.anthropic.com/api) via the `AnthropicLLM` integration.\n- `cohere`: for using models available in [Cohere](https://cohere.ai/) via the `CohereLLM` integration.\n- `argilla`: for exporting the generated datasets to [Argilla](https://argilla.io/).\n- `hf-inference-endpoints`: for using the [Hugging Face Inference Endpoints](https://huggingface.co/inference-endpoints) via the `InferenceEndpointsLLM` integration.\n- `hf-transformers`: for using models available in [transformers](https://github.com/huggingface/transformers) package via the `TransformersLLM` integration.\n- `litellm`: for using [`LiteLLM`](https://github.com/BerriAI/litellm) to call any LLM using OpenAI format via the `LiteLLM` integration.\n- `llama-cpp`: for using [llama-cpp-python](https://github.com/abetlen/llama-cpp-python) Python bindings for `llama.cpp` via the `LlamaCppLLM` integration.\n- `mistralai`: for using models available in [Mistral AI API](https://mistral.ai/news/la-plateforme/) via the `MistralAILLM` integration.\n- `ollama`: for using [Ollama](https://ollama.com/) and their available models via `OllamaLLM` integration.\n- `openai`: for using [OpenAI API](https://openai.com/blog/openai-api) models via the `OpenAILLM` integration, or the rest of the integrations based on OpenAI and relying on its client as `AnyscaleLLM`, `AzureOpenAILLM`, and `TogetherLLM`.\n- `vertexai`: for using [Google Vertex AI](https://cloud.google.com/vertex-ai) proprietary models via the `VertexAILLM` integration.\n- `vllm`: for using [vllm](https://github.com/vllm-project/vllm) serving engine via the `vLLM` integration.\n\n### Example\n\nTo run the following example you must install `distilabel` with both `openai` extra:\n\n```sh\npip install \"distilabel[openai]\" --upgrade\n```\n\nThen run:\n\n```python\nfrom distilabel.llms import OpenAILLM\nfrom distilabel.pipeline import Pipeline\nfrom distilabel.steps import LoadHubDataset\nfrom distilabel.steps.tasks import TextGeneration\n\nwith Pipeline(\n    name=\"simple-text-generation-pipeline\",\n    description=\"A simple text generation pipeline\",\n) as pipeline:\n    load_dataset = LoadHubDataset(\n        name=\"load_dataset\",\n        output_mappings={\"prompt\": \"instruction\"},\n    )\n\n    generate_with_openai = TextGeneration(\n        name=\"generate_with_gpt35\", llm=OpenAILLM(model=\"gpt-3.5-turbo\")\n    )\n\n    load_dataset.connect(generate_with_openai)\n\nif __name__ == \"__main__\":\n    distiset = pipeline.run(\n        parameters={\n            \"load_dataset\": {\n                \"repo_id\": \"distilabel-internal-testing/instruction-dataset-mini\",\n                \"split\": \"test\",\n            },\n            \"generate_with_gpt35\": {\n                \"llm\": {\n                    \"generation_kwargs\": {\n                        \"temperature\": 0.7,\n                        \"max_new_tokens\": 512,\n                    }\n                }\n            },\n        },\n    )\n```\n\n## Badges\n\nIf you build something cool with `distilabel` consider adding one of these badges to your dataset or model card.\n\n    [<img src=\"https://raw.githubusercontent.com/argilla-io/distilabel/main/docs/assets/distilabel-badge-light.png\" alt=\"Built with Distilabel\" width=\"200\" height=\"32\"/>](https://github.com/argilla-io/distilabel)\n\n[<img src=\"https://raw.githubusercontent.com/argilla-io/distilabel/main/docs/assets/distilabel-badge-light.png\" alt=\"Built with Distilabel\" width=\"200\" height=\"32\"/>](https://github.com/argilla-io/distilabel)\n\n    [<img src=\"https://raw.githubusercontent.com/argilla-io/distilabel/main/docs/assets/distilabel-badge-dark.png\" alt=\"Built with Distilabel\" width=\"200\" height=\"32\"/>](https://github.com/argilla-io/distilabel)\n\n[<img src=\"https://raw.githubusercontent.com/argilla-io/distilabel/main/docs/assets/distilabel-badge-dark.png\" alt=\"Built with Distilabel\" width=\"200\" height=\"32\"/>](https://github.com/argilla-io/distilabel)\n\n## Contribute\n\nTo directly contribute with `distilabel`, check our [good first issues](https://github.com/argilla-io/distilabel/issues?q=is%3Aissue+is%3Aopen+label%3A%22good+first+issue%22) or [open a new one](https://github.com/argilla-io/distilabel/issues/new/choose).\n\n",
    "bugtrack_url": null,
    "license": null,
    "summary": "AI Feedback (AIF) framework",
    "version": "1.0.3",
    "project_urls": {
        "Documentation": "https://distilabel.argilla.io/",
        "Issues": "https://github.com/argilla/distilabel/issues",
        "Source": "https://github.com/argilla/distilabel"
    },
    "split_keywords": [
        "alignment",
        " annotation",
        " data",
        " llm",
        " rlaif",
        " synthetic"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "78bcd2095cfe49a3d552814c58ca79d16b38f8c340ddf65c559e9e167bf08d0a",
                "md5": "f307d71bad7b7212923a5daa0c5dfec1",
                "sha256": "6a43380671fb60b26e7a68600409315794bc3f907d7da1374d63a192e1b8e233"
            },
            "downloads": -1,
            "filename": "distilabel-1.0.3-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "f307d71bad7b7212923a5daa0c5dfec1",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.8",
            "size": 185632,
            "upload_time": "2024-04-25T12:48:58",
            "upload_time_iso_8601": "2024-04-25T12:48:58.256554Z",
            "url": "https://files.pythonhosted.org/packages/78/bc/d2095cfe49a3d552814c58ca79d16b38f8c340ddf65c559e9e167bf08d0a/distilabel-1.0.3-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "5b56ad9718ecab84260d7e8747976b44c772f099e3553cc28c635ee08a97a5a0",
                "md5": "f2aea43ddbf6b82045247179a811a2e9",
                "sha256": "01aecdc050f8e679f6a3635bd3d1e25bd906f5931c93efc250f9a483021c06ae"
            },
            "downloads": -1,
            "filename": "distilabel-1.0.3.tar.gz",
            "has_sig": false,
            "md5_digest": "f2aea43ddbf6b82045247179a811a2e9",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.8",
            "size": 4217863,
            "upload_time": "2024-04-25T12:49:00",
            "upload_time_iso_8601": "2024-04-25T12:49:00.022207Z",
            "url": "https://files.pythonhosted.org/packages/5b/56/ad9718ecab84260d7e8747976b44c772f099e3553cc28c635ee08a97a5a0/distilabel-1.0.3.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-04-25 12:49:00",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "argilla",
    "github_project": "distilabel",
    "github_not_found": true,
    "lcname": "distilabel"
}
        
Elapsed time: 0.29928s