fondant


- **Name**: fondant
- **Version**: 0.12.0
- **Home page**: https://github.com/ml6team/fondant
- **Summary**: Fondant - Large-scale data processing made easy and reusable
- **Upload time**: 2024-04-17 13:16:31
- **Maintainer**: ML6
- **Author**: ML6
- **Requires Python**: <3.12,>=3.9
- **License**: Apache-2.0
- **Keywords**: data, machine learning, fine-tuning, foundation models
<a id="top"></a>
<p align="center">
    <img src="https://raw.githubusercontent.com/ml6team/fondant/main/docs/art/fondant_banner.svg" style="height:225px;"/>
</p>
<p align="center">
    <i>
        <b>Production-ready</b> 
        data processing made 
        <b>easy</b> 
        and 
        <b>shareable</b>
    </i>
    <br>
    <a href="http://fondant.ai"><strong>Explore the docs »</strong></a>
    <br>
    <br>
    <a href="https://discord.gg/HnTdWhydGp"><img alt="Discord" src="https://dcbadge.vercel.app/api/server/HnTdWhydGp?style=flat-square"></a>
    <a href="https://pypi.org/project/fondant/"><img alt="PyPI version" src="https://img.shields.io/pypi/v/fondant?color=brightgreen&style=flat-square"></a>
    <a href="https://fondant.readthedocs.io/en/latest/license/"><img alt="License" src="https://img.shields.io/github/license/ml6team/fondant?style=flat-square&color=brightgreen"></a>
    <a href="https://github.com/ml6team/fondant/actions/workflows/pipeline.yaml"><img alt="GitHub Workflow Status" src="https://img.shields.io/github/actions/workflow/status/ml6team/fondant/pipeline.yaml?style=flat-square"></a>
    <a href="https://coveralls.io/github/ml6team/fondant?branch=main"><img alt="Coveralls" src="https://img.shields.io/coverallsCoverage/github/ml6team/fondant?style=flat-square"></a>
</p>
<br>

## 🪤 Why Fondant?

Fondant is a data framework that enables collaborative dataset building. It is designed for developing and crafting datasets together, sharing reusable operations and complete data processing trees. 

Fondant enables you to initialize datasets, apply various operations on them, and load datasets from other users. It assists in executing operations on managed services, sharing operations with others, and keeping track of your dataset versions. Fondant makes this all possible without moving the source data.


## 💨 Getting Started

Fondant lets you define workflows composed of both reusable and custom components. The following example uses the reusable `load_from_hf_hub` component to load a dataset from the Hugging Face Hub, then downloads the images and resizes them, resulting in a new dataset.


```python
# pipeline.py
import pyarrow as pa

from fondant.dataset import Dataset

raw_data = Dataset.create(
    "load_from_hf_hub",
    arguments={
        "dataset_name": "fondant-ai/fondant-cc-25m",
        "n_rows_to_load": 100,
    },
    produces={
        "alt_text": pa.string(),
        "image_url": pa.string(),
        "license_location": pa.string(),
        "license_type": pa.string(),
        "webpage_url": pa.string(),
        "surt_url": pa.string(),
        "top_level_domain": pa.string(),
    },
)

images = raw_data.apply(
    "download_images",
    arguments={
        "input_partition_rows": 100,
        "resize_mode": "no",
    },
)

dataset = images.apply(
    "resize_images",
    arguments={
        "resize_width": 128,
        "resize_height": 128,
    },
)

```
Custom use cases require custom components. Check out our [**step-by-step guide**](https://fondant.ai/en/latest/guides/first_dataset/) to learn how to build custom pipelines and components.

<p align="right">(<a href="#top">back to top</a>)</p>

### Running your pipeline

Once you have a pipeline, you can compile and run it with the built-in CLI:

```bash
fondant run local pipeline.py
```

To see all available runners and arguments, check the Fondant CLI help pages:

```bash
fondant --help
```

Or for a subcommand:

```bash
fondant <subcommand> --help
```

<p align="right">(<a href="#top">back to top</a>)</p>


## 🪄 How Fondant works

- **Dataset**: The building block of Fondant. A dataset is a collection of columns, and Fondant operates exclusively on datasets: we start with a dataset, augment it into a new dataset, and end with a dataset. Fondant optimizes data transfer by storing and loading only the columns that are needed, and it processes data based on the available partitions. The aim is to make these datasets shareable and to allow users to create their own datasets based on others.
- **Operation**: A transformation applied to a dataset, resulting in a new dataset. The operation loads the columns it needs and produces new or altered columns. A transformation can be anything from loading and filtering to adding a column or writing data. Fondant also makes operations shareable, so you can easily use an existing operation in your own workflow.
- **Shareable trees**: Datasets are the result of applying operations to other datasets, so the full lineage is baked in. This allows sharing not just the end product but the full history. Users can also continue from an existing dataset or branch off an existing graph.

![overview](docs/art/fondant_overview.png)
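
The three concepts above can be illustrated with a small toy model. This is a sketch for intuition only and not Fondant's implementation: `ToyDataset`, its `apply` signature, the `image`, `width` and `height` column names, and the `add_image_size` operation are all hypothetical. It shows that applying an operation yields a new dataset, that an operation only needs the columns it consumes, and that two datasets can branch off the same parent while keeping their full lineage.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ToyDataset:
    """Hypothetical stand-in for a dataset: column names plus lineage."""

    columns: frozenset
    operation: str  # name of the operation that produced this dataset
    parent: Optional["ToyDataset"] = None

    def apply(self, operation: str, consumes: set, produces: set) -> "ToyDataset":
        """Return a new dataset; the dataset it is applied to stays unchanged."""
        missing = consumes - self.columns
        if missing:
            raise ValueError(f"{operation} needs missing columns: {sorted(missing)}")
        return ToyDataset(self.columns | frozenset(produces), operation, parent=self)

    def lineage(self) -> list:
        """Operation names from the initial load down to this dataset."""
        names = []
        node = self
        while node is not None:
            names.append(node.operation)
            node = node.parent
        return names[::-1]


raw = ToyDataset(frozenset({"image_url", "alt_text"}), "load_from_hf_hub")
images = raw.apply("download_images", consumes={"image_url"}, produces={"image"})

# Two datasets branching off the same parent; each keeps its own history.
resized = images.apply("resize_images", consumes={"image"}, produces={"image"})
sized = images.apply("add_image_size", consumes={"image"}, produces={"width", "height"})

print(resized.lineage())  # ['load_from_hf_hub', 'download_images', 'resize_images']
print(sized.lineage())    # ['load_from_hf_hub', 'download_images', 'add_image_size']
```

Because every dataset holds a reference to its parent, sharing the final object in this toy model also shares the chain of operations that led to it.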

<p align="right">(<a href="#top">back to top</a>)</p>

## 🧩 Key Features

Here's what Fondant brings to the table:
- 🔧 Plug ‘n’ play composable data processing workflows
- 🧩 Library containing off-the-shelf reusable components
- 🐼 A simple Pandas-based interface for creating custom components
- 📊 Built-in lineage, caching, and data explorer
- 🚀 Production-ready, scalable deployment
- ☁️ Integration with runners across different clouds (Vertex, SageMaker, Kubeflow)

👉 **Check our [Component Hub](https://fondant.ai/en/latest/components/hub/) for an overview of all available components**

<p align="right">(<a href="#top">back to top</a>)</p>

## 🪄 Example pipelines

We have created several ready-made example pipelines for you to use as a starting point for exploring Fondant.
They are hosted as separate repositories, each containing a notebook tutorial, so you can easily clone them and get started:

📖 [**RAG tuning pipeline**](https://github.com/ml6team/fondant-usecase-RAG)  
End-to-end Fondant pipelines to index and evaluate RAG (Retrieval-Augmented Generation) systems.

🛋️ [**ControlNet Interior Design Pipeline**](https://github.com/ml6team/fondant-usecase-controlnet)  
An end-to-end Fondant pipeline to collect and process data for the fine-tuning of a ControlNet model, focusing on images related to interior design.

🖼️ [**Filter Creative Commons license images**](https://github.com/ml6team/fondant-usecase-filter-creative-commons)  
An end-to-end Fondant pipeline that starts from our Fondant-CC-25M Creative Commons image dataset and filters and downloads the desired images.

## ⚒️ Installation

First, run the minimal Fondant installation:

```bash
pip install fondant
```

Fondant also includes extra dependencies for specific runners, storage integrations, and publishing components to registries. The dependencies for the local runner (Docker) are included by default.

For more detailed installation options, check the [**installation page**](https://fondant.ai/en/latest/guides/installation/) in our documentation.
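
The package metadata for release 0.12.0 lists `requires_python` as `<3.12,>=3.9`, so `pip install fondant` is expected to resolve this release only on Python 3.9 through 3.11. A minimal sketch for checking the running interpreter against those bounds before installing is shown below. The `is_supported` helper and the constant names are hypothetical and not part of Fondant, and the bounds may differ for other releases.

```python
import sys

# Bounds copied from the 0.12.0 package metadata (requires_python: <3.12,>=3.9).
MIN_VERSION = (3, 9)
MAX_VERSION_EXCLUSIVE = (3, 12)


def is_supported(version=None):
    """Return True if the (major, minor) version falls within the supported range."""
    major_minor = tuple((version or sys.version_info)[:2])
    return MIN_VERSION <= major_minor < MAX_VERSION_EXCLUSIVE


if not is_supported():
    current = ".".join(str(part) for part in sys.version_info[:3])
    print(f"Python {current} is outside the range required by fondant 0.12.0")
```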


## 👭 Contributing

We welcome contributions of different kinds:

|     |     |
| --- | --- |
| **Issues** | If you encounter an issue or bug, please submit it as a [GitHub issue](https://github.com/ml6team/fondant/issues). You can also submit a pull request directly to fix any clear bugs. |
| **Suggestions and feedback** | Our roadmap and priorities are defined based on community feedback. To provide input, you can join [our Discord](https://discord.gg/HnTdWhydGp) or submit an idea in our [GitHub Discussions](https://github.com/ml6team/fondant/discussions/categories/ideas). |
| **Framework code contributions** | If you want to help with the development of the Fondant framework, have a look at the issues marked with the [good first issue](https://github.com/ml6team/fondant/issues?q=is%3Aopen+is%3Aissue+label%3A%22good+first+issue%22) label. If you want to add functionality, please submit an issue for it first. |
| **Reusable components** | Extending our library of reusable components is a great way to contribute. If you built a component that would be useful to other users, please submit a PR adding it to the [components/](https://github.com/ml6team/fondant/tree/main/src/fondant/components) directory. You can find a list of components open for contribution [here](https://github.com/ml6team/fondant/issues?q=is%3Aissue+is%3Aopen+label%3A%22Components%22), and your own ideas are also welcome! |

For a detailed view of the roadmap and day-to-day development, check our [GitHub project board](https://github.com/orgs/ml6team/projects/1).

You can also check out our [architecture](https://fondant.ai/en/latest/architecture/) page to familiarize yourself with the Fondant architecture and repository structure.

### Environment setup

We use [poetry](https://python-poetry.org/docs/) and [pre-commit](https://pre-commit.com/) to enable a smooth developer flow. Run the following commands to
set up your development environment:

```shell
pip install poetry
poetry install --all-extras
pre-commit install
```

<p align="right">(<a href="#top">back to top</a>)</p>



            

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/ml6team/fondant",
    "name": "fondant",
    "maintainer": "ML6",
    "docs_url": null,
    "requires_python": "<3.12,>=3.9",
    "maintainer_email": "fondant@ml6.eu",
    "keywords": "data, machine learning, fine-tuning, foundation models",
    "author": "ML6",
    "author_email": "fondant@ml6.eu",
    "download_url": "https://files.pythonhosted.org/packages/8a/2d/42fd8f29b4b8c3cd8aed7b4c4e5c0879136ebf4bfe70df891774a1642ee4/fondant-0.12.0.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "Apache-2.0",
    "summary": "Fondant - Large-scale data processing made easy and reusable",
    "version": "0.12.0",
    "project_urls": {
        "Homepage": "https://github.com/ml6team/fondant",
        "Repository": "https://github.com/ml6team/fondant"
    },
    "split_keywords": [
        "data",
        " machine learning",
        " fine-tuning",
        " foundation models"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "0033ea36701c2495aaa130ff48548d42f874fbffffd6505e4b30d477fac685aa",
                "md5": "f512064dcad0f412bc91d42b236e536b",
                "sha256": "1c8d7ff8857b4ec1b0b46fc0649837f5d32e2c2960fd28ab413f8cacb32fff30"
            },
            "downloads": -1,
            "filename": "fondant-0.12.0-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "f512064dcad0f412bc91d42b236e536b",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": "<3.12,>=3.9",
            "size": 91686,
            "upload_time": "2024-04-17T13:16:18",
            "upload_time_iso_8601": "2024-04-17T13:16:18.128398Z",
            "url": "https://files.pythonhosted.org/packages/00/33/ea36701c2495aaa130ff48548d42f874fbffffd6505e4b30d477fac685aa/fondant-0.12.0-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "8a2d42fd8f29b4b8c3cd8aed7b4c4e5c0879136ebf4bfe70df891774a1642ee4",
                "md5": "a93c975845f28491948727448385d533",
                "sha256": "29dd76abd5694083d5fd91a040f9ff09898a705eacf1b9e1b8fbc3cc3c95b4c2"
            },
            "downloads": -1,
            "filename": "fondant-0.12.0.tar.gz",
            "has_sig": false,
            "md5_digest": "a93c975845f28491948727448385d533",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": "<3.12,>=3.9",
            "size": 71209,
            "upload_time": "2024-04-17T13:16:31",
            "upload_time_iso_8601": "2024-04-17T13:16:31.494843Z",
            "url": "https://files.pythonhosted.org/packages/8a/2d/42fd8f29b4b8c3cd8aed7b4c4e5c0879136ebf4bfe70df891774a1642ee4/fondant-0.12.0.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-04-17 13:16:31",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "ml6team",
    "github_project": "fondant",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "tox": true,
    "lcname": "fondant"
}
        