:Name: datachain
:Version: 0.16.0
:Summary: Wrangle unstructured AI data at scale
:Upload time: 2025-04-18 23:57:44
:Requires Python: >=3.9

================
|logo| DataChain
================

|PyPI| |Python Version| |Codecov| |Tests|

.. |logo| image:: docs/assets/datachain.svg
   :height: 24
.. |PyPI| image:: https://img.shields.io/pypi/v/datachain.svg
   :target: https://pypi.org/project/datachain/
   :alt: PyPI
.. |Python Version| image:: https://img.shields.io/pypi/pyversions/datachain
   :target: https://pypi.org/project/datachain
   :alt: Python Version
.. |Codecov| image:: https://codecov.io/gh/iterative/datachain/graph/badge.svg?token=byliXGGyGB
   :target: https://codecov.io/gh/iterative/datachain
   :alt: Codecov
.. |Tests| image:: https://github.com/iterative/datachain/actions/workflows/tests.yml/badge.svg
   :target: https://github.com/iterative/datachain/actions/workflows/tests.yml
   :alt: Tests

DataChain is a Python-based AI-data warehouse for transforming and analyzing unstructured
data like images, audio, videos, text and PDFs. It integrates with external storage
(e.g. S3) to process data efficiently without data duplication and manages metadata
in an internal database for easy and efficient querying.


Use Cases
=========

1. **ETL.** A Pythonic framework for describing and running unstructured data transformations
   and enrichments, and for applying models, including LLMs, to data.
2. **Analytics.** A DataChain dataset is a table that combines all the information about data
   objects in one place. It also provides a dataframe-like API and a vectorized engine for
   running analytics on these tables at scale.
3. **Versioning.** DataChain does not store data or require moving or copying it (unlike DVC).
   An ideal use case is a bucket with thousands or millions of images, videos, audio files, or PDFs.

Getting Started
===============

Visit `Quick Start <https://docs.datachain.ai/quick-start>`_ and `Docs <https://docs.datachain.ai/>`_
to get started with `DataChain` and learn more.

.. code:: bash

        pip install datachain
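
To confirm the install from Python, a minimal sketch using only the standard
library is shown below. The helper name ``installed_version`` is hypothetical and
not part of DataChain; it returns ``None`` when the package is not installed.

.. code:: py

    import sys
    from importlib import metadata
    from typing import Optional


    def installed_version(package: str) -> Optional[str]:
        """Return the installed version of `package`, or None if it is absent."""
        try:
            return metadata.version(package)
        except metadata.PackageNotFoundError:
            return None


    # The package metadata declares requires_python >= 3.9.
    python_ok = sys.version_info >= (3, 9)
    print("Python version supported:", python_ok)
    print("datachain version:", installed_version("datachain"))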


Example: download a subset of files based on metadata
-----------------------------------------------------

Sometimes users only need to download a specific subset of files from cloud storage,
rather than the entire dataset.
For example, you could use a JSON file's metadata to download just cat images with
high confidence scores.


.. code:: py

    import datachain as dc

    meta = dc.read_json("gs://datachain-demo/dogs-and-cats/*json", column="meta", anon=True)
    images = dc.read_storage("gs://datachain-demo/dogs-and-cats/*jpg", anon=True)

    images_id = images.map(id=lambda file: file.path.split('.')[-2])
    annotated = images_id.merge(meta, on="id", right_on="meta.id")

    likely_cats = annotated.filter((dc.Column("meta.inference.confidence") > 0.93)
                                   & (dc.Column("meta.inference.class_") == "cat"))
    likely_cats.to_storage("high-confidence-cats/", signal="file")
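
The id extraction and the filter condition in the example can be checked without
cloud access. The sketch below mirrors them in plain Python. The file paths and
metadata records are made up for illustration, and ``extract_id`` and
``is_likely_cat`` are hypothetical helper names, not DataChain APIs.

.. code:: py

    def extract_id(path: str) -> str:
        """Same expression as the lambda above: second-to-last dot-separated part."""
        return path.split(".")[-2]


    def is_likely_cat(meta: dict, threshold: float = 0.93) -> bool:
        """Plain-Python equivalent of the filter condition on the merged rows."""
        inference = meta["inference"]
        return inference["confidence"] > threshold and inference["class_"] == "cat"


    # Hypothetical records shaped like the columns used in the example.
    paths = [
        "dogs-and-cats/cat.1.jpg",
        "dogs-and-cats/dog.2.jpg",
        "dogs-and-cats/cat.3.jpg",
    ]
    metas = {
        "1": {"id": "1", "inference": {"class_": "cat", "confidence": 0.98}},
        "2": {"id": "2", "inference": {"class_": "dog", "confidence": 0.99}},
        "3": {"id": "3", "inference": {"class_": "cat", "confidence": 0.61}},
    }

    # Join each path to its metadata by id, then keep high-confidence cats.
    selected = [p for p in paths if is_likely_cat(metas[extract_id(p)])]
    print(selected)

Note that the expression assumes every path contains at least one dot; a path
without one would raise ``IndexError``.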


Example: LLM-based text-file evaluation
---------------------------------------

In this example, we evaluate chatbot conversations stored in text files
using LLM-based evaluation.

.. code:: shell

    $ pip install mistralai # Requires version >=1.0.0
    $ export MISTRAL_API_KEY=_your_key_

Python code:

.. code:: py

    import os
    from mistralai import Mistral
    import datachain as dc

    PROMPT = "Was this dialog successful? Answer in a single word: Success or Failure."

    def eval_dialogue(file: dc.File) -> bool:
        # Ask the model for a one-word verdict on the dialogue text.
        client = Mistral(api_key=os.environ["MISTRAL_API_KEY"])
        response = client.chat.complete(
            model="open-mixtral-8x22b",
            messages=[{"role": "system", "content": PROMPT},
                      {"role": "user", "content": file.read()}])
        result = response.choices[0].message.content
        return result.lower().startswith("success")

    # Evaluate the files in parallel and save the results as a dataset.
    chain = (
        dc.read_storage("gs://datachain-demo/chatbot-KiT/", column="file", anon=True)
        .settings(parallel=4, cache=True)
        .map(is_success=eval_dialogue)
        .save("mistral_files")
    )

    # Export only the dialogues the model judged successful.
    successful_chain = chain.filter(dc.Column("is_success") == True)
    successful_chain.to_storage("./output_mistral")

    print(f"{successful_chain.count()} files were exported")



With the prompt above, the Mistral model judges 31 of the 50 files to contain successful dialogues:

.. code:: shell

    $ ls output_mistral/datachain-demo/chatbot-KiT/
    1.txt  15.txt 18.txt 2.txt  22.txt 25.txt 28.txt 33.txt 37.txt 4.txt  41.txt ...
    $ ls output_mistral/datachain-demo/chatbot-KiT/ | wc -l
    31
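
``eval_dialogue`` turns the model's free-text reply into a boolean with a
case-insensitive prefix check. The sketch below isolates that step with made-up
replies. ``parse_verdict`` is a hypothetical name, and the added ``strip()`` is an
assumption for tolerating leading whitespace, which the original code does not do.

.. code:: py

    def parse_verdict(reply: str) -> bool:
        """Return True when the reply starts with "success", ignoring case."""
        return reply.strip().lower().startswith("success")


    # Made-up replies illustrating which answers count as a success.
    replies = [
        "Success",
        "success.",
        " Success: the user got an answer",
        "Failure",
        "Unsuccessful",
    ]
    verdicts = [parse_verdict(reply) for reply in replies]
    print(verdicts)

A prefix check is deliberately loose: it accepts trailing punctuation or an
explanation after the verdict, and treats any other reply as a failure.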


Key Features
============

📂 **Multimodal Dataset Versioning.**
   - Version unstructured data without moving or copying it, by keeping
     references to files in S3, GCP, Azure, and local file systems.
   - Multimodal data support: images, video, text, PDF, JSON, CSV, Parquet, etc.
   - Unite files and metadata together into persistent, versioned, columnar datasets.

🐍 **Python-friendly.**
   - Operate on Python objects and object fields: float scores, strings, matrices,
     LLM response objects.
   - Run Python code on large, terabyte-scale datasets, with built-in
     parallelization and memory-efficient computing; no SQL or Spark required.

🧠 **Data Enrichment and Processing.**
   - Generate metadata using local AI models and LLM APIs.
   - Filter, join, and group datasets by metadata. Search by vector embeddings.
   - High-performance vectorized operations on Python objects: sum, count, avg, etc.
   - Pass datasets to PyTorch and TensorFlow, or export them back into storage.


Contributing
============

Contributions are very welcome. To learn more, see the `Contributor Guide`_.


Community and Support
=====================

* `Docs <https://docs.datachain.ai/>`_
* `File an issue`_ if you encounter any problems
* `Discord Chat <https://dvc.org/chat>`_
* `Email <mailto:support@dvc.org>`_
* `Twitter <https://twitter.com/DVCorg>`_


DataChain Studio Platform
=========================

`DataChain Studio`_ is a proprietary solution for teams that offers:

- **Centralized dataset registry** to manage data, code and
  dependencies in one place.
- **Data Lineage** for data sources as well as derivative datasets.
- **UI for Multimodal Data** like images, videos, and PDFs.
- **Scalable Compute** to handle large datasets (100M+ files) and in-house
  AI model inference.
- **Access control** including SSO and team-based collaboration.

.. _PyPI: https://pypi.org/
.. _file an issue: https://github.com/iterative/datachain/issues
.. github-only
.. _Contributor Guide: https://docs.datachain.ai/contributing
.. _Pydantic: https://github.com/pydantic/pydantic
.. _publicly available: https://radar.kit.edu/radar/en/dataset/FdJmclKpjHzLfExE.ExpBot%2B-%2BA%2Bdataset%2Bof%2B79%2Bdialogs%2Bwith%2Ban%2Bexperimental%2Bcustomer%2Bservice%2Bchatbot
.. _SQLite: https://www.sqlite.org/
.. _Getting Started: https://docs.datachain.ai/
.. _DataChain Studio: https://studio.datachain.ai/

            

:Author: Dmitry Petrov <support@dvc.org>
:Documentation: https://datachain.dvc.ai
:Issues: https://github.com/iterative/datachain/issues
:Source: https://github.com/iterative/datachain