deltatorch


- Name: deltatorch
- Version: 0.0.3
- Home page: https://github.com/mshtelma/deltatorch/
- Summary: DeltaTorch allows loading training data from DeltaLake tables for training Deep Learning models using PyTorch
- Upload time: 2023-10-31 11:13:38
- Author: Michael Shtelma
- Requires Python: >=3.8,<4.0
- License: Apache-2.0
- Keywords: delta, torch, pytorch, deltalake
- Requirements: No requirements were recorded.

# deltatorch

![image](https://github.com/mshtelma/deltatorch/actions/workflows/ci.yml/badge.svg)
![image](https://github.com/mshtelma/deltatorch/actions/workflows/black.yml/badge.svg)
![image](https://github.com/mshtelma/deltatorch/actions/workflows/lint.yml/badge.svg)

## Concept

`deltatorch` allows users to use `DeltaLake` tables directly as a data source for training with PyTorch.
Using `deltatorch`, users can create a PyTorch `DataLoader` that loads the training data from the table.
Distributed training using PyTorch DDP is supported as well.

## Why yet another data-loading framework?

- Many Deep Learning projects struggle with efficient data loading, especially with tabular datasets or datasets containing many small images.
- Classical Big Data formats like Parquet can help with this issue, but are hard to operate:
  * Writers might block readers
  * A failed write can make the whole dataset unreadable
  * More complicated projects might ingest data continuously, even during training

The Delta Lake storage format addresses these issues, but PyTorch has no direct support for `DeltaLake` datasets.
`deltatorch` adds this support, so `DeltaLake` tables can be used to train Deep Learning models with PyTorch.

## Usage

### Requirements

- Python >= 3.8 and < 4.0
- `pip` or `conda`

### Installation

- with `pip`:

```
pip install git+https://github.com/delta-incubator/deltatorch
```

### Create a PyTorch DataLoader to read a DeltaLake table

To use `deltatorch`, we first need a DeltaLake table containing the training data for our PyTorch deep learning model.
There is one requirement: the table must have an autoincrement ID field. `deltatorch` uses this field to shard the data and parallelize loading.
After that, we can use the `create_pytorch_dataloader` function to create a PyTorch DataLoader, which can be used directly during training.
Below is an example of creating a DataLoader for the following table schema:


```sql
CREATE TABLE TRAINING_DATA 
(   
    image BINARY,   
    label BIGINT,   
    id INT
) 
USING delta LOCATION 'path' 
```
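
To make the role of the `id` column more concrete, the sketch below shows one way a contiguous ID range could be divided among readers. It is an illustration only and not `deltatorch`'s actual implementation: `split_id_range` and `shard_for` are hypothetical helpers, and the choice of contiguous, evenly sized shards per DDP rank and worker is an assumption made for this example.

```python
from typing import List, Tuple


def split_id_range(min_id: int, max_id: int, num_shards: int) -> List[Tuple[int, int]]:
    """Split the inclusive range [min_id, max_id] into half-open [start, end) shards."""
    if num_shards <= 0:
        raise ValueError("num_shards must be positive")
    if max_id < min_id:
        raise ValueError("max_id must be >= min_id")
    total = max_id - min_id + 1
    base, extra = divmod(total, num_shards)
    shards = []
    start = min_id
    for i in range(num_shards):
        # The first `extra` shards take one additional ID each.
        size = base + (1 if i < extra else 0)
        shards.append((start, start + size))
        start += size
    return shards


def shard_for(rank: int, world_size: int, worker: int, num_workers: int,
              min_id: int, max_id: int) -> Tuple[int, int]:
    """Pick the shard for one worker of one DDP rank (hypothetical layout)."""
    shards = split_id_range(min_id, max_id, world_size * num_workers)
    return shards[rank * num_workers + worker]


print(split_id_range(0, 9, 3))  # [(0, 4), (4, 7), (7, 10)]
```

Because the IDs are dense and increasing, each reader can be handed a disjoint range without scanning the table first, which is why a plain autoincrement column is enough for this kind of partitioning.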

After the table is ready, we can use the `create_pytorch_dataloader` function to create a PyTorch DataLoader:

```python
from deltatorch import create_pytorch_dataloader
from deltatorch import FieldSpec


def create_data_loader(path: str, batch_size: int, transform):
    # `transform` is a callable (for example, a torchvision transform)
    # that is applied to each decoded image.
    return create_pytorch_dataloader(
        # Path to the DeltaLake table
        path,
        # Autoincrement ID field
        id_field="id",
        # Fields which will be used during training
        fields=[
            FieldSpec(
                "image",
                # Load image using Pillow
                load_image_using_pil=True,
                # PyTorch Transform
                transform=transform,
            ),
            FieldSpec("label"),
        ],
        # Number of readers
        num_workers=2,
        # Shuffle data inside the record batches
        shuffle=True,
        # Batch size
        batch_size=batch_size,
    )
```
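
Before starting a long training run, it can help to pull a few batches and check their size. The helper below is a minimal sketch of such a smoke test. It assumes that each batch is a mapping keyed by field name (for example `batch["label"]`), which is not stated above and should be verified against your installed version; `summarize_batches` is a hypothetical helper, not part of `deltatorch`.

```python
from itertools import islice
from typing import Any, Dict, Iterable, Mapping


def summarize_batches(loader: Iterable[Mapping[str, Any]],
                      label_key: str = "label",
                      max_batches: int = 2) -> Dict[str, int]:
    """Read at most `max_batches` batches and count the samples seen."""
    batches = 0
    samples = 0
    for batch in islice(loader, max_batches):
        # ASSUMPTION: a batch is dict-like and batch[label_key] supports len()
        # (true for lists and for tensors with a leading batch dimension).
        samples += len(batch[label_key])
        batches += 1
    return {"batches": batches, "samples": samples}


# Stand-in data so the sketch runs without a DeltaLake table.
fake_loader = [
    {"image": [0, 0, 0], "label": [1, 2, 3]},
    {"image": [0, 0], "label": [4, 5]},
    {"image": [0], "label": [6]},
]
print(summarize_batches(fake_loader))  # {'batches': 2, 'samples': 5}
```

With a real table, the same call would take the loader returned by `create_data_loader` in place of `fake_loader`.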

            

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/mshtelma/deltatorch/",
    "name": "deltatorch",
    "maintainer": "",
    "docs_url": null,
    "requires_python": ">=3.8,<4.0",
    "maintainer_email": "",
    "keywords": "delta,torch,pytorch,deltalake",
    "author": "Michael Shtelma",
    "author_email": "mshtelma@gmail.com",
    "download_url": "https://files.pythonhosted.org/packages/81/42/09a597ed1a2df460150a7921b96c42b7318da89bc0c78b1d97ff96051443/deltatorch-0.0.3.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "Apache-2.0",
    "summary": "DeltaTorch allows loading training data from  DeltaLake tables for training Deep Learning models using PyTorch",
    "version": "0.0.3",
    "project_urls": {
        "Homepage": "https://github.com/mshtelma/deltatorch/"
    },
    "split_keywords": [
        "delta",
        "torch",
        "pytorch",
        "deltalake"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "4a0556d04a27b8299a05b76e2355c6fcda3007f6ad34710134eca2d937bc8f01",
                "md5": "bdcd7b39d7ddcb4ed85da8c3472894a2",
                "sha256": "a8f3726f16ea8f417f0dcdb19b1e6630aeb6c5c701649266300fa18903ec5bb2"
            },
            "downloads": -1,
            "filename": "deltatorch-0.0.3-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "bdcd7b39d7ddcb4ed85da8c3472894a2",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.8,<4.0",
            "size": 8236,
            "upload_time": "2023-10-31T11:13:36",
            "upload_time_iso_8601": "2023-10-31T11:13:36.355538Z",
            "url": "https://files.pythonhosted.org/packages/4a/05/56d04a27b8299a05b76e2355c6fcda3007f6ad34710134eca2d937bc8f01/deltatorch-0.0.3-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "814209a597ed1a2df460150a7921b96c42b7318da89bc0c78b1d97ff96051443",
                "md5": "42e239ba4f42e61024817dca55a65b19",
                "sha256": "07206b3e98348bd3c58f70ade670ce0149ac506c681ed2e24ced0f1f76f0c155"
            },
            "downloads": -1,
            "filename": "deltatorch-0.0.3.tar.gz",
            "has_sig": false,
            "md5_digest": "42e239ba4f42e61024817dca55a65b19",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.8,<4.0",
            "size": 7449,
            "upload_time": "2023-10-31T11:13:38",
            "upload_time_iso_8601": "2023-10-31T11:13:38.073209Z",
            "url": "https://files.pythonhosted.org/packages/81/42/09a597ed1a2df460150a7921b96c42b7318da89bc0c78b1d97ff96051443/deltatorch-0.0.3.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2023-10-31 11:13:38",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "mshtelma",
    "github_project": "deltatorch",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "lcname": "deltatorch"
}
        