# deltatorch
![CI](https://github.com/mshtelma/deltatorch/actions/workflows/ci.yml/badge.svg)
![black](https://github.com/mshtelma/deltatorch/actions/workflows/black.yml/badge.svg)
![lint](https://github.com/mshtelma/deltatorch/actions/workflows/lint.yml/badge.svg)
## Concept
`deltatorch` lets you use Delta Lake tables directly as a data source for PyTorch training.
With `deltatorch`, you create a PyTorch `DataLoader` that reads the training data straight from the table.
Distributed training with PyTorch DDP is supported as well.
## Why yet another data-loading framework?
- Many deep learning projects struggle with efficient data loading, especially with tabular datasets or datasets containing many small images.
- Classical big data formats like Parquet can help with this issue, but are hard to operate:
  * Writers might block readers.
  * A failed write can make the whole dataset unreadable.
  * More complicated projects might ingest data all the time, even during training.

The Delta Lake storage format solves all these issues, but PyTorch has no direct support for Delta Lake datasets.
`deltatorch` adds this support, so Delta Lake tables can be used to train deep learning models with PyTorch.
## Usage
### Requirements
- Python >= 3.8, < 4.0
- `pip` or `conda`
### Installation
- with `pip`:
```
pip install git+https://github.com/delta-incubator/deltatorch
```
### Create a PyTorch DataLoader to read a Delta Lake table
To use `deltatorch`, we first need a Delta Lake table containing the data we would like to use for training a PyTorch deep learning model.
The table must have an autoincrement ID field, which `deltatorch` uses for sharding and for parallelizing data loading.
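To illustrate why a dense, increasing ID column makes sharding straightforward, here is a minimal sketch that splits an inclusive ID range into contiguous shards, one per reader. The function `shard_id_range` is hypothetical and written only for this illustration; it is not part of the `deltatorch` API and does not claim to reproduce how `deltatorch` shards data internally.

```python
from typing import List, Tuple


def shard_id_range(min_id: int, max_id: int, num_shards: int) -> List[Tuple[int, int]]:
    """Split the inclusive range [min_id, max_id] into contiguous shards.

    Hypothetical helper for illustration only; not part of deltatorch.
    """
    if num_shards < 1:
        raise ValueError("num_shards must be at least 1")
    total = max_id - min_id + 1
    if total < 1:
        raise ValueError("max_id must not be smaller than min_id")

    # Distribute the remainder over the first `extra` shards so that
    # shard sizes differ by at most one record.
    base, extra = divmod(total, num_shards)
    shards = []
    start = min_id
    for index in range(num_shards):
        size = base + (1 if index < extra else 0)
        if size == 0:
            continue  # more shards than records: skip empty shards
        shards.append((start, start + size - 1))
        start += size
    return shards
```

Each reader could then restrict itself to rows whose `id` falls inside its own `(first, last)` pair, which works cleanly only if the IDs have no large gaps.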
Once the table exists, we can use the `create_pytorch_dataloader` function to create a PyTorch DataLoader that can be used directly during training.
Below is an example of creating a DataLoader for the following table schema:
```sql
CREATE TABLE TRAINING_DATA
(
image BINARY,
label BIGINT,
id INT
)
USING delta LOCATION 'path'
```
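If the training data starts out as in-memory samples rather than an existing table, one way to produce a table with this schema is sketched below. It assumes the separate `deltalake` and `pyarrow` packages are installed, which `deltatorch` itself does not require you to use for writing. The helpers `build_columns` and `write_training_table` are hypothetical names chosen for this example, and the sequential `id` values starting at `first_id` are an assumption about what "autoincrement" should look like.

```python
from typing import Dict, Iterable, List, Tuple


def build_columns(samples: Iterable[Tuple[bytes, int]], first_id: int = 0) -> Dict[str, List]:
    """Turn (image_bytes, label) pairs into columns with a sequential id."""
    columns: Dict[str, List] = {"image": [], "label": [], "id": []}
    for offset, (image_bytes, label) in enumerate(samples):
        columns["image"].append(image_bytes)
        columns["label"].append(label)
        columns["id"].append(first_id + offset)
    return columns


def write_training_table(path: str, columns: Dict[str, List]) -> None:
    """Append the columns to a Delta table at `path` (created if missing)."""
    # Imported lazily so build_columns stays usable without these packages.
    import pyarrow as pa
    from deltalake import write_deltalake

    # Column types mirror the SQL schema: BINARY, BIGINT, INT.
    schema = pa.schema(
        [("image", pa.binary()), ("label", pa.int64()), ("id", pa.int32())]
    )
    write_deltalake(path, pa.table(columns, schema=schema), mode="append")
```

When appending in several batches, pass a `first_id` that continues from the last written ID so the column stays unique and increasing.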
After the table is ready, we can use the `create_pytorch_dataloader` function to create a PyTorch DataLoader:
```python
from torchvision import transforms

from deltatorch import create_pytorch_dataloader
from deltatorch import FieldSpec

# Example transform (an assumption: the original snippet leaves
# `transform` undefined). Converts the decoded PIL image to a tensor.
transform = transforms.Compose([transforms.ToTensor()])


def create_data_loader(path: str, batch_size: int):
    return create_pytorch_dataloader(
        # Path to the DeltaLake table
        path,
        # Autoincrement ID field
        id_field="id",
        # Fields which will be used during training
        fields=[
            FieldSpec(
                "image",
                # Load image using Pillow
                load_image_using_pil=True,
                # PyTorch Transform
                transform=transform,
            ),
            FieldSpec("label"),
        ],
        # Number of readers
        num_workers=2,
        # Shuffle data inside the record batches
        shuffle=True,
        # Batch size
        batch_size=batch_size,
    )
```
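Before starting a long training run, it can help to look at the first few batches the loader yields. The sketch below is a generic helper that works with any iterable of batches. It assumes each batch is a dictionary keyed by field name (for example `"image"` and `"label"`), which this README does not state, so check what your loader returns. `describe_batches` is a hypothetical name and not part of `deltatorch`.

```python
from itertools import islice
from typing import Any, Dict, Iterable, List


def describe_batches(loader: Iterable[Dict[str, Any]], limit: int = 2) -> List[Dict[str, str]]:
    """Summarize the type (and shape, if available) of each field
    in the first `limit` batches.

    Assumes every batch is a dict keyed by field name.
    """
    summaries = []
    for batch in islice(loader, limit):
        summary = {}
        for name, value in batch.items():
            # Tensors and arrays expose `.shape`; plain lists do not.
            shape = getattr(value, "shape", None)
            suffix = str(tuple(shape)) if shape is not None else ""
            summary[name] = type(value).__name__ + suffix
        summaries.append(summary)
    return summaries
```

For example, `describe_batches(create_data_loader(path, 32))` would report one summary per inspected batch without consuming the rest of the loader's output.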
Raw data
{
"_id": null,
"home_page": "https://github.com/mshtelma/deltatorch/",
"name": "deltatorch",
"maintainer": "",
"docs_url": null,
"requires_python": ">=3.8,<4.0",
"maintainer_email": "",
"keywords": "delta,torch,pytorch,deltalake",
"author": "Michael Shtelma",
"author_email": "mshtelma@gmail.com",
"download_url": "https://files.pythonhosted.org/packages/81/42/09a597ed1a2df460150a7921b96c42b7318da89bc0c78b1d97ff96051443/deltatorch-0.0.3.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "Apache-2.0",
"summary": "DeltaTorch allows loading training data from DeltaLake tables for training Deep Learning models using PyTorch",
"version": "0.0.3",
"project_urls": {
"Homepage": "https://github.com/mshtelma/deltatorch/"
},
"split_keywords": [
"delta",
"torch",
"pytorch",
"deltalake"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "4a0556d04a27b8299a05b76e2355c6fcda3007f6ad34710134eca2d937bc8f01",
"md5": "bdcd7b39d7ddcb4ed85da8c3472894a2",
"sha256": "a8f3726f16ea8f417f0dcdb19b1e6630aeb6c5c701649266300fa18903ec5bb2"
},
"downloads": -1,
"filename": "deltatorch-0.0.3-py3-none-any.whl",
"has_sig": false,
"md5_digest": "bdcd7b39d7ddcb4ed85da8c3472894a2",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.8,<4.0",
"size": 8236,
"upload_time": "2023-10-31T11:13:36",
"upload_time_iso_8601": "2023-10-31T11:13:36.355538Z",
"url": "https://files.pythonhosted.org/packages/4a/05/56d04a27b8299a05b76e2355c6fcda3007f6ad34710134eca2d937bc8f01/deltatorch-0.0.3-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "814209a597ed1a2df460150a7921b96c42b7318da89bc0c78b1d97ff96051443",
"md5": "42e239ba4f42e61024817dca55a65b19",
"sha256": "07206b3e98348bd3c58f70ade670ce0149ac506c681ed2e24ced0f1f76f0c155"
},
"downloads": -1,
"filename": "deltatorch-0.0.3.tar.gz",
"has_sig": false,
"md5_digest": "42e239ba4f42e61024817dca55a65b19",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.8,<4.0",
"size": 7449,
"upload_time": "2023-10-31T11:13:38",
"upload_time_iso_8601": "2023-10-31T11:13:38.073209Z",
"url": "https://files.pythonhosted.org/packages/81/42/09a597ed1a2df460150a7921b96c42b7318da89bc0c78b1d97ff96051443/deltatorch-0.0.3.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2023-10-31 11:13:38",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "mshtelma",
"github_project": "deltatorch",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"lcname": "deltatorch"
}