xdlake

Name: xdlake
Version: 0.0.10
Home page: None
Summary: A loose implementation of the deltalake spec focused on extensibility and distributed data.
Upload time: 2024-10-12 19:53:28
Maintainer: None
Docs URL: None
Author: None
Requires Python: >=3.10
License: MIT
Keywords: deltalake, parquet
Requirements: pyarrow, fsspec
# xdlake

A loose implementation of [deltalake](https://delta.io) and the deltalake spec, written in Python on top of
[pyarrow](https://arrow.apache.org/docs/python/index.html), focused on extensibility, customizability, and distributed
data.

This is mostly inspired by the [deltalake package](https://github.com/delta-io/delta-rs), and is (much) less battle tested.
However, it is more flexible given its Pythonic design. If you're interested, give it a shot and maybe even help make it
better.

## Install
```
pip install xdlake
```

## Usage

#### Instantiation

Instantiate a table! The table can be local or remote. For remote tables, you may need to install the relevant
fsspec implementation, for instance s3fs, gcsfs, or adlfs for AWS S3, Google Storage, and Azure Storage,
respectively.

```
dt = xdlake.DeltaTable("path/to/my/cool/local/table")
dt = xdlake.DeltaTable("s3://path/to/my/cool/table")
dt = xdlake.DeltaTable("az://path/to/my/cool/table", storage_options=dict_of_azure_creds)
```
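For example, the fsspec backends named above can be installed directly, as needed:

```
pip install s3fs    # AWS S3
pip install gcsfs   # Google Cloud Storage
pip install adlfs   # Azure Storage
```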

#### Reads

Read the data. For fancy filtering and predicate pushdown, use `to_pyarrow_dataset` and
learn how to [filter pyarrow datasets](https://arrow.apache.org/docs/python/generated/pyarrow.dataset.Dataset.html#pyarrow.dataset.Dataset.filter).

```
ds = dt.to_pyarrow_dataset()
t = dt.to_pyarrow_table()
df = dt.to_pandas()
```
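As a minimal sketch of predicate pushdown (the column name `cats` is just an illustration), a filter expression can be applied when materializing the dataset:

```
import pyarrow.compute as pc

ds = dt.to_pyarrow_dataset()
# the filter is applied while scanning the underlying parquet files
filtered_table = ds.to_table(filter=pc.field("cats") == pc.scalar("A"))
```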

#### Writes

Instances of DeltaTable are immutable: any method that performs a table operation will return a new DeltaTable.

##### Write in-memory data

Write data from memory. Data can be pyarrow tables, datasets, record batches, pandas DataFrames, or iterables of those things.

```
dt = dt.write(my_cool_pandas_dataframe)
dt = dt.write(my_cool_arrow_table)
dt = dt.write(my_cool_arrow_dataset)
dt = dt.write(my_cool_arrow_record_batches)
dt = dt.write(pyarrow.Table.from_pandas(df))
```
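As a minimal, illustrative round trip (the table path and columns are made up for the example):

```
import pandas as pd
import xdlake

df = pd.DataFrame({"cats": ["A", "B", "A"], "float64": [0.1, 0.5, 0.9]})

dt = xdlake.DeltaTable("path/to/my/cool/local/table")
dt = dt.write(df)  # write() returns a new DeltaTable; capture the return value
print(dt.to_pandas())
```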

##### Import foreign data

Import references to foreign data without copying it. Data may be heterogeneously located in s3, gs, azure, and local storage,
and can be partitioned differently than the DeltaTable itself. Go hog wild.

See [Credentials](#Credentials) if you need different creds for different storage locations.

Import data from various locations in one go. This only works for non-partitioned data.
```
dt = dt.import_refs(["s3://some/aws/data", "gs://some/gcp/data", "az://some/azure/data" ])
dt = dt.import_refs(my_pyarrow_filesystem_dataset)
```

Partitioned data needs to be handled differently. First, you'll need to read up on
[pyarrow partitioning](https://arrow.apache.org/docs/python/generated/pyarrow.dataset.partitioning.html) to do it.
Second, you can only import one dataset at a time.
```
import pyarrow.dataset

# describe how the foreign files are partitioned (see the pyarrow partitioning docs)
foreign_partitioning = pyarrow.dataset.partitioning(...)
ds = pyarrow.dataset.dataset(
    list_of_files,
    partitioning=foreign_partitioning,
    partition_base_dir=partition_base_dir,
    filesystem=xdlake.storage.get_filesystem(foreign_refs_loc),
)
dt = dt.import_refs(ds, partition_by=my_partition_cols)
```

#### Deletes

Delete rows from a DeltaTable using [pyarrow expressions](https://arrow.apache.org/docs/python/generated/pyarrow.dataset.Expression.html#pyarrow.dataset.Expression):
```
import pyarrow.compute as pc
expr = (
    (pc.field("cats") == pc.scalar("A"))
    |
    (pc.field("float64") > pc.scalar(0.9))
)
dt = dt.delete(expr)
```

##### Deletion Vectors

I really want to support
[deletion vectors](https://github.com/delta-io/delta/blob/master/PROTOCOL.md#deletion-vectors), but pyarrow can't
filter parquet files by row indices (pretty basic if you ask me). If you also would like xdlake to
support deletion vectors, let the arrow folks know by chiming in
[here](https://github.com/apache/arrow/issues/35301).

#### Clone

You can clone a DeltaTable. This is a soft clone (no data is copied, and the new table just references the data). The entire version history is preserved. Subsequent writes go to the new location.

```
cloned_dt = dt.clone("the/location/of/the/clone")
```

#### Credentials

DeltaTables that reference distributed data may need credentials for various cloud locations.

To register default credentials for s3, gs, etc.:
```
xdlake.storage.register_default_filesystem_for_protocol("s3", s3_creds)
xdlake.storage.register_default_filesystem_for_protocol("gs", gs_creds)
xdlake.storage.register_default_filesystem_for_protocol("az", az_creds)
```

To register specific credentials for various prefixes:
```
xdlake.storage.register_filesystem("s3://bucket-doom/foo/bar", s3_creds)
xdlake.storage.register_filesystem("s3://bucket-zoom/biz/baz", other_s3_creds)
xdlake.storage.register_filesystem("az://container-blah/whiz/whaz", az_creds)
```
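The `*_creds` objects above aren't defined in this README; presumably they're the keyword arguments the corresponding fsspec filesystems accept. A hypothetical sketch:

```
# illustrative fsspec-style storage options (exact keys depend on the backend)
s3_creds = {"key": "<aws-access-key-id>", "secret": "<aws-secret-access-key>"}    # s3fs
gs_creds = {"token": "/path/to/service-account.json"}                             # gcsfs
az_creds = {"account_name": "<storage-account>", "account_key": "<account-key>"}  # adlfs

xdlake.storage.register_default_filesystem_for_protocol("s3", s3_creds)
```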

## Links
Project home page [GitHub](https://github.com/xbrianh/xdlake)  
The deltalake transaction log [protocol](https://github.com/delta-io/delta/blob/master/PROTOCOL.md)

### Bugs
Please report bugs, issues, feature requests, etc. on [GitHub](https://github.com/xbrianh/xdlake).

## Gitpod Workspace
[launch gitpod workspace](https://gitpod.io/#https://github.com/xbrianh/xdlake)

## Build Status
![main](https://github.com/xbrianh/xdlake/actions/workflows/cicd.yml/badge.svg)

            
