pylance

Name: pylance
Version: 0.10.14
Summary: python wrapper for Lance columnar format
Author: Lance Devs <dev@lancedb.com>
Requires Python: >=3.8
Keywords: data-format, data-science, machine-learning, arrow, data-analytics
Upload time: 2024-04-18 01:34:28
# Python bindings for Lance Data Format

> :warning: **Under heavy development**

<div align="center">
<p align="center">

<img width="257" alt="Lance Logo" src="https://user-images.githubusercontent.com/917119/199353423-d3e202f7-0269-411d-8ff2-e747e419e492.png">

Lance is a new columnar data format for data science and machine learning
</p></div>

Why you should use Lance
1. It is an order of magnitude faster than Parquet for point queries and the nested data structures common in DS/ML
2. It comes with a fast vector index that delivers sub-millisecond nearest-neighbor search performance
3. It is automatically versioned and supports lineage and time travel for full reproducibility
4. It is already integrated with DuckDB/pandas/polars, and converting from/to Parquet takes two lines of code


## Quick start

**Installation**

```shell
pip install pylance
```

Make sure you have recent versions of pandas (1.5+), pyarrow (10.0+), and DuckDB (0.7.0+).
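
If you are unsure which versions you have installed, a quick check (a minimal sketch using the standard `__version__` attributes):

```python
import duckdb
import pandas as pd
import pyarrow as pa

# Compare against the minimums above: pandas 1.5+, pyarrow 10.0+, duckdb 0.7.0+
print("pandas:", pd.__version__)
print("pyarrow:", pa.__version__)
print("duckdb:", duckdb.__version__)
```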

**Converting to Lance**
```python
import lance

import pandas as pd
import pyarrow as pa
import pyarrow.dataset

df = pd.DataFrame({"a": [5], "b": [10]})
uri = "/tmp/test.parquet"
tbl = pa.Table.from_pandas(df)
pa.dataset.write_dataset(tbl, uri, format='parquet')

parquet = pa.dataset.dataset(uri, format='parquet')
lance.write_dataset(parquet, "/tmp/test.lance")
```
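
The Parquet round trip above is there to show interoperability; `lance.write_dataset` also accepts an in-memory pyarrow Table (as the vector-search example below does), so the same data can be written directly. A minimal sketch reusing `df` from above (the `/tmp/test_direct.lance` path is just for illustration):

```python
# Skip the Parquet intermediate and write the pyarrow Table straight to Lance.
lance.write_dataset(pa.Table.from_pandas(df), "/tmp/test_direct.lance")
```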

**Reading Lance data**
```python
dataset = lance.dataset("/tmp/test.lance")
assert isinstance(dataset, pa.dataset.Dataset)
```

**Pandas**
```python
df = dataset.to_table().to_pandas()
```
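
**Polars**

Polars (listed in the integrations above) can consume the same Arrow table. A minimal sketch, assuming `polars` is installed:

```python
import polars as pl

# Convert the Arrow table returned by Lance into a polars DataFrame.
pl_df = pl.from_arrow(dataset.to_table())
print(pl_df)
```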

**DuckDB**
```python
import duckdb

# If this segfaults, make sure you have duckdb v0.7+ installed
duckdb.query("SELECT * FROM dataset LIMIT 10").to_df()
```

**Vector search**

Download the sift1m subset

```shell
wget ftp://ftp.irisa.fr/local/texmex/corpus/sift.tar.gz
tar -xzf sift.tar.gz
```

Convert it to Lance

```python
import lance
from lance.vector import vec_to_table
import numpy as np
import struct

nvecs = 1000000
ndims = 128
with open("sift/sift_base.fvecs", mode="rb") as fobj:
    buf = fobj.read()
    data = np.array(struct.unpack("<128000000f", buf[4 : 4 + 4 * nvecs * ndims])).reshape((nvecs, ndims))
    dd = dict(zip(range(nvecs), data))

table = vec_to_table(dd)
uri = "vec_data.lance"
sift1m = lance.write_dataset(table, uri, max_rows_per_group=8192, max_rows_per_file=1024*1024)
```

Build the index

```python
sift1m.create_index("vector",
                    index_type="IVF_PQ", 
                    num_partitions=256,  # IVF
                    num_sub_vectors=16)  # PQ
```

Search the dataset

```python
# Get top 10 similar vectors
import duckdb

dataset = lance.dataset(uri)

# Sample 100 query vectors. If this segfaults, make sure you have duckdb v0.7+ installed
sample = duckdb.query("SELECT vector FROM dataset USING SAMPLE 100").to_df()
query_vectors = np.array([np.array(x) for x in sample.vector])

# Get nearest neighbors for all of them
rs = [dataset.to_table(nearest={"column": "vector", "k": 10, "q": q})      
      for q in query_vectors]
```
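
Recall and latency can be traded off at query time. A hedged sketch, assuming the `nearest` dict also accepts `nprobes` and `refine_factor` keys, as in recent pylance releases:

```python
# nprobes: how many IVF partitions to probe (assumption: supported key)
# refine_factor: re-rank extra candidates with exact distances (assumption: supported key)
rs_tuned = dataset.to_table(
    nearest={
        "column": "vector",
        "q": query_vectors[0],
        "k": 10,
        "nprobes": 20,
        "refine_factor": 10,
    }
)
```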

*More distance metrics, HNSW, and distributed support are on the roadmap.*


## Python package details

- Install from PyPI: `pip install pylance` (version 0.3.0 and later is the new Rust-based implementation)
- Install from source: `maturin develop` (under the `/python` directory)
- Run unit tests: `make test`
- Run integration tests: `make integtest`

Import via: `import lance`

The Python integration is done via pyo3 plus custom Python code:

1. We make wrapper classes in Rust for Dataset/Scanner/RecordBatchReader that are exposed to Python.
2. These are then used by the LanceDataset / LanceScanner implementations that extend pyarrow Dataset/Scanner for DuckDB compatibility (see the sketch below).
3. Data is delivered via the Arrow C Data Interface.
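
In practice this means a `LanceDataset` behaves like any other pyarrow dataset. A minimal sketch using only the pyarrow `Dataset` surface, reusing the quick-start dataset:

```python
import lance
import pyarrow.dataset as pa_ds

ds = lance.dataset("/tmp/test.lance")
assert isinstance(ds, pa_ds.Dataset)

# Standard pyarrow Dataset operations work, which is what lets DuckDB
# scan a Lance dataset as if it were any other Arrow dataset.
print(ds.schema)
for batch in ds.to_batches():
    print(batch.num_rows)
```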

## Motivation

Why do we *need* a new format for data science and machine learning?

### 1. Reproducibility is a must-have

Versioning and experimentation support should be built into the dataset instead of requiring multiple tools.<br/>
It should also be efficient and not require expensive copying every time you want to create a new version.<br/>
We call this "zero-copy versioning" in Lance. It makes versioning data easy without increasing storage costs.
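
As a concrete illustration, a hedged sketch of time travel on the quick-start dataset, assuming the `versions()` method and the `version=` argument of `lance.dataset` (both present in current pylance releases):

```python
import lance

ds = lance.dataset("/tmp/test.lance")
print(ds.versions())  # each write creates a new version; no data is copied

# Re-open an earlier snapshot of the same dataset by version number.
old = lance.dataset("/tmp/test.lance", version=1)
print(old.to_table().to_pandas())
```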

### 2. Cloud storage is now the default

Remote object storage is now the default for data science and machine learning, and the performance characteristics of cloud storage are fundamentally different.<br/>
The Lance format is optimized to be cloud native. Common operations like filter-then-take can be an order of magnitude faster
using Lance than Parquet, especially for ML data.
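
For example, datasets can be opened directly from object storage by URI. A sketch assuming credentials are configured in the environment and a hypothetical bucket name:

```python
import lance

# Hypothetical S3 path; Lance's object-store layer resolves the URI and
# picks up credentials from the environment.
ds = lance.dataset("s3://my-bucket/datasets/test.lance")
print(ds.count_rows())
```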

### 3. Vectors must be a first class citizen, not a separate thing

The majority of reasonable-scale workflows should not require the added complexity and cost of a
specialized database just to compute vector similarity. Lance integrates optimized vector indices
into a columnar format, so no additional infrastructure is required to get low-latency top-K similarity search.

### 4. Open standards are a requirement

The DS/ML ecosystem is incredibly rich, and data *must be* easily accessible across different languages, tools, and environments.
Lance makes Apache Arrow integration its primary interface, which means conversion to or from Lance is two lines of code, your
code does not need to change after conversion, and nothing is locked up to force you to pay for vendor compute.
We need open source, not fauxpen source.
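
To make the "two lines of code" claim concrete, a sketch of a Lance-to-Parquet round trip on the quick-start dataset (file paths are just for illustration):

```python
import lance
import pyarrow.parquet as pq

# Lance -> Parquet: read the dataset as an Arrow table and write it out.
pq.write_table(lance.dataset("/tmp/test.lance").to_table(), "/tmp/roundtrip.parquet")

# Parquet -> Lance: the reverse direction.
lance.write_dataset(pq.read_table("/tmp/roundtrip.parquet"), "/tmp/roundtrip.lance")
```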



            
