# Python bindings for Lance Data Format
> :warning: **Under heavy development**
<div align="center">
<p align="center">
<img width="257" alt="Lance Logo" src="https://user-images.githubusercontent.com/917119/199353423-d3e202f7-0269-411d-8ff2-e747e419e492.png">
Lance is a new columnar data format for data science and machine learning
</p></div>
Why you should use Lance:
1. It is an order of magnitude faster than Parquet for point queries and for the nested data structures common in DS/ML.
2. It comes with a fast vector index that delivers sub-millisecond nearest-neighbor search.
3. It is automatically versioned and supports lineage and time travel for full reproducibility.
4. It already integrates with DuckDB, pandas, and Polars, and converts from/to Parquet in 2 lines of code.
## Quick start
**Installation**
```shell
pip install pylance
```
Make sure you have recent versions of pandas (1.5+), pyarrow (10.0+), and DuckDB (0.7.0+).
**Converting to Lance**
```python
import lance
import pandas as pd
import pyarrow as pa
import pyarrow.dataset
df = pd.DataFrame({"a": [5], "b": [10]})
uri = "/tmp/test.parquet"
tbl = pa.Table.from_pandas(df)
pa.dataset.write_dataset(tbl, uri, format='parquet')
parquet = pa.dataset.dataset(uri, format='parquet')
lance.write_dataset(parquet, "/tmp/test.lance")
```
**Reading Lance data**
```python
dataset = lance.dataset("/tmp/test.lance")
assert isinstance(dataset, pa.dataset.Dataset)
```
**Pandas**
```python
df = dataset.to_table().to_pandas()
```
**DuckDB**
```python
import duckdb
# If this segfaults, make sure you have duckdb v0.7+ installed
duckdb.query("SELECT * FROM dataset LIMIT 10").to_df()
```
**Vector search**
Download the sift1m subset
```shell
wget ftp://ftp.irisa.fr/local/texmex/corpus/sift.tar.gz
tar -xzf sift.tar.gz
```
Convert it to Lance
```python
import lance
from lance.vector import vec_to_table
import numpy as np

nvecs = 1000000
ndims = 128
# Each .fvecs record is a 4-byte int32 dimension header followed by `ndims`
# float32 values, so read whole records and drop the header column.
raw = np.fromfile("sift/sift_base.fvecs", dtype="<f4")
data = raw.reshape((nvecs, ndims + 1))[:, 1:]
dd = dict(zip(range(nvecs), data))

table = vec_to_table(dd)
uri = "vec_data.lance"
sift1m = lance.write_dataset(table, uri, max_rows_per_group=8192, max_rows_per_file=1024*1024)
```
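To make the `.fvecs` layout concrete, here is a minimal standard-library sketch that writes and re-reads a tiny synthetic buffer. It assumes the usual TEXMEX layout (a little-endian int32 dimension followed by that many float32 values per vector); `pack_fvecs` and `parse_fvecs` are hypothetical helper names for illustration and are not part of Lance.

```python
import struct

def pack_fvecs(vectors):
    """Serialize vectors as .fvecs records: int32 dimension, then float32 values."""
    out = bytearray()
    for vec in vectors:
        out += struct.pack("<i", len(vec))
        out += struct.pack(f"<{len(vec)}f", *vec)
    return bytes(out)

def parse_fvecs(buf):
    """Parse a .fvecs byte buffer into a list of float lists."""
    vectors = []
    offset = 0
    while offset < len(buf):
        # Read the per-vector dimension header, then the values that follow it.
        (dim,) = struct.unpack_from("<i", buf, offset)
        offset += 4
        vectors.append(list(struct.unpack_from(f"<{dim}f", buf, offset)))
        offset += 4 * dim
    return vectors

sample = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
buf = pack_fvecs(sample)
parsed = parse_fvecs(buf)
print(len(buf), parsed)
```

Each record takes `4 + 4 * dim` bytes, which is why the header has to be skipped for every vector rather than only once at the start of the file.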
Build the index
```python
sift1m.create_index("vector",
                    index_type="IVF_PQ",
                    num_partitions=256,  # IVF
                    num_sub_vectors=16)  # PQ
```
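As a rough guide to what these two parameters mean for the SIFT data, the arithmetic below is a back-of-the-envelope sketch based on how IVF and PQ work in general, not a description of Lance's internal layout. The 8 bits per PQ code (`bits_per_code`) is an assumption made for illustration.

```python
nvecs = 1_000_000
ndims = 128
num_partitions = 256   # IVF: number of clusters the vectors are divided into
num_sub_vectors = 16   # PQ: number of chunks each vector is split into
bits_per_code = 8      # assumption: one byte per PQ code

# PQ needs the dimension to split evenly into sub-vectors.
assert ndims % num_sub_vectors == 0
dims_per_sub_vector = ndims // num_sub_vectors

# With balanced clusters, a query that probes one partition scans this many rows.
avg_rows_per_partition = nvecs / num_partitions

# Size of one vector before and after quantization.
raw_bytes = ndims * 4  # float32
pq_bytes = num_sub_vectors * bits_per_code // 8
compression_ratio = raw_bytes / pq_bytes

print(dims_per_sub_vector, avg_rows_per_partition, pq_bytes, compression_ratio)
```

More partitions mean fewer rows scanned per probe, and more sub-vectors mean larger codes with less quantization error, so both values trade speed and size against recall.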
Search the dataset
```python
# Get top 10 similar vectors
import duckdb
dataset = lance.dataset(uri)
# Sample 100 query vectors. If this segfaults, make sure you have duckdb v0.7+ installed
sample = duckdb.query("SELECT vector FROM dataset USING SAMPLE 100").to_df()
query_vectors = np.array([np.array(x) for x in sample.vector])
# Get nearest neighbors for all of them
rs = [dataset.to_table(nearest={"column": "vector", "k": 10, "q": q})
      for q in query_vectors]
```
*More distance metrics, HNSW, and distributed support are on the roadmap.
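
Because an IVF_PQ index returns approximate results, it can help to compare a few queries against an exact answer. The sketch below is a brute-force top-k search in plain Python on random data. It assumes L2 (Euclidean) distance, and `brute_force_knn` is a hypothetical helper for illustration, not a Lance function.

```python
import heapq
import math
import random

def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def brute_force_knn(vectors, query, k):
    """Return (row_id, distance) pairs for the k vectors closest to query."""
    scored = ((row_id, l2_distance(vec, query)) for row_id, vec in enumerate(vectors))
    return heapq.nsmallest(k, scored, key=lambda pair: pair[1])

random.seed(0)
vectors = [[random.random() for _ in range(8)] for _ in range(1000)]
query = vectors[42]  # querying with a stored vector should return itself first
neighbors = brute_force_knn(vectors, query, k=10)
print(neighbors[0])
```

Comparing the row ids from such an exact search with those returned by the indexed query gives a simple recall estimate, although the brute-force pass is only practical on small samples.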
## Python package details
Install from PyPI: `pip install pylance` (versions >=0.3.0 use the new Rust-based implementation)<br/>
Install from source: `maturin develop` (under the `/python` directory)<br/>
Run unit tests: `make test`<br/>
Run integration tests: `make integtest`<br/>
Import via: `import lance`
The Python integration is built with pyo3 plus custom Python code:
1. Rust wrapper classes for Dataset/Scanner/RecordBatchReader are exposed to Python.
2. These are used by the LanceDataset / LanceScanner implementations, which extend pyarrow Dataset/Scanner for DuckDB compatibility.
3. Data is delivered via the Arrow C Data Interface.
## Motivation
Why do we *need* a new format for data science and machine learning?
### 1. Reproducibility is a must-have
Versioning and experimentation support should be built into the dataset instead of requiring multiple tools.<br/>
It should also be efficient and not require expensive copying every time you want to create a new version.<br/>
We call this "zero-copy versioning" in Lance. It makes versioning data easy without increasing storage costs.
### 2. Cloud storage is now the default
Remote object storage is now the default for data science and machine learning, and the performance characteristics of cloud storage are fundamentally different.<br/>
The Lance format is optimized to be cloud native. Common operations like filter-then-take can be an order of magnitude faster
using Lance than Parquet, especially for ML data.
### 3. Vectors must be a first-class citizen, not a separate thing
The majority of reasonable-scale workflows should not require the added complexity and cost of a
specialized database just to compute vector similarity. Lance integrates optimized vector indices
into a columnar format, so no additional infrastructure is required to get low-latency top-K similarity search.
### 4. Open standards are a requirement
The DS/ML ecosystem is incredibly rich, and data *must be* easily accessible across different languages, tools, and environments.
Lance makes Apache Arrow integration its primary interface, which means converting to and from Lance takes 2 lines of code, your
code does not need to change after conversion, and nothing is locked up to force you to pay for vendor compute.
We need open source, not fauxpen source.
Raw data
{
"_id": null,
"home_page": null,
"name": "pylance",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.9",
"maintainer_email": null,
"keywords": "data-format, data-science, machine-learning, arrow, data-analytics",
"author": "Lance Devs <dev@lancedb.com>",
"author_email": "Lance Devs <dev@lancedb.com>",
"download_url": null,
"platform": null,
"bugtrack_url": null,
"license": null,
"summary": "python wrapper for Lance columnar format",
"version": "0.22.0",
"project_urls": null,
"split_keywords": [
"data-format",
" data-science",
" machine-learning",
" arrow",
" data-analytics"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "9e22ad54cfda2bbf7e217de0cc131e0ed2c879af7728d6331903e44dee8f8dfb",
"md5": "913d8914078ab1db1a5aa3f3c1925a73",
"sha256": "2c0bb6bf7320e500f0f5948e5b23e4d70d9c84bba15a2db2e877be9637c4dc0c"
},
"downloads": -1,
"filename": "pylance-0.22.0-cp39-abi3-macosx_10_15_x86_64.whl",
"has_sig": false,
"md5_digest": "913d8914078ab1db1a5aa3f3c1925a73",
"packagetype": "bdist_wheel",
"python_version": "cp39",
"requires_python": ">=3.9",
"size": 34412591,
"upload_time": "2025-01-13T21:25:22",
"upload_time_iso_8601": "2025-01-13T21:25:22.067207Z",
"url": "https://files.pythonhosted.org/packages/9e/22/ad54cfda2bbf7e217de0cc131e0ed2c879af7728d6331903e44dee8f8dfb/pylance-0.22.0-cp39-abi3-macosx_10_15_x86_64.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "28e454603e4ad6341240e507cd3b490e34cd0663610b59d5e6ba5a9d317cd421",
"md5": "0031a84e55c70412fcb0df3c730b221e",
"sha256": "341a8cbac762c1f446a05a1513dab1b7930f433a8331b08b0b89a975f3864f6a"
},
"downloads": -1,
"filename": "pylance-0.22.0-cp39-abi3-macosx_11_0_arm64.whl",
"has_sig": false,
"md5_digest": "0031a84e55c70412fcb0df3c730b221e",
"packagetype": "bdist_wheel",
"python_version": "cp39",
"requires_python": ">=3.9",
"size": 31889815,
"upload_time": "2025-01-13T21:10:24",
"upload_time_iso_8601": "2025-01-13T21:10:24.244427Z",
"url": "https://files.pythonhosted.org/packages/28/e4/54603e4ad6341240e507cd3b490e34cd0663610b59d5e6ba5a9d317cd421/pylance-0.22.0-cp39-abi3-macosx_11_0_arm64.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "ceedbf2b5e480d9ec620f261d9b5293ebb494934b42f30af62973df476ef8b7d",
"md5": "fdc9ed59902a5873562c490d2f6b4987",
"sha256": "29848127701f2188b331ad8399036f1fb79bacf5102fd030bfe9fd30cb02cf5b"
},
"downloads": -1,
"filename": "pylance-0.22.0-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl",
"has_sig": false,
"md5_digest": "fdc9ed59902a5873562c490d2f6b4987",
"packagetype": "bdist_wheel",
"python_version": "cp39",
"requires_python": ">=3.9",
"size": 38929145,
"upload_time": "2025-01-13T21:11:43",
"upload_time_iso_8601": "2025-01-13T21:11:43.272771Z",
"url": "https://files.pythonhosted.org/packages/ce/ed/bf2b5e480d9ec620f261d9b5293ebb494934b42f30af62973df476ef8b7d/pylance-0.22.0-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "bf6c069ef2823c7366c529297493719e8a3f6b16a19bbaf42e6f5010307157ec",
"md5": "b5a6c52cd1f0821d9ca64fbb7941897a",
"sha256": "cd4cc3dd3772600092685282db8cd4c21eaa68f458445b3107bd01b43afb8f11"
},
"downloads": -1,
"filename": "pylance-0.22.0-cp39-abi3-manylinux_2_24_aarch64.whl",
"has_sig": false,
"md5_digest": "b5a6c52cd1f0821d9ca64fbb7941897a",
"packagetype": "bdist_wheel",
"python_version": "cp39",
"requires_python": ">=3.9",
"size": 36272984,
"upload_time": "2025-01-13T21:11:57",
"upload_time_iso_8601": "2025-01-13T21:11:57.601761Z",
"url": "https://files.pythonhosted.org/packages/bf/6c/069ef2823c7366c529297493719e8a3f6b16a19bbaf42e6f5010307157ec/pylance-0.22.0-cp39-abi3-manylinux_2_24_aarch64.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "50ff61e10792edab999d0cc0c89a409446d28bee0f47e157ebc5587c0f8fb332",
"md5": "c14ede092b9397606838be946592a46e",
"sha256": "8999e73ce180c977f91bb4629578d742b1e86fcf53e7d27b14d6d219395c17cd"
},
"downloads": -1,
"filename": "pylance-0.22.0-cp39-abi3-manylinux_2_28_x86_64.whl",
"has_sig": false,
"md5_digest": "c14ede092b9397606838be946592a46e",
"packagetype": "bdist_wheel",
"python_version": "cp39",
"requires_python": ">=3.9",
"size": 38322607,
"upload_time": "2025-01-13T21:11:40",
"upload_time_iso_8601": "2025-01-13T21:11:40.174769Z",
"url": "https://files.pythonhosted.org/packages/50/ff/61e10792edab999d0cc0c89a409446d28bee0f47e157ebc5587c0f8fb332/pylance-0.22.0-cp39-abi3-manylinux_2_28_x86_64.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "61f0b62b14630af78d468ff7b15cc21576910edbd73114795b49907b39df2841",
"md5": "d3ff117e536c48943cdbe582d2cd3ddc",
"sha256": "848f1a74dab14dc14bf05569404977cfcba9a95a44e513e5a3b32f1221bfa00f"
},
"downloads": -1,
"filename": "pylance-0.22.0-cp39-abi3-win_amd64.whl",
"has_sig": false,
"md5_digest": "d3ff117e536c48943cdbe582d2cd3ddc",
"packagetype": "bdist_wheel",
"python_version": "cp39",
"requires_python": ">=3.9",
"size": 34216608,
"upload_time": "2025-01-13T21:24:47",
"upload_time_iso_8601": "2025-01-13T21:24:47.835866Z",
"url": "https://files.pythonhosted.org/packages/61/f0/b62b14630af78d468ff7b15cc21576910edbd73114795b49907b39df2841/pylance-0.22.0-cp39-abi3-win_amd64.whl",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2025-01-13 21:25:22",
"github": false,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"lcname": "pylance"
}