# MUVERA FDE
Python bindings for Google's Fixed Dimensional Encoding (FDE) algorithm from the [graph-mining](https://github.com/google/graph-mining/tree/main/sketching/point_cloud) project, as described in Google Research's [MUVERA blog post](https://research.google/blog/muvera-making-multi-vector-retrieval-as-fast-as-single-vector-search/).
## Overview
This library wraps Google's Fixed Dimensional Encoding (FDE) implementation, a key component of MUVERA (Multi-Vector Retrieval Algorithm). FDE encodes a variable-sized set of vectors as a single fixed-dimensional vector, so that the inner product of two encodings approximates the Chamfer similarity between the original sets.
### Key Applications
- **Multi-vector retrieval**: Efficiently search collections of multi-vector data (e.g., video segments, document chunks)
- **Chamfer similarity approximation**: Approximate the Chamfer similarity between two sets of vectors (point clouds) with a single inner product
- **Fixed-size embeddings**: Convert variable-length sequences into fixed-size vectors for ML models
- **Efficient similarity search**: Enable fast nearest-neighbor search on multi-vector data
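
For reference, the Chamfer similarity that FDE approximates can be computed exactly in a few lines. The sketch below is plain Python and does not use this library; the helper names `dot` and `chamfer_similarity` are illustrative, and the definition assumed is the one from the MUVERA paper (for each query vector, take the best inner product with any document vector, then sum).

```python
from typing import Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def chamfer_similarity(query: Sequence[Vector], document: Sequence[Vector]) -> float:
    """Sum, over query vectors, of the largest inner product with any document vector."""
    if not query or not document:
        raise ValueError("both vector sets must be non-empty")
    return sum(max(dot(q, p) for p in document) for q in query)


query = [[1.0, 0.0], [0.0, 1.0]]
document = [[0.5, 0.5], [1.0, 0.0], [0.0, -1.0]]
score = chamfer_similarity(query, document)
print(score)  # 1.0 (best match for the first query vector) + 0.5 (second) = 1.5
```

The exact computation costs one inner product per query/document vector pair, which is the cost a fixed-size encoding is meant to avoid.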
### Algorithm Details
FDE uses SimHash projections to partition the input space and aggregates points within each partition. This approach preserves the similarity structure while providing a compact fixed-dimensional representation. The algorithm supports both sum aggregation (for queries) and average aggregation (for documents), making it suitable for asymmetric similarity search scenarios.
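
The partition-and-aggregate idea can be illustrated with a toy encoder. This is a minimal sketch in plain Python, not the code behind the bindings: `toy_fde` and `simhash_bucket` are hypothetical names, the hyperplanes are drawn from Python's `random` module, and the output will not match `encoder.encode`. It assumes Gaussian hyperplanes and a simple bit-packed bucket index.

```python
import random
from typing import List, Sequence


def simhash_bucket(point: Sequence[float], hyperplanes: Sequence[Sequence[float]]) -> int:
    """Pack the signs of <point, g> for each hyperplane g into one bucket index."""
    bucket = 0
    for g in hyperplanes:
        bit = 1 if sum(x * y for x, y in zip(point, g)) > 0 else 0
        bucket = (bucket << 1) | bit
    return bucket


def toy_fde(points, dimension, num_projections=3, num_repetitions=2,
            average=False, seed=1) -> List[float]:
    """Toy encoder: per repetition, sum (or average) the points in each SimHash bucket."""
    rng = random.Random(seed)  # same seed -> same hyperplanes for queries and documents
    num_buckets = 2 ** num_projections
    encoding: List[float] = []
    for _ in range(num_repetitions):
        hyperplanes = [[rng.gauss(0.0, 1.0) for _ in range(dimension)]
                       for _ in range(num_projections)]
        sums = [[0.0] * dimension for _ in range(num_buckets)]
        counts = [0] * num_buckets
        for p in points:
            if len(p) != dimension:
                raise ValueError("point has the wrong dimension")
            b = simhash_bucket(p, hyperplanes)
            counts[b] += 1
            for j, x in enumerate(p):
                sums[b][j] += x
        for b in range(num_buckets):
            # Empty buckets stay at zero in this sketch.
            scale = 1.0 / counts[b] if average and counts[b] else 1.0
            encoding.extend(v * scale for v in sums[b])
    return encoding


data_rng = random.Random(0)
query_points = [[data_rng.gauss(0, 1) for _ in range(4)] for _ in range(5)]
doc_points = [[data_rng.gauss(0, 1) for _ in range(4)] for _ in range(20)]

query_fde = toy_fde(query_points, dimension=4)               # sum aggregation
doc_fde = toy_fde(doc_points, dimension=4, average=True)     # average aggregation
approx_score = sum(q * d for q, d in zip(query_fde, doc_fde))
print(len(query_fde), approx_score)
```

Both encodings have length `num_repetitions * 2**num_projections * dimension`, regardless of how many points went in, and their inner product serves as the approximate similarity score.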
### Relationship to MUVERA
MUVERA (Multi-Vector Retrieval Algorithm) is Google's approach to multi-vector search that reduces it to single-vector search, and so reaches single-vector search speeds. FDE is the core component of MUVERA that enables:
1. **Compression**: Reducing multiple vectors to a single fixed-size representation
2. **Similarity Preservation**: Approximately preserving the Chamfer similarity between point sets
3. **Asymmetric Search**: Different encodings for queries (sum) and documents (average)
4. **Scalability**: Enabling billion-scale multi-vector retrieval
This implementation focuses on the FDE component, which can be used standalone or as part of a larger retrieval system.
## Features
- **Zero-copy numpy integration**: Efficient memory usage with numpy arrays
- **Flexible encoding types**: Support for both query (sum) and document (average) encodings
- **Configurable parameters**: Control dimensionality, projections, and aggregation methods
- **Modern Python packaging**: Built with `uv` and `scikit-build-core`
## Installation
### From source with uv
```bash
cd muvera-fde
uv pip install -e .
```
### Development installation
```bash
uv pip install -e ".[dev]"
```
## Quick Start
```python
import numpy as np
from muvera_fde import (
    FixedDimensionalEncodingConfig,
    FixedDimensionalEncoder,
    EncodingType,
)

# Create configuration
config = FixedDimensionalEncodingConfig(
    dimension=3,                             # Input point dimension
    num_simhash_projections=8,               # Number of SimHash projections
    num_repetitions=2,                       # Number of repetitions for robustness
    encoding_type=EncodingType.DEFAULT_SUM,  # Sum aggregation for queries
)

# Initialize encoder
encoder = FixedDimensionalEncoder(config)

# Generate random point cloud: 100 points, each 3-dimensional
points = np.random.randn(100, 3).astype(np.float32)

# Encode the point cloud
encoding = encoder.encode(points)
print(f"Encoding shape: {encoding.shape}")
print(f"Output dimension: {encoder.output_dimension}")
```
## Configuration Options
The `FixedDimensionalEncodingConfig` dataclass supports the following parameters:
- `dimension`: Dimension of input points (default: 3)
- `num_repetitions`: Number of encoding repetitions (default: 1)
- `num_simhash_projections`: Number of SimHash projections (default: 8)
- `seed`: Random seed for reproducibility (default: 1)
- `encoding_type`: `EncodingType.DEFAULT_SUM` for queries, `EncodingType.AVERAGE` for documents
- `projection_dimension`: Optional dimension to project points before encoding
- `projection_type`: `ProjectionType.DEFAULT_IDENTITY` or `ProjectionType.AMS_SKETCH`
- `fill_empty_partitions`: Whether to fill each empty partition with the input point nearest to it, instead of leaving it as zeros (default: False)
- `final_projection_dimension`: Optional final output dimension
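
These parameters determine the length of the encoding. The helper below is a hypothetical sketch (not part of `muvera_fde`) that estimates that length, assuming the layout described in the MUVERA paper: one block per repetition and per SimHash partition, each block as wide as the (optionally projected) point, with `final_projection_dimension` overriding the total when set. It also assumes `projection_dimension` only takes effect with a non-identity projection type. Compare the result against `encoder.output_dimension` rather than relying on it.

```python
from typing import Optional


def expected_fde_dimension(
    dimension: int,
    num_repetitions: int = 1,
    num_simhash_projections: int = 8,
    projection_dimension: Optional[int] = None,
    final_projection_dimension: Optional[int] = None,
) -> int:
    """Estimate the encoding length under the assumed block layout."""
    if final_projection_dimension is not None:
        # A final projection fixes the output size directly.
        return final_projection_dimension
    per_partition = projection_dimension if projection_dimension is not None else dimension
    num_partitions = 2 ** num_simhash_projections
    return num_repetitions * num_partitions * per_partition


# Quick Start settings: 2 repetitions * 2**8 partitions * 3 dimensions
print(expected_fde_dimension(dimension=3, num_repetitions=2, num_simhash_projections=8))  # 1536
```

Each extra SimHash projection doubles the number of partitions, so the encoding grows quickly without `projection_dimension` or `final_projection_dimension`.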
## Advanced Usage
### Query vs Document Encoding
```python
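# Continues from Quick Start: `encoder` expects 3-dimensional float32 points.
query_points = np.random.randn(32, 3).astype(np.float32)
document_points = np.random.randn(200, 3).astype(np.float32)
# Query encoding (sum aggregation)
query_encoding = encoder.encode_query(query_points)
# Document encoding (average aggregation)
doc_encoding = encoder.encode_document(document_points)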
```
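
Once queries and documents are encoded, scoring reduces to an inner product between fixed-size vectors. The sketch below shows a brute-force ranking in plain Python; `rank_by_inner_product` is a hypothetical helper, and the short hand-written lists stand in for real encodings. A production system would typically hand the document encodings to a maximum inner product search index instead.

```python
from typing import Dict, List, Sequence, Tuple


def rank_by_inner_product(
    query_encoding: Sequence[float],
    doc_encodings: Dict[str, Sequence[float]],
    top_k: int = 3,
) -> List[Tuple[str, float]]:
    """Score every document by <query, document> and return the top_k best."""
    scores = {}
    for doc_id, enc in doc_encodings.items():
        if len(enc) != len(query_encoding):
            raise ValueError(f"encoding length mismatch for {doc_id}")
        scores[doc_id] = sum(q * d for q, d in zip(query_encoding, enc))
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]


# Placeholder vectors standing in for query (sum) and document (average) encodings.
query_encoding = [1.0, 0.0, 2.0, 0.0]
doc_encodings = {
    "doc-a": [0.5, 0.0, 0.5, 0.0],
    "doc-b": [0.0, 1.0, 0.0, 1.0],
    "doc-c": [1.0, 0.0, 1.0, 0.0],
}
ranking = rank_by_inner_product(query_encoding, doc_encodings, top_k=2)
print(ranking)  # [('doc-c', 3.0), ('doc-a', 1.5)]
```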
### Dimensionality Reduction
```python
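# Assumes ProjectionType is exported from muvera_fde alongside the config class.
from muvera_fde import FixedDimensionalEncodingConfig, ProjectionType

# Use AMS sketch projection: reduce each 128-d point to 32-d before
# partitioning, then project the full encoding down to 256 dimensions.
config = FixedDimensionalEncodingConfig(
    dimension=128,
    projection_dimension=32,
    projection_type=ProjectionType.AMS_SKETCH,
    final_projection_dimension=256,
)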
```
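
To give a feel for what a sketching projection does to each point, the example below applies the dense random-sign projection described in the MUVERA paper: a matrix with entries ±1, scaled by 1/sqrt(output dimension). It is a plain-Python illustration with hypothetical helper names; this README does not confirm that `ProjectionType.AMS_SKETCH` builds its matrix the same way.

```python
import math
import random
from typing import List, Sequence


def random_sign_matrix(in_dim: int, out_dim: int, seed: int = 1) -> List[List[float]]:
    """out_dim x in_dim matrix with entries +/- 1/sqrt(out_dim)."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(out_dim)
    return [[scale * rng.choice((-1.0, 1.0)) for _ in range(in_dim)]
            for _ in range(out_dim)]


def project(point: Sequence[float], matrix: Sequence[Sequence[float]]) -> List[float]:
    """Multiply the matrix by the point, reducing it to len(matrix) dimensions."""
    return [sum(w * x for w, x in zip(row, point)) for row in matrix]


matrix = random_sign_matrix(in_dim=128, out_dim=32)
data_rng = random.Random(0)
point = [data_rng.gauss(0.0, 1.0) for _ in range(128)]
reduced = project(point, matrix)
print(len(point), "->", len(reduced))  # 128 -> 32
```

With this scaling, inner products between projected points equal the original inner products in expectation, which is why the projection can shrink each partition block without biasing the similarity estimate.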
## Building from Source
The project uses CMake and scikit-build-core for building the C++ extension. Eigen is automatically downloaded during the build process.
```bash
# Clean build
uv pip install --force-reinstall -e .
# Run tests
uv run pytest
```
## Citation
If you use this library in your research, please cite the MUVERA paper:
```bibtex
@article{muvera2024,
title={MUVERA: Multi-Vector Retrieval via Fixed Dimensional Encodings},
author={Google Research Team},
journal={Google Research Blog},
year={2024},
url={https://research.google/blog/muvera-making-multi-vector-retrieval-as-fast-as-single-vector-search/}
}
```
## License
This project is licensed under the Apache License 2.0, the same license as the original Google graph-mining project. See the LICENSE file for details.
## Acknowledgments
This library provides Python bindings for the Fixed Dimensional Encoding implementation from:
- [Google's graph-mining project](https://github.com/google/graph-mining/tree/main/sketching/point_cloud)
- [MUVERA: Multi-Vector Retrieval via Fixed Dimensional Encodings](https://research.google/blog/muvera-making-multi-vector-retrieval-as-fast-as-single-vector-search/)
The core C++ implementation is adapted from Google's original code with modifications for Python integration.
Raw data
{
"_id": null,
"home_page": null,
"name": "muvera-fde",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.11",
"maintainer_email": null,
"keywords": "point-cloud, encoding, muvera, graph-mining, similarity-search, vector-search",
"author": null,
"author_email": "Yasyf Mohamedali <yasyfm@gmail.com>",
"download_url": "https://files.pythonhosted.org/packages/46/75/0af76f98378b96697df226d5777d81817339cfd2a637e8cfb90b1715c21b/muvera_fde-0.1.0.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "Apache-2.0",
"summary": "Python bindings for Google's graph-mining Fixed Dimensional Encoding (FDE) from MUVERA",
"version": "0.1.0",
"project_urls": {
"Original Implementation": "https://github.com/google/graph-mining/tree/main/sketching/point_cloud",
"Research Paper": "https://research.google/blog/muvera-making-multi-vector-retrieval-as-fast-as-single-vector-search/"
},
"split_keywords": [
"point-cloud",
" encoding",
" muvera",
" graph-mining",
" similarity-search",
" vector-search"
],
"urls": [
{
"comment_text": null,
"digests": {
"blake2b_256": "46750af76f98378b96697df226d5777d81817339cfd2a637e8cfb90b1715c21b",
"md5": "6c876814f146fd5b4753f0c05aea53e5",
"sha256": "8bedbbb70d5332de53cba91d835315d480008355fa189b4b885dc2d625d0b50d"
},
"downloads": -1,
"filename": "muvera_fde-0.1.0.tar.gz",
"has_sig": false,
"md5_digest": "6c876814f146fd5b4753f0c05aea53e5",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.11",
"size": 120485,
"upload_time": "2025-07-27T09:30:59",
"upload_time_iso_8601": "2025-07-27T09:30:59.016757Z",
"url": "https://files.pythonhosted.org/packages/46/75/0af76f98378b96697df226d5777d81817339cfd2a637e8cfb90b1715c21b/muvera_fde-0.1.0.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2025-07-27 09:30:59",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "google",
"github_project": "graph-mining",
"travis_ci": false,
"coveralls": false,
"github_actions": false,
"lcname": "muvera-fde"
}