muvera-fde

Name	muvera-fde
Version	0.1.0
home_page	None
Summary	Python bindings for Google's graph-mining Fixed Dimensional Encoding (FDE) from MUVERA
upload_time	2025-07-27 09:30:59
maintainer	None
docs_url	None
author	None
requires_python	>=3.11
license	Apache-2.0
keywords	point-cloud encoding muvera graph-mining similarity-search vector-search
requirements	No requirements were recorded.

# MUVERA FDE

Python bindings for Google's Fixed Dimensional Encoding (FDE) algorithm from the [graph-mining](https://github.com/google/graph-mining/tree/main/sketching/point_cloud) project, as described in the [MUVERA paper](https://research.google/blog/muvera-making-multi-vector-retrieval-as-fast-as-single-vector-search/).

## Overview

This library provides Python bindings for Google's Fixed Dimensional Encoding (FDE) algorithm, a key component of MUVERA (Multi-Vector Retrieval Algorithm). FDE encodes a variable-sized set of vectors into a single fixed-dimensional vector, such that the inner product between a query encoding and a document encoding approximates the similarity between the original sets.

### Key Applications

- **Multi-vector retrieval**: Efficiently search collections of multi-vector data (e.g., video segments, document chunks)
- **Chamfer similarity approximation**: Approximate the Chamfer similarity between point clouds
- **Fixed-size embeddings**: Convert variable-length sequences into fixed-size vectors for ML models
- **Efficient similarity search**: Enable fast nearest-neighbor search on multi-vector data
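
For reference, the Chamfer similarity mentioned above sums, over every query vector, its best inner product with any document vector. A minimal plain-Python sketch follows; the function name `chamfer_similarity` is illustrative and is not part of this package.

```python
def chamfer_similarity(query, document):
    """Sum over query vectors of the best inner product with any document vector."""
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    return sum(max(dot(q, d) for d in document) for q in query)


# Two tiny hand-written point sets (illustrative values only).
query = [[1.0, 0.0], [0.0, 1.0]]
document = [[0.5, 0.5], [1.0, 0.0], [0.0, -1.0]]

score = chamfer_similarity(query, document)
print(score)  # 1.0 (first query vector) + 0.5 (second query vector) = 1.5
```

Computing this exactly costs one inner product per query/document vector pair, which is the cost FDE is meant to avoid at search time.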

### Algorithm Details

FDE uses SimHash projections (the signs of random hyperplane projections) to partition the input space into `2^num_simhash_projections` partitions and aggregates the points that fall within each partition. Repeating this with independent projections and concatenating the results yields a compact fixed-dimensional representation that preserves the similarity structure. The algorithm supports both sum aggregation (for queries) and average aggregation (for documents), making it suitable for asymmetric similarity search scenarios.
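
The partition-and-aggregate idea can be sketched in plain Python, independent of the compiled extension. This is a toy illustration under simplifying assumptions, not the library's implementation: the names `simhash_partition` and `fde_sketch` are hypothetical, and details such as how partition indices are ordered or how empty partitions are handled may differ in the real code.

```python
import random


def simhash_partition(point, hyperplanes):
    """Map a point to a partition index using the sign of each projection."""
    index = 0
    for plane in hyperplanes:
        bit = sum(p * w for p, w in zip(point, plane)) > 0
        index = (index << 1) | int(bit)
    return index


def fde_sketch(points, dimension, num_projections, num_repetitions=1,
               average=False, seed=1):
    """Toy FDE: per repetition, aggregate points by SimHash partition."""
    rng = random.Random(seed)
    num_partitions = 2 ** num_projections
    encoding = []
    for _ in range(num_repetitions):
        # Fresh Gaussian hyperplanes for every repetition.
        hyperplanes = [[rng.gauss(0.0, 1.0) for _ in range(dimension)]
                       for _ in range(num_projections)]
        sums = [[0.0] * dimension for _ in range(num_partitions)]
        counts = [0] * num_partitions
        for point in points:
            k = simhash_partition(point, hyperplanes)
            counts[k] += 1
            for j, value in enumerate(point):
                sums[k][j] += value
        for k in range(num_partitions):
            # Documents use the partition mean; queries keep the raw sum.
            scale = 1.0 / counts[k] if average and counts[k] else 1.0
            encoding.extend(value * scale for value in sums[k])
    return encoding


rng = random.Random(0)
cloud = [[rng.gauss(0.0, 1.0) for _ in range(3)] for _ in range(100)]
query_fde = fde_sketch(cloud, dimension=3, num_projections=4, num_repetitions=2)
doc_fde = fde_sketch(cloud, dimension=3, num_projections=4, num_repetitions=2,
                     average=True)
print(len(query_fde), len(doc_fde))  # both 2 * 2**4 * 3 = 96
```

Both calls use the same seed, so query and document encodings share the same hyperplanes; without that, their inner product would not be meaningful.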

### Relationship to MUVERA

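MUVERA (Multi-Vector Retrieval Algorithm) is Google's approach to efficient multi-vector search: it reduces multi-vector retrieval to single-vector maximum inner product search, so existing single-vector search infrastructure can be reused. FDE is the core component of MUVERA that enables: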

1. **Compression**: Reducing multiple vectors to a single fixed-size representation
2. **Similarity Preservation**: Approximating the Chamfer similarity between point sets
3. **Asymmetric Search**: Different encodings for queries (sum) and documents (average)
4. **Scalability**: Enabling billion-scale multi-vector retrieval

This implementation focuses on the FDE component, which can be used standalone or as part of a larger retrieval system.

## Features

- **Zero-copy NumPy integration**: Efficient memory usage with NumPy arrays
- **Flexible encoding types**: Support for both query (sum) and document (average) encodings
- **Configurable parameters**: Control dimensionality, projections, and aggregation methods
- **Modern Python packaging**: Built with `uv` and `scikit-build-core`

## Installation

### From source with uv

```bash
cd muvera-fde
uv pip install -e .
```

### Development installation

```bash
uv pip install -e ".[dev]"
```

## Quick Start

```python
import numpy as np
from muvera_fde import (
    FixedDimensionalEncodingConfig,
    FixedDimensionalEncoder,
    EncodingType
)

# Create configuration
config = FixedDimensionalEncodingConfig(
    dimension=3,                    # Input point dimension
    num_simhash_projections=8,      # Number of SimHash projections
    num_repetitions=2,              # Number of repetitions for robustness
    encoding_type=EncodingType.DEFAULT_SUM  # Sum aggregation for queries
)

# Initialize encoder
encoder = FixedDimensionalEncoder(config)

# Generate random point cloud
points = np.random.randn(100, 3).astype(np.float32)

# Encode the point cloud
encoding = encoder.encode(points)
print(f"Encoding shape: {encoding.shape}")
print(f"Output dimension: {encoder.output_dimension}")
```

## Configuration Options

The `FixedDimensionalEncodingConfig` dataclass supports the following parameters:

- `dimension`: Dimension of input points (default: 3)
- `num_repetitions`: Number of encoding repetitions (default: 1)
- `num_simhash_projections`: Number of SimHash projections (default: 8)
- `seed`: Random seed for reproducibility (default: 1)
- `encoding_type`: `EncodingType.DEFAULT_SUM` for queries, `EncodingType.AVERAGE` for documents
- `projection_dimension`: Optional dimension to project points before encoding
- `projection_type`: `ProjectionType.DEFAULT_IDENTITY` or `ProjectionType.AMS_SKETCH`
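- `fill_empty_partitions`: Whether to fill partitions that receive no points instead of leaving them as zeros (default: False). In the MUVERA paper, an empty partition is filled using the input point nearest to it.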
- `final_projection_dimension`: Optional final output dimension
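
These parameters determine the length of the encoding. As a rough guide, and assuming the binding follows the sizing described for FDE in the MUVERA paper (repetitions × partitions × per-point dimension, unless a final projection is applied), the expected length can be estimated as below. The helper `expected_output_dimension` is hypothetical; `encoder.output_dimension` remains the authoritative value.

```python
def expected_output_dimension(dimension, num_repetitions=1,
                              num_simhash_projections=8,
                              projection_dimension=None,
                              final_projection_dimension=None):
    """Estimate FDE length as repetitions * 2**projections * per-point dim."""
    if final_projection_dimension is not None:
        # A final projection fixes the output length directly.
        return final_projection_dimension
    point_dim = dimension if projection_dimension is None else projection_dimension
    return num_repetitions * (2 ** num_simhash_projections) * point_dim


print(expected_output_dimension(3, num_repetitions=2))          # 2 * 256 * 3 = 1536
print(expected_output_dimension(128, projection_dimension=32))  # 1 * 256 * 32 = 8192
print(expected_output_dimension(128, projection_dimension=32,
                                final_projection_dimension=256))  # 256
```

Each extra SimHash projection doubles the encoding length, so `num_simhash_projections` is the parameter to watch when memory is a concern.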

## Advanced Usage

### Query vs Document Encoding

```python
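# Reuses `np` and `encoder` from the Quick Start; the shapes are illustrative.
query_points = np.random.randn(32, 3).astype(np.float32)
document_points = np.random.randn(200, 3).astype(np.float32)

# Query encoding (sum aggregation)
query_encoding = encoder.encode_query(query_points)

# Document encoding (average aggregation)
doc_encoding = encoder.encode_document(document_points)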
```
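
Once queries and documents are encoded, candidates can be ranked by the inner product of the two encodings, which is how MUVERA turns multi-vector retrieval into single-vector search. The sketch below uses short hand-written lists as stand-ins for real encodings; the values and the `rank_by_inner_product` helper are hypothetical, and in practice the vectors would come from `encode_query` and `encode_document`.

```python
def rank_by_inner_product(query_encoding, doc_encodings):
    """Return document ids sorted by descending inner product with the query."""
    scores = {
        doc_id: sum(q * d for q, d in zip(query_encoding, encoding))
        for doc_id, encoding in doc_encodings.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


# Stand-in encodings (illustrative values, not produced by the library).
query_encoding = [1.0, 0.0, 2.0, 0.0]
doc_encodings = {
    "doc_a": [0.5, 0.0, 0.1, 0.0],   # score 0.7
    "doc_b": [0.2, 0.3, 0.9, 0.0],   # score 2.0
    "doc_c": [0.0, 1.0, 0.0, 1.0],   # score 0.0
}

ranking = rank_by_inner_product(query_encoding, doc_encodings)
print(ranking)  # ['doc_b', 'doc_a', 'doc_c']
```

For large collections the exhaustive loop would be replaced by an approximate maximum inner product search index over the document encodings.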

### Dimensionality Reduction

```python
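# ProjectionType is assumed to be exported alongside the config class.
from muvera_fde import FixedDimensionalEncodingConfig, ProjectionType

# Project 128-d points down to 32-d with an AMS sketch before encoding,
# and request a 256-dimensional final output.
config = FixedDimensionalEncodingConfig(
    dimension=128,
    projection_dimension=32,
    projection_type=ProjectionType.AMS_SKETCH,
    final_projection_dimension=256
)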
```
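
To give a feel for what a sketch-based projection does, here is a generic count-sketch style reduction in plain Python: each input coordinate is added, with a random sign, to one randomly chosen output coordinate. Whether this matches the exact construction behind `ProjectionType.AMS_SKETCH` in this binding is an assumption, and the function name `ams_style_projection` is hypothetical.

```python
import random


def ams_style_projection(point, projection_dimension, seed=1):
    """Hash each input coordinate to one output bucket with a random sign."""
    rng = random.Random(seed)
    projected = [0.0] * projection_dimension
    for value in point:
        bucket = rng.randrange(projection_dimension)
        sign = 1.0 if rng.random() < 0.5 else -1.0
        projected[bucket] += sign * value
    return projected


rng = random.Random(0)
point = [rng.gauss(0.0, 1.0) for _ in range(128)]
reduced = ams_style_projection(point, projection_dimension=32)
print(len(reduced))  # 32
```

Because the seed fixes the bucket and sign for every coordinate, all points are projected with the same linear map, which is what keeps inner products between projected points comparable.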

## Building from Source

The project uses CMake and scikit-build-core for building the C++ extension. Eigen is automatically downloaded during the build process.

```bash
# Clean build
uv pip install --force-reinstall -e .

# Run tests
uv run pytest
```

## Citation

If you use this library in your research, please cite the MUVERA paper:

```bibtex
@article{muvera2024,
  title={MUVERA: Multi-Vector Retrieval via Fixed Dimensional Encodings},
  author={Dhulipala, Laxman and Hadian, Majid and Jayaram, Rajesh and Lee, Jason and Mirrokni, Vahab},
  journal={arXiv preprint arXiv:2405.19504},
  year={2024},
  url={https://research.google/blog/muvera-making-multi-vector-retrieval-as-fast-as-single-vector-search/}
}
```

## License

This project is licensed under the Apache License 2.0, the same license as the original Google graph-mining project. See the LICENSE file for details.

## Acknowledgments

This library provides Python bindings for the Fixed Dimensional Encoding implementation from:
- [Google's graph-mining project](https://github.com/google/graph-mining/tree/main/sketching/point_cloud)
- [MUVERA: Multi-Vector Retrieval via Fixed Dimensional Encodings](https://research.google/blog/muvera-making-multi-vector-retrieval-as-fast-as-single-vector-search/)

The core C++ implementation is adapted from Google's original code with modifications for Python integration.

Raw data

            {
    "_id": null,
    "home_page": null,
    "name": "muvera-fde",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.11",
    "maintainer_email": null,
    "keywords": "point-cloud, encoding, muvera, graph-mining, similarity-search, vector-search",
    "author": null,
    "author_email": "Yasyf Mohamedali <yasyfm@gmail.com>",
    "download_url": "https://files.pythonhosted.org/packages/46/75/0af76f98378b96697df226d5777d81817339cfd2a637e8cfb90b1715c21b/muvera_fde-0.1.0.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "Apache-2.0",
    "summary": "Python bindings for Google's graph-mining Fixed Dimensional Encoding (FDE) from MUVERA",
    "version": "0.1.0",
    "project_urls": {
        "Original Implementation": "https://github.com/google/graph-mining/tree/main/sketching/point_cloud",
        "Research Paper": "https://research.google/blog/muvera-making-multi-vector-retrieval-as-fast-as-single-vector-search/"
    },
    "split_keywords": [
        "point-cloud",
        " encoding",
        " muvera",
        " graph-mining",
        " similarity-search",
        " vector-search"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "46750af76f98378b96697df226d5777d81817339cfd2a637e8cfb90b1715c21b",
                "md5": "6c876814f146fd5b4753f0c05aea53e5",
                "sha256": "8bedbbb70d5332de53cba91d835315d480008355fa189b4b885dc2d625d0b50d"
            },
            "downloads": -1,
            "filename": "muvera_fde-0.1.0.tar.gz",
            "has_sig": false,
            "md5_digest": "6c876814f146fd5b4753f0c05aea53e5",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.11",
            "size": 120485,
            "upload_time": "2025-07-27T09:30:59",
            "upload_time_iso_8601": "2025-07-27T09:30:59.016757Z",
            "url": "https://files.pythonhosted.org/packages/46/75/0af76f98378b96697df226d5777d81817339cfd2a637e8cfb90b1715c21b/muvera_fde-0.1.0.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-07-27 09:30:59",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "google",
    "github_project": "graph-mining",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "lcname": "muvera-fde"
}
