faiss-vector-aggregator

- **Name:** faiss-vector-aggregator
- **Version:** 0.3.0
- **Summary:** A package to aggregate embeddings in a Faiss vector store based on metadata columns.
- **Home page:** https://github.com/vinerya/faiss_vector_aggregator
- **Author:** Moudather Chelbi
- **Requires Python:** >=3.8
- **License:** MIT
- **Upload time:** 2024-09-15 22:32:31

# Faiss Embeddings Aggregation Library

This Python library provides a suite of advanced methods for aggregating multiple embeddings associated with a single document or entity into a single representative embedding. It supports a wide range of aggregation techniques, from simple averaging to sophisticated methods like PCA and Attentive Pooling.

## Table of Contents

- [Features](#features)
- [Installation](#installation)
- [Usage](#usage)
  - [Example 1: Simple Average Aggregation](#example-1-simple-average-aggregation)
  - [Example 2: Weighted Average Aggregation](#example-2-weighted-average-aggregation)
  - [Example 3: Principal Component Analysis (PCA) Aggregation](#example-3-principal-component-analysis-pca-aggregation)
  - [Example 4: Centroid Aggregation (K-Means)](#example-4-centroid-aggregation-k-means)
  - [Example 5: Attentive Pooling Aggregation](#example-5-attentive-pooling-aggregation)
- [Aggregation Methods](#aggregation-methods)
- [Parameters](#parameters)
- [Dependencies](#dependencies)
- [Contributing](#contributing)
- [License](#license)

## Features

- **Simple Average**: Compute the arithmetic mean of embeddings.
- **Weighted Average**: Compute a weighted average of embeddings.
- **Geometric Mean**: Compute the geometric mean across embeddings (for positive values).
- **Harmonic Mean**: Compute the harmonic mean across embeddings (for positive values).
- **Centroid (K-Means)**: Use K-Means clustering to find the centroid of the embeddings.
- **Principal Component Analysis (PCA)**: Use PCA to reduce embeddings to a single representative vector.
- **Median**: Compute the element-wise median of embeddings.
- **Trimmed Mean**: Compute the mean after trimming outliers.
- **Max-Pooling**: Take the maximum value for each dimension across embeddings.
- **Min-Pooling**: Take the minimum value for each dimension across embeddings.
- **Entropy-Weighted Average**: Weight embeddings by their entropy (information content).
- **Attentive Pooling**: Use an attention mechanism to learn the weights for combining embeddings.
- **Tukey's Biweight**: A robust method to down-weight outliers.
- **Exemplar**: Select the embedding that best represents the group by minimizing average distance.

## Installation

To install the package, you can use pip:

```bash
pip install faiss_vector_aggregator
```

## Usage

Below are examples demonstrating how to use the library to aggregate embeddings using different methods.

### Example 1: Simple Average Aggregation

Suppose you have a collection of embeddings stored in a FAISS index, and you want to aggregate them by their associated document IDs using simple averaging.

```python
from faiss_vector_aggregator import aggregate_embeddings

# Aggregate embeddings using simple averaging
aggregate_embeddings(
    input_folder="data/input",
    column_name="id",
    output_folder="data/output",
    method="average"
)
```

- **Parameters:**
  - `input_folder`: Path to the folder containing the input FAISS index and metadata.
  - `column_name`: The metadata field by which to aggregate embeddings (e.g., `'id'`).
  - `output_folder`: Path where the output FAISS index and metadata will be saved.
  - `method="average"`: Specifies the aggregation method.
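
Conceptually, averaging by a metadata column means grouping vectors that share an id and taking the per-group mean. The sketch below illustrates that idea in plain NumPy; it is not the library's internals, and the data and names are purely illustrative:

```python
import numpy as np

# Three embeddings, two of which belong to document "a"
embeddings = np.array([
    [1.0, 2.0],  # doc "a"
    [3.0, 4.0],  # doc "a"
    [5.0, 6.0],  # doc "b"
])
ids = ["a", "a", "b"]

# Group rows by id and average each group
aggregated = {}
for doc_id in set(ids):
    rows = [e for e, i in zip(embeddings, ids) if i == doc_id]
    aggregated[doc_id] = np.mean(rows, axis=0)

print(aggregated["a"])  # [2. 3.]
```

The library performs the equivalent grouping over the FAISS index using the chosen metadata column, then writes one vector per group to the output index.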

### Example 2: Weighted Average Aggregation

If you have different weights for the embeddings, you can apply a weighted average to give more importance to certain embeddings.

```python
from faiss_vector_aggregator import aggregate_embeddings

# Example weights for the embeddings
weights = [0.1, 0.3, 0.6]

# Aggregate embeddings using weighted averaging
aggregate_embeddings(
    input_folder="data/input",
    column_name="id",
    output_folder="data/output",
    method="weighted_average",
    weights=weights
)
```

- **Parameters:**
  - `weights`: A list or array of weights corresponding to each embedding.
  - `method="weighted_average"`: Specifies the weighted average method.
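
For intuition, a weighted average over a group of vectors is a single `np.average` call; this is a minimal sketch with made-up data, not the library's implementation:

```python
import numpy as np

embeddings = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
weights = [0.1, 0.3, 0.6]  # one weight per embedding

# Per-dimension weighted mean; weights are normalized by their sum
agg = np.average(embeddings, axis=0, weights=weights)
print(agg)  # [0.7 0.9]
```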

### Example 3: Principal Component Analysis (PCA) Aggregation

To reduce high-dimensional embeddings to a single representative vector using PCA:

```python
from faiss_vector_aggregator import aggregate_embeddings

# Aggregate embeddings using PCA
aggregate_embeddings(
    input_folder="data/input",
    column_name="id",
    output_folder="data/output",
    method="pca"
)
```

- **Parameters:**
  - `method="pca"`: Specifies that PCA should be used for aggregation.
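
One common way to reduce a group of embeddings to a single vector with PCA is to take the direction of maximum variance, i.e. the first right singular vector of the centered matrix. The NumPy sketch below illustrates that formulation; the library's exact computation may differ:

```python
import numpy as np

# Five 4-dimensional embeddings belonging to one document (synthetic data)
emb = np.random.default_rng(0).normal(size=(5, 4))

centered = emb - emb.mean(axis=0)
# SVD: the first right singular vector is the first principal component
_, _, vt = np.linalg.svd(centered, full_matrices=False)
representative = vt[0]  # unit-length summary direction
```

Note that the resulting vector is unit length and its sign is arbitrary, which is worth keeping in mind when comparing it against raw embeddings.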

### Example 4: Centroid Aggregation (K-Means)

Use K-Means clustering to find the centroid of embeddings for each document ID.

```python
from faiss_vector_aggregator import aggregate_embeddings

# Aggregate embeddings using K-Means clustering to find the centroid
aggregate_embeddings(
    input_folder="data/input",
    column_name="id",
    output_folder="data/output",
    method="centroid"
)
```

- **Parameters:**
  - `method="centroid"`: Specifies that K-Means clustering should be used.
  - For a single group, the K-Means centroid with one cluster coincides with the mean; the method differs from `average` when clustering structure is present.

### Example 5: Attentive Pooling Aggregation

To use an attention mechanism for aggregating embeddings:

```python
from faiss_vector_aggregator import aggregate_embeddings

# Aggregate embeddings using Attentive Pooling
aggregate_embeddings(
    input_folder="data/input",
    column_name="id",
    output_folder="data/output",
    method="attentive_pooling"
)
```

- **Parameters:**
  - `method="attentive_pooling"`: Specifies the attentive pooling method.
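
A common similarity-based formulation of attentive pooling scores each vector against a query (here, the group mean), turns the scores into softmax weights, and takes the weighted sum. This sketch is one such formulation for intuition; the library's exact attention mechanism may differ:

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())  # subtract max for numerical stability
    return z / z.sum()

emb = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])

query = emb.mean(axis=0)   # use the group mean as the query
scores = emb @ query       # similarity of each embedding to the query
attn = softmax(scores)     # attention weights, sum to 1
pooled = attn @ emb        # attention-weighted combination
```

Vectors most similar to the group consensus receive the largest weights, so outliers contribute less than they would under a plain average.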

## Aggregation Methods

Below is a detailed description of each aggregation method supported by the library:

- **average**: Compute the arithmetic mean of embeddings.
- **weighted_average**: Compute a weighted average of embeddings. Requires `weights`.
- **geometric_mean**: Compute the geometric mean across embeddings. Only for positive values.
- **harmonic_mean**: Compute the harmonic mean across embeddings. Only for positive values.
- **median**: Compute the element-wise median of embeddings.
- **trimmed_mean**: Compute the mean after trimming a percentage of outliers. Use the `trim_percentage` parameter.
- **centroid**: Use K-Means clustering to find the centroid of the embeddings.
- **pca**: Use Principal Component Analysis to project embeddings onto the first principal component.
- **exemplar**: Select the embedding that minimizes the average cosine distance to others.
- **max_pooling**: Take the maximum value for each dimension across embeddings.
- **min_pooling**: Take the minimum value for each dimension across embeddings.
- **entropy_weighted_average**: Weight embeddings by their entropy (information content).
- **attentive_pooling**: Use an attention mechanism based on similarity to aggregate embeddings.
- **tukeys_biweight**: A robust method to down-weight outliers in the embeddings.
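
To make the robust methods concrete, here is an illustrative NumPy version of a per-dimension trimmed mean (not the library's implementation): sort each dimension, drop the lowest and highest `trim_percentage` fraction of values, and average what remains.

```python
import numpy as np

def trimmed_mean(emb, trim_percentage=0.2):
    """Per-dimension mean after dropping the lowest and highest
    `trim_percentage` fraction of values (illustrative sketch)."""
    emb = np.sort(emb, axis=0)              # sort each dimension independently
    k = int(len(emb) * trim_percentage)     # rows to drop from each end
    return emb[k:len(emb) - k].mean(axis=0)

emb = np.array([[0.0], [1.0], [2.0], [3.0], [100.0]])  # one extreme outlier
print(trimmed_mean(emb))  # [2.] -- the outlier is trimmed away
```

With a plain mean the outlier would pull the result to 21.2; trimming one value from each end leaves the mean of `[1, 2, 3]`.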

## Parameters

- `input_folder` (str): Path to the folder containing the input FAISS index (`index.faiss`) and metadata (`index.pkl`).
- `column_name` (str): The metadata field by which to aggregate embeddings (e.g., `'id'`).
- `output_folder` (str): Path where the output FAISS index and metadata will be saved.
- `method` (str): The aggregation method to use. Options include:
  - `'average'`, `'weighted_average'`, `'geometric_mean'`, `'harmonic_mean'`, `'centroid'`, `'pca'`, `'median'`, `'trimmed_mean'`, `'max_pooling'`, `'min_pooling'`, `'entropy_weighted_average'`, `'attentive_pooling'`, `'tukeys_biweight'`, `'exemplar'`.
- `weights` (list or np.ndarray, optional): Weights for the `weighted_average` method.
- `trim_percentage` (float, optional): Fraction to trim from each end for `trimmed_mean`. Must be in the range `[0, 0.5)`.

## Dependencies

Ensure you have the following packages installed:

- **faiss**: For handling FAISS indexes.
- **numpy**: For numerical computations.
- **scipy**: For statistical functions.
- **scikit-learn**: For PCA and K-Means clustering.
- **langchain**: For handling document stores and vector stores.

You can install the dependencies using:

```bash
pip install faiss-cpu numpy scipy scikit-learn langchain
```

*Note:* Replace `faiss-cpu` with `faiss-gpu` if you prefer to use the GPU version of FAISS.

## Contributing

Contributions are welcome! Please feel free to submit a pull request or open an issue on the [GitHub repository](https://github.com/vinerya/faiss_vector_aggregator).

When contributing, please ensure that your code adheres to the following guidelines:

- Follow PEP 8 coding standards.
- Include docstrings and comments where necessary.
- Write unit tests for new features or bug fixes.
- Update the documentation to reflect changes.

## License

This project is licensed under the MIT License. See the [LICENSE](LICENSE) file for details.

## Additional Notes

- **Usage with LangChain:**
  - This library works with indexes saved by LangChain's `FAISS` vector store (e.g., via `save_local`, which writes the `index.faiss` and `index.pkl` files the library expects). Use the same embedding model and metadata schema when indexing and when aggregating.

