# Faiss Embeddings Aggregation Library
This Python library provides a suite of advanced methods for aggregating multiple embeddings associated with a single document or entity into a single representative embedding. It supports a wide range of aggregation techniques, from simple averaging to sophisticated methods like PCA and Attentive Pooling.
## Table of Contents
- [Features](#features)
- [Installation](#installation)
- [Usage](#usage)
  - [Example 1: Simple Average Aggregation](#example-1-simple-average-aggregation)
  - [Example 2: Weighted Average Aggregation](#example-2-weighted-average-aggregation)
  - [Example 3: Principal Component Analysis (PCA) Aggregation](#example-3-principal-component-analysis-pca-aggregation)
  - [Example 4: Centroid Aggregation (K-Means)](#example-4-centroid-aggregation-k-means)
  - [Example 5: Attentive Pooling Aggregation](#example-5-attentive-pooling-aggregation)
- [Aggregation Methods](#aggregation-methods)
- [Parameters](#parameters)
- [Dependencies](#dependencies)
- [Contributing](#contributing)
- [License](#license)
## Features
- **Simple Average**: Compute the arithmetic mean of embeddings.
- **Weighted Average**: Compute a weighted average of embeddings.
- **Geometric Mean**: Compute the geometric mean across embeddings (for positive values).
- **Harmonic Mean**: Compute the harmonic mean across embeddings (for positive values).
- **Centroid (K-Means)**: Use K-Means clustering to find the centroid of the embeddings.
- **Principal Component Analysis (PCA)**: Use PCA to reduce embeddings to a single representative vector.
- **Median**: Compute the element-wise median of embeddings.
- **Trimmed Mean**: Compute the mean after trimming outliers.
- **Max-Pooling**: Take the maximum value for each dimension across embeddings.
- **Min-Pooling**: Take the minimum value for each dimension across embeddings.
- **Entropy-Weighted Average**: Weight embeddings by their entropy (information content).
- **Attentive Pooling**: Use an attention mechanism to learn the weights for combining embeddings.
- **Tukey's Biweight**: A robust method to down-weight outliers.
- **Exemplar**: Select the embedding that best represents the group by minimizing average distance.
## Installation
To install the package, you can use pip:
```bash
pip install faiss_vector_aggregator
```
## Usage
Below are examples demonstrating how to use the library to aggregate embeddings using different methods.
### Example 1: Simple Average Aggregation
Suppose you have a collection of embeddings stored in a FAISS index, and you want to aggregate them by their associated document IDs using simple averaging.
```python
from faiss_vector_aggregator import aggregate_embeddings
# Aggregate embeddings using simple averaging
aggregate_embeddings(
    input_folder="data/input",
    column_name="id",
    output_folder="data/output",
    method="average"
)
```
- **Parameters:**
  - `input_folder`: Path to the folder containing the input FAISS index and metadata.
  - `column_name`: The metadata field by which to aggregate embeddings (e.g., `'id'`).
  - `output_folder`: Path where the output FAISS index and metadata will be saved.
  - `method="average"`: Specifies the aggregation method.
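Under the hood, simple averaging is just an element-wise mean over the group. A minimal NumPy sketch of the idea (toy values, not the library's internals):

```python
import numpy as np

# Two embeddings that share the same document id (made-up values).
embeddings = np.array([
    [0.2, 0.4, 0.6],
    [0.4, 0.6, 0.8],
])

# "average" collapses the group to its element-wise arithmetic mean.
representative = embeddings.mean(axis=0)
print(representative)  # [0.3 0.5 0.7]
```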
### Example 2: Weighted Average Aggregation
If you have different weights for the embeddings, you can apply a weighted average to give more importance to certain embeddings.
```python
from faiss_vector_aggregator import aggregate_embeddings
# Example weights for the embeddings
weights = [0.1, 0.3, 0.6]
# Aggregate embeddings using weighted averaging
aggregate_embeddings(
    input_folder="data/input",
    column_name="id",
    output_folder="data/output",
    method="weighted_average",
    weights=weights
)
```
- **Parameters:**
  - `weights`: A list or array of weights corresponding to each embedding.
  - `method="weighted_average"`: Specifies the weighted average method.
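For intuition, the same weighting can be reproduced with `np.average`, which normalizes by the weight sum (the values below are hypothetical):

```python
import numpy as np

embeddings = np.array([
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 1.0],
])
weights = np.array([0.1, 0.3, 0.6])

# np.average divides by weights.sum(), so the weights need not sum to 1.
representative = np.average(embeddings, axis=0, weights=weights)
print(representative)  # [0.7 0.9]
```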
### Example 3: Principal Component Analysis (PCA) Aggregation
To reduce high-dimensional embeddings to a single representative vector using PCA:
```python
from faiss_vector_aggregator import aggregate_embeddings
# Aggregate embeddings using PCA
aggregate_embeddings(
    input_folder="data/input",
    column_name="id",
    output_folder="data/output",
    method="pca"
)
```
- **Parameters:**
  - `method="pca"`: Specifies that PCA should be used for aggregation.
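Conceptually, PCA-based aggregation keeps only the direction of maximum variance in the group. A hedged scikit-learn sketch of that idea (illustrative, not necessarily the library's exact implementation):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(10, 8))  # 10 embeddings, 8 dimensions

# The first principal component is a unit vector in the original
# embedding space, pointing along the direction of maximum variance.
pca = PCA(n_components=1)
pca.fit(embeddings)
representative = pca.components_[0]
print(representative.shape)  # (8,)
```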
### Example 4: Centroid Aggregation (K-Means)
Use K-Means clustering to find the centroid of embeddings for each document ID.
```python
from faiss_vector_aggregator import aggregate_embeddings
# Aggregate embeddings using K-Means clustering to find the centroid
aggregate_embeddings(
    input_folder="data/input",
    column_name="id",
    output_folder="data/output",
    method="centroid"
)
```
- **Parameters:**
  - `method="centroid"`: Specifies that K-Means clustering should be used.
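With a single cluster per document ID, the K-Means centroid coincides with the group mean. A minimal scikit-learn sketch (synthetic data, not the library's internals):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(12, 4))

# One cluster per group: the fitted centroid is the representative vector.
kmeans = KMeans(n_clusters=1, n_init=10, random_state=0).fit(embeddings)
representative = kmeans.cluster_centers_[0]

# With k=1 the centroid reduces to the arithmetic mean.
print(np.allclose(representative, embeddings.mean(axis=0)))  # True
```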
### Example 5: Attentive Pooling Aggregation
To use an attention mechanism for aggregating embeddings:
```python
from faiss_vector_aggregator import aggregate_embeddings
# Aggregate embeddings using Attentive Pooling
aggregate_embeddings(
    input_folder="data/input",
    column_name="id",
    output_folder="data/output",
    method="attentive_pooling"
)
```
- **Parameters:**
  - `method="attentive_pooling"`: Specifies the attentive pooling method.
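A common similarity-based variant (an illustrative sketch, not necessarily the library's exact mechanism) scores each embedding against the group mean and softmax-normalizes the scores into attention weights:

```python
import numpy as np

def attentive_pool(embeddings: np.ndarray) -> np.ndarray:
    """Weight each embedding by its softmax-normalized similarity to the mean."""
    query = embeddings.mean(axis=0)
    scores = embeddings @ query      # dot-product similarity to the mean
    scores -= scores.max()           # shift for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum()
    return weights @ embeddings      # convex combination of the inputs

rng = np.random.default_rng(0)
pooled = attentive_pool(rng.normal(size=(5, 3)))
print(pooled.shape)  # (3,)
```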
## Aggregation Methods
Below is a detailed description of each aggregation method supported by the library:
- **average**: Compute the arithmetic mean of embeddings.
- **weighted_average**: Compute a weighted average of embeddings. Requires `weights`.
- **geometric_mean**: Compute the geometric mean across embeddings. Only for positive values.
- **harmonic_mean**: Compute the harmonic mean across embeddings. Only for positive values.
- **median**: Compute the element-wise median of embeddings.
- **trimmed_mean**: Compute the mean after trimming a percentage of outliers. Use `trim_percentage` parameter.
- **centroid**: Use K-Means clustering to find the centroid of the embeddings.
- **pca**: Use Principal Component Analysis to project embeddings onto the first principal component.
- **exemplar**: Select the embedding that minimizes the average cosine distance to others.
- **max_pooling**: Take the maximum value for each dimension across embeddings.
- **min_pooling**: Take the minimum value for each dimension across embeddings.
- **entropy_weighted_average**: Weight embeddings by their entropy (information content).
- **attentive_pooling**: Use an attention mechanism based on similarity to aggregate embeddings.
- **tukeys_biweight**: A robust method to down-weight outliers in the embeddings.
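The robust methods above can be sketched with SciPy and NumPy; the snippet below illustrates the ideas behind `trimmed_mean` and `exemplar` on synthetic data (an assumption-laden sketch, not the library's internals):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
embeddings = rng.normal(size=(20, 4))

# trimmed_mean: discard the top and bottom 10% of each dimension, then average.
trimmed = stats.trim_mean(embeddings, proportiontocut=0.1, axis=0)

# exemplar: keep the member with the highest mean cosine similarity to the group
# (equivalently, the lowest average cosine distance).
normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
cosine_sim = normed @ normed.T
exemplar = embeddings[cosine_sim.mean(axis=1).argmax()]

print(trimmed.shape, exemplar.shape)  # (4,) (4,)
```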
## Parameters
- `input_folder` (str): Path to the folder containing the input FAISS index (`index.faiss`) and metadata (`index.pkl`).
- `column_name` (str): The metadata field by which to aggregate embeddings (e.g., `'id'`).
- `output_folder` (str): Path where the output FAISS index and metadata will be saved.
- `method` (str): The aggregation method to use. Options include:
- `'average'`, `'weighted_average'`, `'geometric_mean'`, `'harmonic_mean'`, `'centroid'`, `'pca'`, `'median'`, `'trimmed_mean'`, `'max_pooling'`, `'min_pooling'`, `'entropy_weighted_average'`, `'attentive_pooling'`, `'tukeys_biweight'`, `'exemplar'`.
- `weights` (list or np.ndarray, optional): Weights for the `weighted_average` method.
- `trim_percentage` (float, optional): Fraction to trim from each end for `trimmed_mean`. Must be in the range `[0, 0.5)`.
## Dependencies
Ensure you have the following packages installed:
- **faiss**: For handling FAISS indexes.
- **numpy**: For numerical computations.
- **scipy**: For statistical functions.
- **scikit-learn**: For PCA and K-Means clustering.
- **langchain**: For handling document stores and vector stores.
You can install the dependencies using:
```bash
pip install faiss-cpu numpy scipy scikit-learn langchain
```
*Note:* Replace `faiss-cpu` with `faiss-gpu` if you prefer to use the GPU version of FAISS.
## Contributing
Contributions are welcome! Please feel free to submit a pull request or open an issue on the [GitHub repository](https://github.com/vinerya/faiss_vector_aggregator).
When contributing, please ensure that your code adheres to the following guidelines:
- Follow PEP 8 coding standards.
- Include docstrings and comments where necessary.
- Write unit tests for new features or bug fixes.
- Update the documentation to reflect changes.
## License
This project is licensed under the MIT License. See the [LICENSE](LICENSE) file for details.
### Additional Notes
- **Usage with LangChain:**
- This library is compatible with LangChain's `FAISS` vector store. Ensure that your embeddings and indexes are handled consistently when integrating with LangChain.