# Chroma Embeddings Aggregation Library
This Python library provides a suite of advanced methods for aggregating multiple embeddings associated with a single document or entity into a single representative embedding. It is designed to work with Chroma vector stores and is compatible with LangChain's Chroma integration.
## Table of Contents
- [Features](#features)
- [Installation](#installation)
- [Usage](#usage)
  - [Example: Simple Average Aggregation](#example-simple-average-aggregation)
- [Aggregation Methods](#aggregation-methods)
- [Parameters](#parameters)
- [Dependencies](#dependencies)
- [Contributing](#contributing)
- [License](#license)
## Features
- Multiple aggregation methods (average, weighted average, geometric mean, harmonic mean, centroid, PCA, etc.)
- Compatible with Chroma vector stores and LangChain
- Easy-to-use API for aggregating embeddings
## Installation
Install the package with pip:
```bash
pip install chroma_vector_aggregator
```
## Usage
Here's an example demonstrating how to use the library to aggregate embeddings using simple averaging:
### Example: Simple Average Aggregation
```python
from langchain_community.vectorstores import Chroma
from langchain_community.embeddings import FakeEmbeddings
from langchain.schema import Document
from chroma_vector_aggregator import aggregate_embeddings
# Create a sample Chroma collection
embeddings = FakeEmbeddings(size=10)
documents = [
Document(page_content="Test document 1", metadata={"id": "group1"}),
Document(page_content="Test document 2", metadata={"id": "group1"}),
Document(page_content="Test document 3", metadata={"id": "group2"}),
Document(page_content="Test document 4", metadata={"id": "group2"}),
]
chroma_collection = Chroma.from_documents(documents, embeddings)
# Aggregate embeddings using simple averaging
aggregated_collection = aggregate_embeddings(
chroma_collection=chroma_collection,
column_name="id",
method="average"
)
# Use the aggregated collection for similarity search
results = aggregated_collection.similarity_search("Test query", k=2)
```
## Aggregation Methods
- `average`: Compute the arithmetic mean of embeddings.
- `weighted_average`: Compute a weighted average of embeddings.
- `geometric_mean`: Compute the geometric mean across embeddings.
- `harmonic_mean`: Compute the harmonic mean across embeddings.
- `median`: Compute the element-wise median of embeddings.
- `trimmed_mean`: Compute the mean after trimming outliers.
- `centroid`: Use K-Means clustering to find the centroid of the embeddings.
- `pca`: Use Principal Component Analysis to reduce the group of embeddings to a single representative vector.
- `exemplar`: Select the embedding that best represents the group.
- `max_pooling`: Take the maximum value for each dimension across embeddings.
- `min_pooling`: Take the minimum value for each dimension across embeddings.
- `entropy_weighted_average`: Weight embeddings by their entropy.
- `attentive_pooling`: Use an attention mechanism to aggregate embeddings.
- `tukeys_biweight`: A robust method to down-weight outliers.
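To make the simpler methods above concrete, here are minimal NumPy/SciPy sketches of three of them. These are illustrative reimplementations for intuition only, not the library's actual code; each maps an `(n, d)` array of `n` embeddings to a single `d`-dimensional vector.

```python
import numpy as np
from scipy import stats

def average(embs: np.ndarray) -> np.ndarray:
    # Arithmetic mean across the n embeddings, per dimension.
    return embs.mean(axis=0)

def max_pooling(embs: np.ndarray) -> np.ndarray:
    # Maximum value for each dimension across embeddings.
    return embs.max(axis=0)

def trimmed_mean(embs: np.ndarray, trim_percentage: float = 0.1) -> np.ndarray:
    # Mean after trimming `trim_percentage` of values from each end,
    # computed independently per dimension.
    return stats.trim_mean(embs, trim_percentage, axis=0)

embs = np.array([[1.0, 2.0],
                 [3.0, 4.0],
                 [5.0, 18.0]])
print(average(embs))      # [3. 8.]
print(max_pooling(embs))  # [ 5. 18.]
```

Note how `max_pooling` keeps the outlier 18.0 while `trimmed_mean` (with a large enough trim fraction and group size) would discard it; which behavior you want depends on whether extreme dimensions carry signal or noise.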
## Parameters
- `chroma_collection`: The Chroma collection to aggregate embeddings from.
- `column_name`: The metadata field by which to aggregate embeddings (e.g., 'id').
- `method`: The aggregation method to use.
- `weights` (optional): Weights for the `weighted_average` method.
- `trim_percentage` (optional): Fraction to trim from each end for `trimmed_mean`.
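To illustrate what the `weights` parameter does for `weighted_average`, here is a hypothetical standalone sketch (not the library's implementation): the weights are normalized to sum to one and applied as a convex combination of the group's embeddings.

```python
import numpy as np

def weighted_average(embs: np.ndarray, weights) -> np.ndarray:
    # Normalize the weights, then take the weighted sum over the
    # n embeddings (rows of `embs`).
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    return w @ embs

embs = np.array([[0.0, 0.0],
                 [4.0, 8.0]])
# Weight the second embedding 3x as heavily as the first.
print(weighted_average(embs, [1, 3]))  # [3. 6.]
```

With equal weights this reduces to the plain `average` method.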
## Dependencies
- chromadb
- numpy
- scipy
- scikit-learn
- langchain
## Contributing
Contributions are welcome! Please feel free to submit a pull request.
## License
This project is licensed under the MIT License. See the [LICENSE](LICENSE) file for details.