Name | text-embedder |
Version | 0.1.2 |
home_page | None |
Summary | A unified inference library for transformer-based pre-trained multilingual embedding models |
upload_time | 2024-08-22 19:53:29 |
maintainer | None |
docs_url | None |
author | None |
requires_python | >=3.7 |
license | MIT License Copyright (c) 2024 Mohammed Faheem Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. |
keywords | transformers, embeddings, nlp, rag, pytorch, huggingface, similarity |
VCS | |
bugtrack_url | |
requirements | No requirements were recorded. |
Travis-CI | No Travis. |
coveralls test coverage | No coveralls. |
# Text Embedder
`text_embedder` is a flexible Python library for generating and managing text embeddings using pre-trained, transformer-based multilingual embedding models. It supports multiple pooling strategies, similarity functions, and quantization techniques, making it a versatile tool for NLP tasks such as embedding generation, similarity search, and clustering.
## 🚀 Features
- **Model Integration**: Wraps around 🤗 transformers to leverage state-of-the-art pre-trained embedding models.
- **Pooling Strategies**: Choose from multiple pooling methods such as CLS token, max/mean pooling, and more to tailor embeddings to your needs.
- **Flexible Similarity Metrics**: Compute similarity scores between embeddings using cosine, dot, euclidean, and manhattan metrics.
- **Quantization Support**: Reduce memory usage and improve performance by quantizing embeddings to multiple precision levels with support for **auto mixed precision quantization**.
- **Prompt Support**: Optionally include a custom prompt in embeddings for contextualized representation.
- **Configurable Options**: Tune embedding generation with options for batch size, sequence length, normalization, and more.
## 🛠 Installation
Install `text_embedder` from PyPI using pip:
```bash
pip install text_embedder
```
## 📖 Usage
### Initialization
Initialize the `TextEmbedder` with your desired configuration:
```python
from text_embedder import TextEmbedder

embedder = TextEmbedder(
    model="BAAI/bge-small-en",
    sim_fn="cosine",
    pooling_strategy=["cls"],
    device="cuda",  # specify the device if needed
)
```
### Generating Embeddings
Generate embeddings for a list of texts:
```python
embeddings = embedder.embed(["Hello world", "Transformers are amazing!"])
print(embeddings)
```
### Computing Similarity
Compute similarity between two embeddings:
```python
embedding1 = embedder.embed(["Cat jumped from a chair"])
embedding2 = embedder.embed(["Mamba architecture is better than transformers tho, ngl."])
similarity_score = embedder.get_similarity(embedding1, embedding2)
print(f"Similarity Score: {similarity_score}")
```
## Advanced Usage
### Pooling Strategies
You can choose from any of the following pooling strategies (see the sketch below the list):
- `"cls"`: Use the CLS token embedding.
- `"max"`: Take the maximum value across tokens.
- `"mean"`: Compute the mean of token embeddings.
- `"mean_sqrt_len"`: Compute the mean divided by the square root of token length.
- `"weightedmean"`: Compute a weighted mean of token embeddings.
- `"lasttoken"`: Use the last token embedding.
### Similarity Functions
Supported similarity functions (illustrated below):
- **Cosine Similarity**: Measures the cosine of the angle between two vectors.
- **Dot Product**: Measures the dot product between two vectors.
- **Euclidean Distance**: Measures the straight-line (L2) distance between two vectors.
- **Manhattan Distance**: Measures the sum of absolute differences (L1) between two vectors.
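Each of these metrics reduces to a simple vector operation. A minimal NumPy sketch, for illustration rather than the library's actual implementation:

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def dot(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b))

def euclidean(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.linalg.norm(a - b))     # L2 distance

def manhattan(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.abs(a - b).sum())       # L1 distance

# Example with two toy vectors
u, v = np.array([1.0, 0.0, 1.0]), np.array([1.0, 1.0, 0.0])
print(cosine(u, v), euclidean(u, v), manhattan(u, v))
```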
### Quantization
Embeddings can be quantized to lower precision (a rough sketch of the idea follows the list):
- **float32**: 32-bit floating-point precision.
- **float16**: 16-bit floating-point precision.
- **int8**: 8-bit integer precision.
- **uint8**: 8-bit unsigned integer precision.
- **binary**: Binary quantization.
- **ubinary**: Unsigned binary quantization.
- **2bit**: 2-bit quantization.
- **4bit**: 4-bit quantization.
- **8bit**: 8-bit quantization.
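Conceptually, the integer and binary levels map float32 embeddings onto a smaller value range. Below is a rough, library-independent NumPy sketch of int8 and binary quantization (illustrative only; `quantize_int8` and `quantize_binary` are hypothetical helpers, not the `text_embedder` API, and the library's actual calibration may differ):

```python
import numpy as np

def quantize_int8(embeddings: np.ndarray) -> np.ndarray:
    # Scale each dimension into [-127, 127] using its max magnitude across the batch.
    scale = 127.0 / np.maximum(np.abs(embeddings).max(axis=0), 1e-9)
    return np.clip(np.round(embeddings * scale), -127, 127).astype(np.int8)

def quantize_binary(embeddings: np.ndarray) -> np.ndarray:
    # Keep only the sign of each dimension and pack 8 dimensions per byte (~32x smaller than float32).
    bits = (embeddings > 0).astype(np.uint8)
    return np.packbits(bits, axis=-1)

emb = np.random.randn(4, 384).astype(np.float32)   # e.g. the output shape of a small embedding model
print(quantize_int8(emb).dtype, quantize_binary(emb).shape)
```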
### Future Work
- Additional Pooling Strategies: Implement more advanced pooling methods (e.g., attention-based), and add an `auto` option to `pooling_strategy` that selects a suitable pooling method based on the model config.
- Custom Quantization Methods: Add new quantization techniques for further improvements.
- Similarity Functions: Add more similarity metrics.
## 🤝 Contributing
Contributions are welcome! Please follow these steps to get started with your contribution:
1. Fork the repository.
2. Create a new branch (`git checkout -b feature/your-feature`).
3. Make your changes.
4. Commit your changes (`git commit -am 'Add new feature'`).
5. Push to the branch (`git push origin feature/your-feature`).
6. Create a new Pull Request.
## 📄 License
This project is licensed under the MIT License. See the [LICENSE](https://github.com/xdevfaheem/transformers_embedder/blob/main/LICENSE) file for details.
## Acknowledgement
Special thanks to the developers of the [Sentence-Transformers](https://github.com/UKPLab/sentence-transformers/tree/master) library.