![embedders](https://uploads-ssl.webflow.com/61e47fafb12bd56b40022a49/626ee1c35a3abf0ca872486d_embedder-banner.png)
[![Python 3.9](https://img.shields.io/badge/python-3.9-blue.svg)](https://www.python.org/downloads/release/python-390/)
[![pypi 0.1.8](https://img.shields.io/badge/pypi-0.1.8-red.svg)](https://pypi.org/project/embedders/0.1.8/)
# ⚗️ embedders
With `embedders`, you can easily convert your texts into sentence- or token-level embeddings within a few lines of code. Use cases include similarity search between texts, information extraction such as named entity recognition, and basic text classification.
## Prerequisites
This library uses [spaCy](https://github.com/explosion/spaCy) for tokenization; to apply it, please download the [respective language model](https://spacy.io/models) first.
## Installation
You can set up this library either by running `$ pip install embedders`, or by cloning this repository and running `$ pip install -r requirements.txt` inside the clone.
A sample installation would be:
```
$ conda create --name embedders python=3.9
$ conda activate embedders
$ pip install embedders
$ python -m spacy download en_core_web_sm
```
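To make sure the spaCy model was downloaded correctly, you can load it once before using `embedders`:

```python
import spacy

# raises an OSError if the model has not been downloaded yet
nlp = spacy.load("en_core_web_sm")
print(nlp.pipe_names)
```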
## Usage
Once you have installed the package, you can apply the embedders with a few lines of code, at either the sentence or the token level.
### Sentence embeddings
`"Wow, what a cool tool!"` is embedded to
```
[
2.453, 8.325, ..., 3.863
]
```
Currently, we provide the following sentence embedders:
| **Path** | **Name** | **Embeds documents using ...** |
| ------------------------------------ | --------------------------- | ------------------------------------------------------------ |
| embedders.classification.contextual | HuggingFaceSentenceEmbedder | large, pre-trained transformers from https://huggingface.co |
| embedders.classification.contextual | OpenAISentenceEmbedder | large, pre-trained transformers from https://openai.com |
| embedders.classification.contextual | CohereSentenceEmbedder | large, pre-trained transformers from https://cohere.com |
| embedders.classification.count_based | BagOfCharsSentenceEmbedder | plain Bag of Characters approach |
| embedders.classification.count_based | BagOfWordsSentenceEmbedder | plain Bag of Words approach |
| embedders.classification.count_based | TfidfSentenceEmbedder | Term Frequency - Inverse Document Frequency (TF-IDF) approach |
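For instance, the count-based embedders can be used in the same way as the transformer-based ones shown in the usage section below. Here is a minimal sketch; that `TfidfSentenceEmbedder` can be constructed without arguments is an assumption, and its actual parameters may differ:

```python
from embedders.classification.count_based import TfidfSentenceEmbedder

corpus = [
    "I went to Cologne in 2009",
    "My favorite number is 41",
    # ...
]

# assumption: the count-based embedders expose the same fit_transform
# interface as the transformer-based embedders shown below
embedder = TfidfSentenceEmbedder()
embeddings = embedder.fit_transform(corpus)  # one vector per text
```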
### Token embeddings
`"Wow, what a cool tool!"` is embedded to
```
[
[8.453, 1.853, ...],
[3.623, 2.023, ...],
[1.906, 9.604, ...],
[7.306, 2.325, ...],
[6.630, 1.643, ...],
[3.023, 4.974, ...]
]
```
Currently, we provide the following token embedders:
| **Path** | **Name** | **Embeds documents using ...** |
| -------------------------------- | ------------------------ | ----------------------------------------------------------- |
| embedders.extraction.contextual | TransformerTokenEmbedder | large, pre-trained transformers from https://huggingface.co |
| embedders.extraction.count_based | BagOfCharsTokenEmbedder | plain Bag of Characters approach |
You can choose the embedding category depending on the task at hand. To use one, just grab one of the available classes and apply it to your text corpus as follows (shown for sentence embeddings; a token-level sketch follows the code block):
```python
from embedders.classification.contextual import TransformerSentenceEmbedder
from embedders.classification.reduce import PCASentenceReducer
corpus = [
"I went to Cologne in 2009",
"My favorite number is 41",
# ...
]
embedder = TransformerSentenceEmbedder("bert-base-cased")
embeddings = embedder.fit_transform(corpus) # contains a list of shape [num_texts, embedding_dimension]
```
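The token-level embedders follow the same pattern. Below is a minimal sketch; the constructor arguments (a transformer configuration string plus the name of the spaCy model used for tokenization) are an assumption and may differ from the actual signature:

```python
from embedders.extraction.contextual import TransformerTokenEmbedder

# assumption: the token embedder takes a transformer configuration string
# and the spaCy model to tokenize with; reusing the corpus from above
embedder = TransformerTokenEmbedder("bert-base-cased", "en_core_web_sm")
token_embeddings = embedder.fit_transform(corpus)  # one vector per token of each text
```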
Sometimes you want to reduce the size of the embeddings you receive. To do so, you can simply wrap your embedder with a dimensionality reduction technique.
```python
# if the dimension is too large, you can also apply dimensionality reduction
reducer = PCASentenceReducer(embedder)
embeddings_reduced = reducer.fit_transform(corpus)
```
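You can verify the effect by comparing the embedding dimensionality before and after the reduction; the reduced size depends on how the reducer is configured:

```python
print(len(embeddings[0]))          # e.g. 768 for bert-base-cased
print(len(embeddings_reduced[0]))  # smaller than the original dimension
```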
Currently, we provide the following dimensionality reductions:
| **Path** | **Name** | **Description** |
| ------------------------------- | ------------------- | -------------------------------------------------------------------------------- |
| embedders.classification.reduce | PCASentenceReducer | Wraps an embedder with principal component analysis to reduce the dimensionality |
| embedders.extraction.reduce | PCATokenReducer | Wraps an embedder with principal component analysis to reduce the dimensionality |
## Pre-trained embedders
With the growing availability of large, pre-trained models such as those provided by [🤗 Hugging Face](https://huggingface.co/), embedding complex sentences across a wide variety of languages and domains has become much more accessible. If you want to use transformer models, just pass the configuration string of the respective model, which will automatically pull the correct model from the [🤗 Hugging Face Hub](https://huggingface.co/models).
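For example, switching to a multilingual model is just a matter of passing a different configuration string; `distilbert-base-multilingual-cased` below is one illustrative choice among the many models on the Hub:

```python
from embedders.classification.contextual import TransformerSentenceEmbedder

# any Hugging Face Hub configuration string works here;
# this multilingual model is just one example
embedder = TransformerSentenceEmbedder("distilbert-base-multilingual-cased")
embeddings = embedder.fit_transform(corpus)  # reusing the corpus from above
```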
## Contributing
Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are **greatly appreciated**.
If you have a suggestion that would make this better, please fork the repo and create a pull request. You can also simply open an issue with the tag "enhancement".
1. Fork the Project
2. Create your Feature Branch (`git checkout -b feature/AmazingFeature`)
3. Commit your Changes (`git commit -m 'Add some AmazingFeature'`)
4. Push to the Branch (`git push origin feature/AmazingFeature`)
5. Open a Pull Request
And please don't forget to leave a ⭐ if you like the work!
## License
Distributed under the Apache 2.0 License. See LICENSE.txt for more information.
## Contact
This library is developed and maintained by [kern.ai](https://github.com/code-kern-ai). If you want to provide us with feedback or have some questions, don't hesitate to contact us. We're super happy to help ✌️