document-transformers


Namedocument-transformers JSON
Version 2.2.2 PyPI version JSON
download
home_pagehttps://www.SBERT.net
SummaryMultilingual text embeddings
upload_time2023-10-27 00:19:19
maintainer
docs_urlNone
authorNils Reimers
requires_python>=3.6.0
licenseApache License 2.0
keywords transformer networks bert xlnet sentence embedding pytorch nlp deep learning
VCS
bugtrack_url
requirements transformers tokenizers tqdm torch torchvision numpy scikit-learn scipy nltk sentencepiece huggingface-hub
Travis-CI No Travis.
coveralls test coverage No coveralls.
            <!--- BADGES: START --->
[![GitHub - License](https://img.shields.io/github/license/UKPLab/sentence-transformers?logo=github&style=flat&color=green)][#github-license]
[![PyPI - Python Version](https://img.shields.io/pypi/pyversions/sentence-transformers?logo=pypi&style=flat&color=blue)][#pypi-package]
[![PyPI - Package Version](https://img.shields.io/pypi/v/sentence-transformers?logo=pypi&style=flat&color=orange)][#pypi-package]
[![Conda - Platform](https://img.shields.io/conda/pn/conda-forge/sentence-transformers?logo=anaconda&style=flat)][#conda-forge-package]
[![Conda (channel only)](https://img.shields.io/conda/vn/conda-forge/sentence-transformers?logo=anaconda&style=flat&color=orange)][#conda-forge-package]
[![Docs - GitHub.io](https://img.shields.io/static/v1?logo=github&style=flat&color=pink&label=docs&message=sentence-transformers)][#docs-package]
<!--- 
[![PyPI - Downloads](https://img.shields.io/pypi/dm/sentence-transformers?logo=pypi&style=flat&color=green)][#pypi-package]
[![Conda](https://img.shields.io/conda/dn/conda-forge/sentence-transformers?logo=anaconda)][#conda-forge-package] 
--->

[#github-license]: https://github.com/UKPLab/sentence-transformers/blob/master/LICENSE
[#pypi-package]: https://pypi.org/project/sentence-transformers/
[#conda-forge-package]: https://anaconda.org/conda-forge/sentence-transformers
[#docs-package]: https://www.sbert.net/
<!--- BADGES: END --->

# Sentence Transformers: Multilingual Sentence, Paragraph, and Image Embeddings using BERT & Co.

This framework provides an easy method to compute dense vector representations for **sentences**, **paragraphs**, and **images**. The models are based on transformer networks like BERT / RoBERTa / XLM-RoBERTa etc. and achieve state-of-the-art performance in various tasks. Text is embedded in vector space such that similar text are closer and can efficiently be found using cosine similarity.

We provide an increasing number of **[state-of-the-art pretrained models](https://www.sbert.net/docs/pretrained_models.html)** for more than 100 languages, fine-tuned for various use-cases.

Further, this framework allows an easy  **[fine-tuning of custom embeddings models](https://www.sbert.net/docs/training/overview.html)**, to achieve maximal performance on your specific task.

For the **full documentation**, see **[www.SBERT.net](https://www.sbert.net)**.

The following publications are integrated in this framework:

- [Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks](https://arxiv.org/abs/1908.10084) (EMNLP 2019)
- [Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation](https://arxiv.org/abs/2004.09813) (EMNLP 2020)
- [Augmented SBERT: Data Augmentation Method for Improving Bi-Encoders for Pairwise Sentence Scoring Tasks](https://arxiv.org/abs/2010.08240) (NAACL 2021)
- [The Curse of Dense Low-Dimensional Information Retrieval for Large Index Sizes](https://arxiv.org/abs/2012.14210) (arXiv 2020)
- [TSDAE: Using Transformer-based Sequential Denoising Auto-Encoder for Unsupervised Sentence Embedding Learning](https://arxiv.org/abs/2104.06979) (arXiv 2021)
- [BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models](https://arxiv.org/abs/2104.08663) (arXiv 2021)

## Installation

We recommend **Python 3.6** or higher, **[PyTorch 1.6.0](https://pytorch.org/get-started/locally/)** or higher and **[transformers v4.6.0](https://github.com/huggingface/transformers)** or higher. The code does **not** work with Python 2.7.

**Install with pip**

Install the *sentence-transformers* with `pip`:

```
pip install -U sentence-transformers
```

**Install with conda**

You can install the *sentence-transformers* with `conda`:

```
conda install -c conda-forge sentence-transformers
```

**Install from sources**

Alternatively, you can also clone the latest version from the [repository](https://github.com/UKPLab/sentence-transformers) and install it directly from the source code:

````
pip install -e .
```` 

**PyTorch with CUDA**

If you want to use a GPU / CUDA, you must install PyTorch with the matching CUDA Version. Follow
[PyTorch - Get Started](https://pytorch.org/get-started/locally/) for further details how to install PyTorch.

## Getting Started

See [Quickstart](https://www.sbert.net/docs/quickstart.html) in our documenation.

[This example](https://github.com/UKPLab/sentence-transformers/tree/master/examples/applications/computing-embeddings/computing_embeddings.py) shows you how to use an already trained Sentence Transformer model to embed sentences for another task.

First download a pretrained model.

````python
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-MiniLM-L6-v2')
````

Then provide some sentences to the model.

````python
sentences = ['This framework generates embeddings for each input sentence',
    'Sentences are passed as a list of string.', 
    'The quick brown fox jumps over the lazy dog.']
sentence_embeddings = model.encode(sentences)
````

And that's it already. We now have a list of numpy arrays with the embeddings.

````python
for sentence, embedding in zip(sentences, sentence_embeddings):
    print("Sentence:", sentence)
    print("Embedding:", embedding)
    print("")
````

## Pre-Trained Models

We provide a large list of [Pretrained Models](https://www.sbert.net/docs/pretrained_models.html) for more than 100 languages. Some models are general purpose models, while others produce embeddings for specific use cases. Pre-trained models can be loaded by just passing the model name: `SentenceTransformer('model_name')`.

[»  Full list of pretrained models](https://www.sbert.net/docs/pretrained_models.html)

## Training

This framework allows you to fine-tune your own sentence embedding methods, so that you get task-specific sentence embeddings. You have various options to choose from in order to get perfect sentence embeddings for your specific task. 

See [Training Overview](https://www.sbert.net/docs/training/overview.html) for an introduction how to train your own embedding models. We provide [various examples](https://github.com/UKPLab/sentence-transformers/tree/master/examples/training) how to train models on various datasets.

Some highlights are:
- Support of various transformer networks including BERT, RoBERTa, XLM-R, DistilBERT, Electra, BART, ...
- Multi-Lingual and multi-task learning
- Evaluation during training to find optimal model
- [10+ loss-functions](https://www.sbert.net/docs/package_reference/losses.html) allowing to tune models specifically for semantic search, paraphrase mining, semantic similarity comparison, clustering, triplet loss, contrastive loss.

## Performance

Our models are evaluated extensively on 15+ datasets including challening domains like Tweets, Reddit, emails. They achieve by far the **best performance** from all available sentence embedding methods. Further, we provide several **smaller models** that are **optimized for speed**.

[» Full list of pretrained models](https://www.sbert.net/docs/pretrained_models.html)

## Application Examples

You can use this framework for:

- [Computing Sentence Embeddings](https://www.sbert.net/examples/applications/computing-embeddings/README.html)
- [Semantic Textual Similarity](https://www.sbert.net/docs/usage/semantic_textual_similarity.html)
- [Clustering](https://www.sbert.net/examples/applications/clustering/README.html)
- [Paraphrase Mining](https://www.sbert.net/examples/applications/paraphrase-mining/README.html)
 - [Translated Sentence Mining](https://www.sbert.net/examples/applications/parallel-sentence-mining/README.html)
 - [Semantic Search](https://www.sbert.net/examples/applications/semantic-search/README.html)
 - [Retrieve & Re-Rank](https://www.sbert.net/examples/applications/retrieve_rerank/README.html) 
 - [Text Summarization](https://www.sbert.net/examples/applications/text-summarization/README.html) 
- [Multilingual Image Search, Clustering & Duplicate Detection](https://www.sbert.net/examples/applications/image-search/README.html)

and many more use-cases.

For all examples, see [examples/applications](https://github.com/UKPLab/sentence-transformers/tree/master/examples/applications).

## Citing & Authors

If you find this repository helpful, feel free to cite our publication [Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks](https://arxiv.org/abs/1908.10084):

```bibtex 
@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}
```

If you use one of the multilingual models, feel free to cite our publication [Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation](https://arxiv.org/abs/2004.09813):

```bibtex
@inproceedings{reimers-2020-multilingual-sentence-bert,
    title = "Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2020",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/2004.09813",
}
```

Please have a look at [Publications](https://www.sbert.net/docs/publications.html) for our different publications that are integrated into SentenceTransformers.

Contact person: [Nils Reimers](https://www.nils-reimers.de), [info@nils-reimers.de](mailto:info@nils-reimers.de)

https://www.ukp.tu-darmstadt.de/

Don't hesitate to send us an e-mail or report an issue, if something is broken (and it shouldn't be) or if you have further questions.

> This repository contains experimental software and is published for the sole purpose of giving additional background details on the respective publication.

            

Raw data

            {
    "_id": null,
    "home_page": "https://www.SBERT.net",
    "name": "document-transformers",
    "maintainer": "",
    "docs_url": null,
    "requires_python": ">=3.6.0",
    "maintainer_email": "",
    "keywords": "Transformer Networks BERT XLNet sentence embedding PyTorch NLP deep learning",
    "author": "Nils Reimers",
    "author_email": "info@nils-reimers.de",
    "download_url": "https://files.pythonhosted.org/packages/76/47/9f0d7bed086b9c5198523ca0ce4e2d6a14d004f96d8590bc87b757fc53dd/document-transformers-2.2.2.tar.gz",
    "platform": null,
    "description": "<!--- BADGES: START --->\n[![GitHub - License](https://img.shields.io/github/license/UKPLab/sentence-transformers?logo=github&style=flat&color=green)][#github-license]\n[![PyPI - Python Version](https://img.shields.io/pypi/pyversions/sentence-transformers?logo=pypi&style=flat&color=blue)][#pypi-package]\n[![PyPI - Package Version](https://img.shields.io/pypi/v/sentence-transformers?logo=pypi&style=flat&color=orange)][#pypi-package]\n[![Conda - Platform](https://img.shields.io/conda/pn/conda-forge/sentence-transformers?logo=anaconda&style=flat)][#conda-forge-package]\n[![Conda (channel only)](https://img.shields.io/conda/vn/conda-forge/sentence-transformers?logo=anaconda&style=flat&color=orange)][#conda-forge-package]\n[![Docs - GitHub.io](https://img.shields.io/static/v1?logo=github&style=flat&color=pink&label=docs&message=sentence-transformers)][#docs-package]\n<!--- \n[![PyPI - Downloads](https://img.shields.io/pypi/dm/sentence-transformers?logo=pypi&style=flat&color=green)][#pypi-package]\n[![Conda](https://img.shields.io/conda/dn/conda-forge/sentence-transformers?logo=anaconda)][#conda-forge-package] \n--->\n\n[#github-license]: https://github.com/UKPLab/sentence-transformers/blob/master/LICENSE\n[#pypi-package]: https://pypi.org/project/sentence-transformers/\n[#conda-forge-package]: https://anaconda.org/conda-forge/sentence-transformers\n[#docs-package]: https://www.sbert.net/\n<!--- BADGES: END --->\n\n# Sentence Transformers: Multilingual Sentence, Paragraph, and Image Embeddings using BERT & Co.\n\nThis framework provides an easy method to compute dense vector representations for **sentences**, **paragraphs**, and **images**. The models are based on transformer networks like BERT / RoBERTa / XLM-RoBERTa etc. and achieve state-of-the-art performance in various tasks. Text is embedded in vector space such that similar text are closer and can efficiently be found using cosine similarity.\n\nWe provide an increasing number of **[state-of-the-art pretrained models](https://www.sbert.net/docs/pretrained_models.html)** for more than 100 languages, fine-tuned for various use-cases.\n\nFurther, this framework allows an easy  **[fine-tuning of custom embeddings models](https://www.sbert.net/docs/training/overview.html)**, to achieve maximal performance on your specific task.\n\nFor the **full documentation**, see **[www.SBERT.net](https://www.sbert.net)**.\n\nThe following publications are integrated in this framework:\n\n- [Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks](https://arxiv.org/abs/1908.10084) (EMNLP 2019)\n- [Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation](https://arxiv.org/abs/2004.09813) (EMNLP 2020)\n- [Augmented SBERT: Data Augmentation Method for Improving Bi-Encoders for Pairwise Sentence Scoring Tasks](https://arxiv.org/abs/2010.08240) (NAACL 2021)\n- [The Curse of Dense Low-Dimensional Information Retrieval for Large Index Sizes](https://arxiv.org/abs/2012.14210) (arXiv 2020)\n- [TSDAE: Using Transformer-based Sequential Denoising Auto-Encoder for Unsupervised Sentence Embedding Learning](https://arxiv.org/abs/2104.06979) (arXiv 2021)\n- [BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models](https://arxiv.org/abs/2104.08663) (arXiv 2021)\n\n## Installation\n\nWe recommend **Python 3.6** or higher, **[PyTorch 1.6.0](https://pytorch.org/get-started/locally/)** or higher and **[transformers v4.6.0](https://github.com/huggingface/transformers)** or higher. The code does **not** work with Python 2.7.\n\n**Install with pip**\n\nInstall the *sentence-transformers* with `pip`:\n\n```\npip install -U sentence-transformers\n```\n\n**Install with conda**\n\nYou can install the *sentence-transformers* with `conda`:\n\n```\nconda install -c conda-forge sentence-transformers\n```\n\n**Install from sources**\n\nAlternatively, you can also clone the latest version from the [repository](https://github.com/UKPLab/sentence-transformers) and install it directly from the source code:\n\n````\npip install -e .\n```` \n\n**PyTorch with CUDA**\n\nIf you want to use a GPU / CUDA, you must install PyTorch with the matching CUDA Version. Follow\n[PyTorch - Get Started](https://pytorch.org/get-started/locally/) for further details how to install PyTorch.\n\n## Getting Started\n\nSee [Quickstart](https://www.sbert.net/docs/quickstart.html) in our documenation.\n\n[This example](https://github.com/UKPLab/sentence-transformers/tree/master/examples/applications/computing-embeddings/computing_embeddings.py) shows you how to use an already trained Sentence Transformer model to embed sentences for another task.\n\nFirst download a pretrained model.\n\n````python\nfrom sentence_transformers import SentenceTransformer\nmodel = SentenceTransformer('all-MiniLM-L6-v2')\n````\n\nThen provide some sentences to the model.\n\n````python\nsentences = ['This framework generates embeddings for each input sentence',\n    'Sentences are passed as a list of string.', \n    'The quick brown fox jumps over the lazy dog.']\nsentence_embeddings = model.encode(sentences)\n````\n\nAnd that's it already. We now have a list of numpy arrays with the embeddings.\n\n````python\nfor sentence, embedding in zip(sentences, sentence_embeddings):\n    print(\"Sentence:\", sentence)\n    print(\"Embedding:\", embedding)\n    print(\"\")\n````\n\n## Pre-Trained Models\n\nWe provide a large list of [Pretrained Models](https://www.sbert.net/docs/pretrained_models.html) for more than 100 languages. Some models are general purpose models, while others produce embeddings for specific use cases. Pre-trained models can be loaded by just passing the model name: `SentenceTransformer('model_name')`.\n\n[\u00bb  Full list of pretrained models](https://www.sbert.net/docs/pretrained_models.html)\n\n## Training\n\nThis framework allows you to fine-tune your own sentence embedding methods, so that you get task-specific sentence embeddings. You have various options to choose from in order to get perfect sentence embeddings for your specific task. \n\nSee [Training Overview](https://www.sbert.net/docs/training/overview.html) for an introduction how to train your own embedding models. We provide [various examples](https://github.com/UKPLab/sentence-transformers/tree/master/examples/training) how to train models on various datasets.\n\nSome highlights are:\n- Support of various transformer networks including BERT, RoBERTa, XLM-R, DistilBERT, Electra, BART, ...\n- Multi-Lingual and multi-task learning\n- Evaluation during training to find optimal model\n- [10+ loss-functions](https://www.sbert.net/docs/package_reference/losses.html) allowing to tune models specifically for semantic search, paraphrase mining, semantic similarity comparison, clustering, triplet loss, contrastive loss.\n\n## Performance\n\nOur models are evaluated extensively on 15+ datasets including challening domains like Tweets, Reddit, emails. They achieve by far the **best performance** from all available sentence embedding methods. Further, we provide several **smaller models** that are **optimized for speed**.\n\n[\u00bb Full list of pretrained models](https://www.sbert.net/docs/pretrained_models.html)\n\n## Application Examples\n\nYou can use this framework for:\n\n- [Computing Sentence Embeddings](https://www.sbert.net/examples/applications/computing-embeddings/README.html)\n- [Semantic Textual Similarity](https://www.sbert.net/docs/usage/semantic_textual_similarity.html)\n- [Clustering](https://www.sbert.net/examples/applications/clustering/README.html)\n- [Paraphrase Mining](https://www.sbert.net/examples/applications/paraphrase-mining/README.html)\n - [Translated Sentence Mining](https://www.sbert.net/examples/applications/parallel-sentence-mining/README.html)\n - [Semantic Search](https://www.sbert.net/examples/applications/semantic-search/README.html)\n - [Retrieve & Re-Rank](https://www.sbert.net/examples/applications/retrieve_rerank/README.html) \n - [Text Summarization](https://www.sbert.net/examples/applications/text-summarization/README.html) \n- [Multilingual Image Search, Clustering & Duplicate Detection](https://www.sbert.net/examples/applications/image-search/README.html)\n\nand many more use-cases.\n\nFor all examples, see [examples/applications](https://github.com/UKPLab/sentence-transformers/tree/master/examples/applications).\n\n## Citing & Authors\n\nIf you find this repository helpful, feel free to cite our publication [Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks](https://arxiv.org/abs/1908.10084):\n\n```bibtex \n@inproceedings{reimers-2019-sentence-bert,\n    title = \"Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks\",\n    author = \"Reimers, Nils and Gurevych, Iryna\",\n    booktitle = \"Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing\",\n    month = \"11\",\n    year = \"2019\",\n    publisher = \"Association for Computational Linguistics\",\n    url = \"https://arxiv.org/abs/1908.10084\",\n}\n```\n\nIf you use one of the multilingual models, feel free to cite our publication [Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation](https://arxiv.org/abs/2004.09813):\n\n```bibtex\n@inproceedings{reimers-2020-multilingual-sentence-bert,\n    title = \"Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation\",\n    author = \"Reimers, Nils and Gurevych, Iryna\",\n    booktitle = \"Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing\",\n    month = \"11\",\n    year = \"2020\",\n    publisher = \"Association for Computational Linguistics\",\n    url = \"https://arxiv.org/abs/2004.09813\",\n}\n```\n\nPlease have a look at [Publications](https://www.sbert.net/docs/publications.html) for our different publications that are integrated into SentenceTransformers.\n\nContact person: [Nils Reimers](https://www.nils-reimers.de), [info@nils-reimers.de](mailto:info@nils-reimers.de)\n\nhttps://www.ukp.tu-darmstadt.de/\n\nDon't hesitate to send us an e-mail or report an issue, if something is broken (and it shouldn't be) or if you have further questions.\n\n> This repository contains experimental software and is published for the sole purpose of giving additional background details on the respective publication.\n",
    "bugtrack_url": null,
    "license": "Apache License 2.0",
    "summary": "Multilingual text embeddings",
    "version": "2.2.2",
    "project_urls": {
        "Download": "https://github.com/UKPLab/sentence-transformers/",
        "Homepage": "https://www.SBERT.net"
    },
    "split_keywords": [
        "transformer",
        "networks",
        "bert",
        "xlnet",
        "sentence",
        "embedding",
        "pytorch",
        "nlp",
        "deep",
        "learning"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "76479f0d7bed086b9c5198523ca0ce4e2d6a14d004f96d8590bc87b757fc53dd",
                "md5": "fc35f7654a38a50b4285c4377119f29c",
                "sha256": "747495b3f3dccd8541c4df81bd525b581b6205eec7ca6aa9b2171cfce2bdd417"
            },
            "downloads": -1,
            "filename": "document-transformers-2.2.2.tar.gz",
            "has_sig": false,
            "md5_digest": "fc35f7654a38a50b4285c4377119f29c",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.6.0",
            "size": 81969,
            "upload_time": "2023-10-27T00:19:19",
            "upload_time_iso_8601": "2023-10-27T00:19:19.303484Z",
            "url": "https://files.pythonhosted.org/packages/76/47/9f0d7bed086b9c5198523ca0ce4e2d6a14d004f96d8590bc87b757fc53dd/document-transformers-2.2.2.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2023-10-27 00:19:19",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "UKPLab",
    "github_project": "sentence-transformers",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "requirements": [
        {
            "name": "transformers",
            "specs": [
                [
                    ">=",
                    "4.6.0"
                ],
                [
                    "<",
                    "5.0.0"
                ]
            ]
        },
        {
            "name": "tokenizers",
            "specs": [
                [
                    ">=",
                    "0.10.3"
                ]
            ]
        },
        {
            "name": "tqdm",
            "specs": []
        },
        {
            "name": "torch",
            "specs": [
                [
                    ">=",
                    "1.6.0"
                ]
            ]
        },
        {
            "name": "torchvision",
            "specs": []
        },
        {
            "name": "numpy",
            "specs": []
        },
        {
            "name": "scikit-learn",
            "specs": []
        },
        {
            "name": "scipy",
            "specs": []
        },
        {
            "name": "nltk",
            "specs": []
        },
        {
            "name": "sentencepiece",
            "specs": []
        },
        {
            "name": "huggingface-hub",
            "specs": []
        }
    ],
    "lcname": "document-transformers"
}
        
Elapsed time: 0.16811s