[![license](https://img.shields.io/badge/License-MIT-brightgreen.svg)](https://github.com/asahi417/relbert/blob/master/LICENSE)
[![PyPI version](https://badge.fury.io/py/relbert.svg)](https://badge.fury.io/py/relbert)
[![PyPI pyversions](https://img.shields.io/pypi/pyversions/relbert.svg)](https://pypi.python.org/pypi/relbert/)
[![PyPI status](https://img.shields.io/pypi/status/relbert.svg)](https://pypi.python.org/pypi/relbert/)
# RelBERT
We release the package `relbert`, which includes the official implementation of
***Distilling Relation Embeddings from Pre-trained Language Models*** ([https://aclanthology.org/2021.emnlp-main.712/](https://aclanthology.org/2021.emnlp-main.712/)),
accepted at the [**EMNLP 2021 main conference**](https://2021.emnlp.org/).
### What's RelBERT?
RelBERT is a state-of-the-art lexical relation embedding model (i.e. a model that represents any word pair such as "Paris-France" as a fixed-length vector) based on large-scale pretrained masked language models. RelBERT also establishes a very strong baseline for solving analogies in a zero-shot transfer fashion, and even outperforms strong few-shot models such as [GPT-3](https://arxiv.org/abs/2005.14165) and [Analogical Proportion (AP)](https://aclanthology.org/2021.acl-long.280/).
| | SAT (full) | SAT | U2 | U4 | Google | BATS |
|:-------------------|-------------:|------:|-----:|-----:|---------:|-------:|
| [GloVe](https://nlp.stanford.edu/projects/glove/) | 48.9 | 47.8 | 46.5 | 39.8 | 96 | 68.7 |
| [FastText](https://fasttext.cc/) | 49.7 | 47.8 | 43 | 40.7 | 96.6 | 72 |
| [RELATIVE](http://josecamachocollados.com/papers/relative_ijcai2019.pdf) | 24.9 | 24.6 | 32.5 | 27.1 | 62 | 39 |
| [pair2vec](https://arxiv.org/abs/1810.08854) | 33.7 | 34.1 | 25.4 | 28.2 | 66.6 | 53.8 |
| [GPT-2 (AP)](https://aclanthology.org/2021.acl-long.280/) | 41.4 | 35.9 | 41.2 | 44.9 | 80.4 | 63.5 |
| [RoBERTa (AP)](https://aclanthology.org/2021.acl-long.280/) | 49.6 | 42.4 | 49.1 | 49.1 | 90.8 | 69.7 |
| [GPT-2 (tuned AP)](https://aclanthology.org/2021.acl-long.280/) | 57.8 | 56.7 | 50.9 | 49.5 | 95.2 | 81.2 |
| [RoBERTa (tuned AP)](https://aclanthology.org/2021.acl-long.280/) | 55.8 | 53.4 | 58.3 | 57.4 | 93.6 | 78.4 |
| [GPT3 (zeroshot)](https://arxiv.org/abs/2005.14165) | 53.7 | - | - | - | - | - |
| [GPT3 (fewshot)](https://arxiv.org/abs/2005.14165) | 65.2 | - | - | - | - | - |
| ***RelBERT*** | ***72.2*** | ***72.7*** | ***65.8*** | ***65.3*** | ***94.2*** | ***79.3*** |
[comment]: <> (| ***RelBERT (triplet)*** | ***67.9*** | ***67.7*** | ***68.0*** | ***63.2*** | ***94.2*** | ***78.9*** |)
[comment]: <> (| ***RelBERT (nce)*** | ***72.2*** | ***72.7*** | ***65.8*** | ***65.3*** | ***94.2*** | ***79.3*** |)
We also report the performance of RelBERT universal relation embeddings on lexical relation classification datasets, which reinforces the capability of RelBERT to model relations.
All datasets are public and available at the following links: [analogy questions](https://github.com/asahi417/AnalogyTools/releases/download/0.0.0/analogy_test_dataset.zip), [lexical relation classification](https://github.com/asahi417/AnalogyTools/releases/download/0.0.0/lexical_relation_dataset.zip).
Please see our paper to learn more about RelBERT, and [AnalogyTool](https://github.com/asahi417/AnalogyTools) or the [AP paper](https://aclanthology.org/2021.acl-long.280/) for more information about the datasets.
### What can we do with `relbert`?
In this repository, we release a Python package `relbert` for working with RelBERT and its checkpoints via the [huggingface modelhub](https://huggingface.co/models) and [gensim](https://radimrehurek.com/gensim/).
In brief, with `relbert` you can:
- **Get a high quality embedding vector** given a pair of words
- **Get similar word pairs (nearest neighbors)**
- **Reproduce the results** of our EMNLP 2021 paper.
## Get Started
```shell
pip install relbert
```
## Play with RelBERT
RelBERT can give you a high-quality relation embedding vector for a word pair. First, instantiate the model with a RelBERT checkpoint.
```python
from relbert import RelBERT
model = RelBERT()
```
Then pass a word pair to the model to get its embedding.
```python
# the embedding is a 1024-dimensional vector
v_tokyo_japan = model.get_embedding(['Tokyo', 'Japan'])
```
Let's run a quick experiment to check the embedding quality. Given the candidates `['Paris', 'France']`, `['music', 'pizza']`, and `['London', 'Tokyo']`, the pair that shares
the same relation as `['Tokyo', 'Japan']` is `['Paris', 'France']`. Can the RelBERT embedding recover this with simple cosine similarity?
```python
from relbert import cosine_similarity
# embed all three candidate pairs in one call
v_paris_france, v_music_pizza, v_london_tokyo = model.get_embedding([['Paris', 'France'], ['music', 'pizza'], ['London', 'Tokyo']])
cosine_similarity(v_tokyo_japan, v_paris_france)  # 0.999
cosine_similarity(v_tokyo_japan, v_music_pizza)  # 0.991
cosine_similarity(v_tokyo_japan, v_london_tokyo)  # 0.996
```
Bravo! The similarity between `['Tokyo', 'Japan']` and `['Paris', 'France']` is the highest among the candidates.
In fact, this is how we evaluate RelBERT on analogy questions.
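The selection step described above can be sketched in a few lines. This is a minimal illustration, not the evaluation code of the paper: `cosine` and `solve_analogy` are hypothetical helpers written for this example, and the 3-dimensional vectors are toy values standing in for real RelBERT embeddings.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (hypothetical helper)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def solve_analogy(query_vec, candidate_vecs):
    """Return the index of the candidate whose vector is most similar to the query."""
    scores = [cosine(query_vec, c) for c in candidate_vecs]
    return max(range(len(scores)), key=scores.__getitem__)

# Toy vectors (assumed values, NOT real RelBERT output).
query = [0.9, 0.1, 0.0]
candidates = [
    [0.8, 0.2, 0.1],  # points in nearly the same direction as the query
    [0.0, 1.0, 0.0],
    [0.1, 0.0, 1.0],
]
best = solve_analogy(query, candidates)  # index of the predicted answer
```

With real embeddings, `query` would be the vector of the query pair and `candidates` the vectors of the answer options, both obtained from `model.get_embedding`.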
### Nearest Neighbours of RelBERT
To find similar word pairs in terms of the RelBERT embedding, we convert the RelBERT embeddings to a gensim model file with a fixed vocabulary.
Specifically, we take the vocabulary of the [RELATIVE embedding](http://josecamachocollados.com/papers/relative_ijcai2019.pdf) that is released as a part of
[Analogy Tool](https://github.com/asahi417/AnalogyTools#relative-embedding), and generate the embedding for all the word pairs with RelBERT (`asahi417/relbert-roberta-large`).
Following the original vocabulary representation, the two words of a pair are joined by `__`, and the tokens of a multi-token word are joined by `_`, as in `New_york__Tokyo`.
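As a small illustration of this key format, the sketch below builds a lookup key from a word pair. `pair_to_key` is a hypothetical helper, not part of the `relbert` package, and it leaves casing untouched, so the input must already match the casing used in the vocabulary.

```python
def pair_to_key(head, tail):
    """Build a vocabulary key: spaces inside a word become `_`,
    and the two words of the pair are joined by `__`."""
    return "__".join(word.strip().replace(" ", "_") for word in (head, tail))

key = pair_to_key("New york", "Tokyo")
print(key)  # New_york__Tokyo
```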
The RelBERT embedding gensim file can be found [here](https://drive.google.com/file/d/1z3UeWALwf6EkujI3oYUCwkrIhMuJFdRA/view?usp=sharing). For example, you can get the nearest neighbours as below.
```python
from gensim.models import KeyedVectors
model = KeyedVectors.load_word2vec_format('gensim_model.bin', binary=True)
model.most_similar('Tokyo__Japan')
# [('Moscow__Russia', 0.9997282028198242),
#  ('Cairo__Egypt', 0.9997045993804932),
#  ('Baghdad__Iraq', 0.9997043013572693),
#  ('Helsinki__Finland', 0.9996970891952515),
#  ('Paris__France', 0.999695897102356),
#  ('Damascus__Syria', 0.9996891617774963),
#  ('Bangkok__Thailand', 0.9996803998947144),
#  ('Madrid__Spain', 0.9996673464775085),
#  ('Budapest__Hungary', 0.9996543526649475),
#  ('Beijing__China', 0.9996539354324341)]
```
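To give an idea of what a gensim-loadable embedding file looks like, here is a minimal sketch that writes word-pair vectors in the plain-text word2vec format (a header line with the entry count and dimension, then one key and its values per line). This is not necessarily how the released file was produced (that file is binary), `write_word2vec_text` is a hypothetical helper, and the vectors are toy values rather than RelBERT output.

```python
import os
import tempfile

def write_word2vec_text(path, vectors):
    """Write a {key: vector} dict in word2vec text format.

    Keys must not contain whitespace, which is why word pairs
    are stored as keys such as `Tokyo__Japan`.
    """
    dim = len(next(iter(vectors.values())))
    with open(path, "w", encoding="utf-8") as f:
        # header: number of entries, then vector dimension
        f.write(f"{len(vectors)} {dim}\n")
        for key, vec in vectors.items():
            if len(vec) != dim:
                raise ValueError(f"dimension mismatch for {key}")
            f.write(key + " " + " ".join(f"{x:.6f}" for x in vec) + "\n")

# Toy vectors (assumed values, NOT real RelBERT output).
toy = {
    "Tokyo__Japan": [0.1, 0.2, 0.3],
    "Paris__France": [0.1, 0.2, 0.4],
}
path = os.path.join(tempfile.mkdtemp(), "toy_relbert.txt")
write_word2vec_text(path, toy)
```

A file written this way can be loaded with `KeyedVectors.load_word2vec_format(path, binary=False)` and queried with `most_similar` in the same way as above.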
## Citation
If you use any of these resources, please cite the following [paper](https://arxiv.org/abs/2110.15705):
```
@inproceedings{ushio-etal-2021-distilling,
title = "Distilling Relation Embeddings from Pretrained Language Models",
author = "Ushio, Asahi and
Camacho-Collados, Jose and
Schockaert, Steven",
booktitle = "Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing",
month = nov,
year = "2021",
address = "Online and Punta Cana, Dominican Republic",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2021.emnlp-main.712",
doi = "10.18653/v1/2021.emnlp-main.712",
pages = "9044--9062",
abstract = "Pre-trained language models have been found to capture a surprisingly rich amount of lexical knowledge, ranging from commonsense properties of everyday concepts to detailed factual knowledge about named entities. Among others, this makes it possible to distill high-quality word vectors from pre-trained language models. However, it is currently unclear to what extent it is possible to distill relation embeddings, i.e. vectors that characterize the relationship between two words. Such relation embeddings are appealing because they can, in principle, encode relational knowledge in a more fine-grained way than is possible with knowledge graphs. To obtain relation embeddings from a pre-trained language model, we encode word pairs using a (manually or automatically generated) prompt, and we fine-tune the language model such that relationally similar word pairs yield similar output vectors. We find that the resulting relation embeddings are highly competitive on analogy (unsupervised) and relation classification (supervised) benchmarks, even without any task-specific fine-tuning. Source code to reproduce our experimental results and the model checkpoints are available in the following repository: https://github.com/asahi417/relbert",
}
```
Raw data
{
"_id": null,
"home_page": "https://github.com/asahi417/relbert",
"name": "relbert",
"maintainer": "",
"docs_url": null,
"requires_python": ">=3.6",
"maintainer_email": "",
"keywords": "nlp",
"author": "Asahi Ushio",
"author_email": "asahi1992ushio@gmail.com",
"download_url": "https://files.pythonhosted.org/packages/a7/f9/85d03e6de2f04d3b85a6d8f04471d3ad0d3aa7a5f1aca30b5bd64669f4c9/relbert-2.0.2.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "MIT",
"summary": "RelBERT: the state-of-the-art lexical relation embedding model.",
"version": "2.0.2",
"split_keywords": [
"nlp"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "a7f985d03e6de2f04d3b85a6d8f04471d3ad0d3aa7a5f1aca30b5bd64669f4c9",
"md5": "799c4ff7a7736e4c5e7ebcd314518d13",
"sha256": "0bcb1ce2f42f321c51ff117d642f09f08d4ca8a45b1e57e1b53d4d67031d8af1"
},
"downloads": -1,
"filename": "relbert-2.0.2.tar.gz",
"has_sig": false,
"md5_digest": "799c4ff7a7736e4c5e7ebcd314518d13",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.6",
"size": 28801,
"upload_time": "2023-01-24T14:31:40",
"upload_time_iso_8601": "2023-01-24T14:31:40.728608Z",
"url": "https://files.pythonhosted.org/packages/a7/f9/85d03e6de2f04d3b85a6d8f04471d3ad0d3aa7a5f1aca30b5bd64669f4c9/relbert-2.0.2.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2023-01-24 14:31:40",
"github": true,
"gitlab": false,
"bitbucket": false,
"github_user": "asahi417",
"github_project": "relbert",
"travis_ci": false,
"coveralls": false,
"github_actions": false,
"lcname": "relbert"
}