![](https://drive.google.com/uc?export=view&id=1UwPIfBrG021siM9SBAku2JNqG4R6avs6)
<div align="center">
# Retrieve, Read and LinK: Fast and Accurate Entity Linking and Relation Extraction on an Academic Budget
[![Conference](http://img.shields.io/badge/ACL-2024-4b44ce.svg)](https://2024.aclweb.org/)
[![Paper](http://img.shields.io/badge/paper-ACL--anthology-B31B1B.svg)](https://aclanthology.org/)
[![arXiv](https://img.shields.io/badge/arXiv-placeholder-b31b1b.svg)](https://arxiv.org/abs/placeholder)
[![Hugging Face Collection](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Collection-FCD21D)](https://huggingface.co/collections/sapienzanlp/relik-retrieve-read-and-link-665d9e4a5c3ecba98c1bef19)
[![Hugging Face Spaces](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-FCD21D)](https://huggingface.co/spaces/sapienzanlp/relik-demo)
[![Lightning](https://img.shields.io/badge/-Lightning-792ee5?logo=pytorchlightning&logoColor=white)](https://github.com/Lightning-AI/lightning)
[![PyTorch](https://img.shields.io/badge/PyTorch-orange?logo=pytorch)](https://pytorch.org/)
[![Code style: black](https://img.shields.io/badge/code%20style-black-000000)](https://github.com/psf/black)
[![Upload to PyPi](https://github.com/SapienzaNLP/relik/actions/workflows/python-publish-pypi.yml/badge.svg)](https://github.com/SapienzaNLP/relik/actions/workflows/python-publish-pypi.yml)
[![PyPi Version](https://img.shields.io/github/v/release/SapienzaNLP/relik)](https://github.com/SapienzaNLP/relik/releases)
</div>
A blazing fast and lightweight Information Extraction model for Entity Linking and Relation Extraction.
## Installation
Installation from PyPI
```console
pip install relik
```
<details>
<summary>Other installation options</summary>
#### Install with optional dependencies
Install with all the optional dependencies.
```bash
pip install relik[all]
```
Install with optional dependencies for training and evaluation.
```bash
pip install relik[train]
```
Install with optional dependencies for [FAISS](https://github.com/facebookresearch/faiss).
The FAISS PyPI package is only available for CPU. To use a GPU, install FAISS from source or use the conda package.
For CPU:
```bash
pip install relik[faiss]
```
For GPU:
```bash
conda create -n relik python=3.10
conda activate relik
# install pytorch
conda install -y pytorch=2.1.0 pytorch-cuda=12.1 -c pytorch -c nvidia
# GPU
conda install -y -c pytorch -c nvidia faiss-gpu=1.8.0
# or GPU with NVIDIA RAFT
conda install -y -c pytorch -c nvidia -c rapidsai -c conda-forge faiss-gpu-raft=1.8.0
pip install relik
```
Install with optional dependencies for serving the models with
[FastAPI](https://fastapi.tiangolo.com/) and [Ray](https://docs.ray.io/en/latest/serve/quickstart.html).
```bash
pip install relik[serve]
```
#### Installation from source
```bash
git clone https://github.com/SapienzaNLP/relik.git
cd relik
pip install -e .[all]
```
</details>
## Quick Start
[//]: # (Write a short description of the model and how to use it with the `from_pretrained` method.)
ReLiK is a lightweight and fast model for **Entity Linking** and **Relation Extraction**.
It is composed of two main components: a retriever and a reader.
The retriever is responsible for retrieving relevant documents from a large collection of documents,
while the reader is responsible for extracting entities and relations from the retrieved documents.
ReLiK can be used with the `from_pretrained` method to load a pre-trained pipeline.
Here is an example of how to use ReLiK for Entity Linking:
```python
from relik import Relik
from relik.inference.data.objects import RelikOutput
relik = Relik.from_pretrained("sapienzanlp/relik-entity-linking-large")
relik_out: RelikOutput = relik("Michael Jordan was one of the best players in the NBA.")
# RelikOutput(
# text="Michael Jordan was one of the best players in the NBA.",
# tokens=['Michael', 'Jordan', 'was', 'one', 'of', 'the', 'best', 'players', 'in', 'the', 'NBA', '.'],
# id=0,
# spans=[
# Span(start=0, end=14, label="Michael Jordan", text="Michael Jordan"),
# Span(start=50, end=53, label="National Basketball Association", text="NBA"),
# ],
# triples=[],
# candidates=Candidates(
# span=[
# [
# [
# {"text": "Michael Jordan", "id": 4484083},
# {"text": "National Basketball Association", "id": 5209815},
# {"text": "Walter Jordan", "id": 2340190},
# {"text": "Jordan", "id": 3486773},
# {"text": "50 Greatest Players in NBA History", "id": 1742909},
# ...
# ]
# ]
# ]
# ),
# )
```
and for Relation Extraction:
```python
from relik import Relik
from relik.inference.data.objects import RelikOutput
relik = Relik.from_pretrained("sapienzanlp/relik-relation-extraction-large")
relik_out: RelikOutput = relik("Michael Jordan was one of the best players in the NBA.")
```
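In the Entity Linking output shown earlier, `start` and `end` read as character offsets into `text`, with `end` exclusive. The short check below uses values copied from that output; plain tuples stand in for the `Span` objects, so no model is loaded.

```python
# Values copied from the Entity Linking example output; tuples stand in for Span objects.
text = "Michael Jordan was one of the best players in the NBA."
spans = [
    (0, 14, "Michael Jordan"),
    (50, 53, "National Basketball Association"),
]

# Slicing with [start:end] recovers the surface mention behind each linked entity.
mentions = [(text[start:end], label) for start, end, label in spans]
for mention, label in mentions:
    print(f"{mention!r} -> {label}")
```

With a real `RelikOutput`, the same slicing should apply to each element of `relik_out.spans`, assuming the attribute names match the printed representation.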
The full list of available models can be found on [🤗 Hugging Face](https://huggingface.co/collections/sapienzanlp/relik-retrieve-read-and-link-665d9e4a5c3ecba98c1bef19).
Retrievers and Readers can be used separately.
In the case of retriever-only ReLiK, the output will contain the candidates for the input text.
```python
from relik import Relik
from relik.inference.data.objects import RelikOutput
# If you want to use only the retriever
retriever = Relik.from_pretrained("sapienzanlp/relik-entity-linking-large", reader=None)
relik_out: RelikOutput = retriever("Michael Jordan was one of the best players in the NBA.")
# RelikOutput(
# text="Michael Jordan was one of the best players in the NBA.",
# tokens=['Michael', 'Jordan', 'was', 'one', 'of', 'the', 'best', 'players', 'in', 'the', 'NBA', '.'],
# id=0,
# spans=[],
# triples=[],
# candidates=Candidates(
# span=[
# [
# {"text": "Michael Jordan", "id": 4484083},
# {"text": "National Basketball Association", "id": 5209815},
# {"text": "Walter Jordan", "id": 2340190},
# {"text": "Jordan", "id": 3486773},
# {"text": "50 Greatest Players in NBA History", "id": 1742909},
# ...
# ]
# ],
# triplet=[],
# ),
# )
```
```python
from relik import Relik
from relik.inference.data.objects import RelikOutput
# If you want to use only the reader
reader = Relik.from_pretrained("sapienzanlp/relik-entity-linking-large", retriever=None)
candidates = [
"Michael Jordan",
"National Basketball Association",
"Walter Jordan",
"Jordan",
"50 Greatest Players in NBA History",
]
text = "Michael Jordan was one of the best players in the NBA."
relik_out: RelikOutput = reader(text, candidates=candidates)
# RelikOutput(
# text="Michael Jordan was one of the best players in the NBA.",
# tokens=['Michael', 'Jordan', 'was', 'one', 'of', 'the', 'best', 'players', 'in', 'the', 'NBA', '.'],
# id=0,
# spans=[
# Span(start=0, end=14, label="Michael Jordan", text="Michael Jordan"),
# Span(start=50, end=53, label="National Basketball Association", text="NBA"),
# ],
# triples=[],
# candidates=Candidates(
# span=[
# [
# [
# {
# "text": "Michael Jordan",
# "id": -731245042436891448,
# },
# {
# "text": "National Basketball Association",
# "id": 8135443493867772328,
# },
# {
# "text": "Walter Jordan",
# "id": -5873847607270755146,
# "metadata": {},
# },
# {"text": "Jordan", "id": 6387058293887192208, "metadata": {}},
# {
# "text": "50 Greatest Players in NBA History",
# "id": 2173802663468652889,
# },
# ]
# ]
# ],
# ),
# )
```
### CLI
ReLiK provides a CLI to perform inference on a text file or a directory of text files. The CLI can be used as follows:
```bash
relik inference --help
Usage: relik inference [OPTIONS] MODEL_NAME_OR_PATH INPUT_PATH OUTPUT_PATH
╭─ Arguments ─────────────────────────────────────────────────────────────────────────────────────────────╮
│ * model_name_or_path TEXT [default: None] [required] │
│ * input_path TEXT [default: None] [required] │
│ * output_path TEXT [default: None] [required] │
╰─────────────────────────────────────────────────────────────────────────────────────────────────────────╯
╭─ Options ───────────────────────────────────────────────────────────────────────────────────────────────╮
│ --batch-size INTEGER [default: 8] │
│ --num-workers INTEGER [default: 4] │
│ --device TEXT [default: cuda] │
│ --precision TEXT [default: fp16] │
│ --top-k INTEGER [default: 100] │
│ --window-size INTEGER [default: None] │
│ --window-stride INTEGER [default: None] │
│ --annotation-type TEXT [default: char] │
│ --progress-bar --no-progress-bar [default: progress-bar] │
│ --model-kwargs TEXT [default: None] │
│ --inference-kwargs TEXT [default: None] │
│ --help Show this message and exit. │
╰─────────────────────────────────────────────────────────────────────────────────────────────────────────╯
```
For example:
```bash
relik inference sapienzanlp/relik-entity-linking-large data.txt output.jsonl
```
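When inference is launched from a Python script, the command can be assembled from the options listed in the help output above. The helper below is a hypothetical convenience function, not part of ReLiK; it only uses the positional arguments and the `--batch-size`, `--device` and `--top-k` flags shown in that output.

```python
import shlex


def build_inference_command(model, input_path, output_path,
                            batch_size=8, device="cuda", top_k=100):
    """Hypothetical helper: build the argument list for `relik inference`."""
    return [
        "relik", "inference",
        model, input_path, output_path,
        "--batch-size", str(batch_size),
        "--device", device,
        "--top-k", str(top_k),
    ]


cmd = build_inference_command(
    "sapienzanlp/relik-entity-linking-large", "data.txt", "output.jsonl", device="cpu"
)
print(shlex.join(cmd))  # shell-safe rendering of the command

# To actually run it (requires relik to be installed):
# import subprocess
# subprocess.run(cmd, check=True)
```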
## Before You Start
In the following sections, we provide a step-by-step guide on how to prepare the data, train the retriever and reader, and evaluate the model.
### Entity Linking
All your data should have the following starting structure:
```jsonl
{
"doc_id": int, # Unique identifier for the document
"doc_text": txt, # Text of the document
"doc_annotations": # Char level annotations
[
[start, end, label],
[start, end, label],
...
]
}
```
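As a concrete illustration of the structure above, the sketch below builds one record and serializes it as a single JSONL line. The `make_record` helper is hypothetical, and treating `end` as exclusive is an assumption based on the offsets in the Quick Start output.

```python
import json


def make_record(doc_id, doc_text, annotations):
    """Hypothetical helper: annotations are (start, end, label) character spans."""
    for start, end, _label in annotations:
        # Offsets must fall inside the text and describe a non-empty span
        # (assumption: `end` is exclusive).
        if not (0 <= start < end <= len(doc_text)):
            raise ValueError(f"invalid span ({start}, {end}) in document {doc_id}")
    return {
        "doc_id": doc_id,
        "doc_text": doc_text,
        "doc_annotations": [[start, end, label] for start, end, label in annotations],
    }


record = make_record(
    0,
    "Michael Jordan was one of the best players in the NBA.",
    [(0, 14, "Michael Jordan"), (50, 53, "National Basketball Association")],
)
line = json.dumps(record)  # one JSON object per line in the .jsonl file
print(line)
```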
We used the BLINK (Wu et al., 2019) and AIDA (Hoffart et al., 2011) datasets for training and evaluation.
More specifically, we used the BLINK dataset for pre-training the retriever and the AIDA dataset for fine-tuning the retriever and training the reader.
The BLINK dataset can be downloaded from the [GENRE](https://github.com/facebookresearch/GENRE) repo from
[here](https://github.com/facebookresearch/GENRE/blob/main/scripts_genre/download_all_datasets.sh).
We used `blink-train-kilt.jsonl` and `blink-dev-kilt.jsonl` as training and validation datasets.
Assuming we have downloaded the two files in the `data/blink` folder, we converted the BLINK dataset to the ReLiK format using the following script:
```console
# Train
python scripts/data/blink/preprocess_genre_blink.py \
data/blink/blink-train-kilt.jsonl \
data/blink/processed/blink-train-kilt-relik.jsonl
# Dev
python scripts/data/blink/preprocess_genre_blink.py \
data/blink/blink-dev-kilt.jsonl \
data/blink/processed/blink-dev-kilt-relik.jsonl
```
The AIDA dataset is not publicly available, but we provide the file we used without the `text` field. You can find the file in ReLiK format in the `data/aida/processed` folder.
The Wikipedia index we used can be downloaded from [here](https://huggingface.co/sapienzanlp/relik-retriever-e5-base-v2-aida-blink-wikipedia-index/blob/main/documents.jsonl).
### Relation Extraction
TODO
## Retriever
We perform a two-step training process for the retriever. First, we "pre-train" the retriever using the BLINK (Wu et al., 2019) dataset, and then we "fine-tune" it using AIDA (Hoffart et al., 2011).
### Data Preparation
The retriever requires a dataset in a format similar to [DPR](https://github.com/facebookresearch/DPR): a `jsonl` file where each line is a dictionary with the following keys:
```json lines
{
"question": "....",
"positive_ctxs": [{
"title": "...",
"text": "...."
}],
"negative_ctxs": [{
"title": "...",
"text": "...."
}],
"hard_negative_ctxs": [{
"title": "...",
"text": "...."
}]
}
```
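A record with these keys can be assembled as in the sketch below. The `to_dpr_record` helper and the example strings are illustrative only; the conversion scripts in the repository remain the reference for how real records are built.

```python
import json


def to_dpr_record(question, positives, negatives=(), hard_negatives=()):
    """Hypothetical helper: each context is given as a (title, text) pair."""

    def ctx(pair):
        title, text = pair
        return {"title": title, "text": text}

    return {
        "question": question,
        "positive_ctxs": [ctx(p) for p in positives],
        "negative_ctxs": [ctx(n) for n in negatives],
        "hard_negative_ctxs": [ctx(h) for h in hard_negatives],
    }


dpr_record = to_dpr_record(
    "Michael Jordan was one of the best players in the NBA.",
    positives=[("Michael Jordan", "illustrative passage text")],
    hard_negatives=[("Walter Jordan", "illustrative passage text")],
)
print(json.dumps(dpr_record))  # one line of the training .jsonl file
```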
The retriever also needs an index to search for the documents. The documents to index can be either a jsonl file or a tsv file similar to
[DPR](https://github.com/facebookresearch/DPR):
- `jsonl`: each line is a json object with the following keys: `id`, `text`, `metadata`
- `tsv`: each line is a tab-separated string with the `id` and `text` column,
followed by any other column that will be stored in the `metadata` field
`jsonl` example:
```json lines
{
"id": "...",
"text": "...",
"metadata": ["{...}"]
},
...
```
`tsv` example:
```tsv
id \t text \t any other column
...
```
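The two document formats carry the same information, so a `tsv` file can be rewritten as `jsonl` with a few lines of Python. This is a sketch under two assumptions: the file has no header row, and every row has at least the `id` and `text` columns. Extra columns are stored as a list under `metadata`.

```python
import csv
import io
import json


def tsv_to_jsonl_lines(tsv_text):
    """Convert tab-separated documents into JSON lines with id, text, metadata."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    lines = []
    for row in reader:
        if not row:
            continue  # skip blank lines
        doc_id, text, *extra = row  # any further columns become metadata
        lines.append(json.dumps({"id": doc_id, "text": text, "metadata": extra}))
    return lines


example_tsv = "1\tMichael Jordan\tathlete\n2\tNational Basketball Association\n"
jsonl_lines = tsv_to_jsonl_lines(example_tsv)
print("\n".join(jsonl_lines))
```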
#### Entity Linking
##### BLINK
Once you have the BLINK dataset in the ReLiK format, you can create the windows with the following script:
```console
# train
python scripts/data/create_windows.py \
data/blink/processed/blink-train-kilt-relik.jsonl \
data/blink/processed/blink-train-kilt-relik-windowed.jsonl
# dev
python scripts/data/create_windows.py \
data/blink/processed/blink-dev-kilt-relik.jsonl \
data/blink/processed/blink-dev-kilt-relik-windowed.jsonl
```
and then convert it to the DPR format:
```console
# train
python scripts/data/blink/convert_to_dpr.py \
data/blink/processed/blink-train-kilt-relik-windowed.jsonl \
data/blink/processed/blink-train-kilt-relik-windowed-dpr.jsonl
# dev
python scripts/data/blink/convert_to_dpr.py \
data/blink/processed/blink-dev-kilt-relik-windowed.jsonl \
data/blink/processed/blink-dev-kilt-relik-windowed-dpr.jsonl
```
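To make the windowing step above concrete, the sketch below splits a token list into overlapping windows given a size and a stride. It illustrates the general sliding-window idea only. It is not the implementation in `scripts/data/create_windows.py`, whose tokenization and defaults are not described here.

```python
def sliding_windows(tokens, window_size, window_stride):
    """Split tokens into windows of `window_size`, advancing by `window_stride`."""
    if window_size <= 0 or window_stride <= 0:
        raise ValueError("window_size and window_stride must be positive")
    windows = []
    start = 0
    while True:
        windows.append(tokens[start:start + window_size])
        if start + window_size >= len(tokens):
            break  # the current window already reaches the end of the document
        start += window_stride
    return windows


tokens = "Michael Jordan was one of the best players in the NBA .".split()
windows = sliding_windows(tokens, window_size=8, window_stride=4)
for window in windows:
    print(" ".join(window))
```

A stride smaller than the window size makes consecutive windows overlap, so a mention near a window boundary still appears whole in at least one window.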
##### AIDA
Since the AIDA dataset is not publicly available, we provide only its annotations in the ReLiK format as an example.
Assuming you have the full AIDA dataset in `data/aida`, you can convert it to the ReLiK format and then create the windows with the following script:
```console
python scripts/data/create_windows.py \
data/aida/processed/aida-train-relik.jsonl \
data/aida/processed/aida-train-relik-windowed.jsonl
```
and then convert it to the DPR format:
```console
python scripts/data/convert_to_dpr.py \
data/aida/processed/aida-train-relik-windowed.jsonl \
data/aida/processed/aida-train-relik-windowed-dpr.jsonl
```
### Training the model
The `relik retriever train` command can be used to train the retriever. It requires the following arguments:
- `config_path`: The path to the configuration file.
- `overrides`: A list of overrides to the configuration file, in the format `key=value`.
Examples of configuration files can be found in the `relik/retriever/conf` folder.
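The `key=value` overrides use dotted keys to address nested configuration entries. The toy function below shows how such overrides map onto a nested dictionary. It is an illustration only: the parsing done by the actual CLI is not shown in this README, and values are kept as strings here.

```python
def apply_overrides(config, overrides):
    """Toy illustration: apply dotted `key=value` overrides to a nested dict."""
    for item in overrides:
        key, sep, value = item.partition("=")
        if not sep or not key:
            raise ValueError(f"expected key=value, got {item!r}")
        *parents, leaf = key.split(".")
        node = config
        for name in parents:
            node = node.setdefault(name, {})  # create intermediate levels as needed
        node[leaf] = value
    return config


config = {"model": {"language_model": "placeholder-model"}}
apply_overrides(
    config,
    ["model.language_model=intfloat/e5-base-v2", "train.only_test=True"],
)
print(config)
```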
#### Entity Linking
<!-- You can find an example in `relik/retriever/conf/finetune_iterable_in_batch.yaml`. -->
The configuration files in `relik/retriever/conf` are `pretrain_iterable_in_batch.yaml` and `finetune_iterable_in_batch.yaml`, which we used to pre-train and fine-tune the retriever, respectively.
For instance, to train the retriever on the AIDA dataset, you can run the following command:
```console
relik retriever train relik/retriever/conf/finetune_iterable_in_batch.yaml \
model.language_model=intfloat/e5-base-v2 \
train_dataset_path=data/aida/processed/aida-train-relik-windowed-dpr.jsonl \
val_dataset_path=data/aida/processed/aida-dev-relik-windowed-dpr.jsonl \
test_dataset_path=data/aida/processed/aida-test-relik-windowed-dpr.jsonl
```
#### Relation Extraction
TODO
### Inference
By passing `train.only_test=True` to the `relik retriever train` command, you can skip the training and only evaluate the model.
It also needs the path to the PyTorch Lightning checkpoint and the dataset to evaluate on.
```console
relik retriever train relik/retriever/conf/finetune_iterable_in_batch.yaml \
train.only_test=True \
test_dataset_path=data/aida/processed/aida-test-relik-windowed-dpr.jsonl \
model.checkpoint_path=path/to/checkpoint
```
The retriever encoder can be saved from the checkpoint with the following command:
```python
from relik.retriever.lightning_modules.pl_modules import GoldenRetrieverPLModule
checkpoint_path = "path/to/checkpoint"
retriever_folder = "path/to/retriever"
# If you want to push the model to the Hugging Face Hub set push_to_hub=True
push_to_hub = False
# If you want to push the model to the Hugging Face Hub set the repo_id
repo_id = "sapienzanlp/relik-retriever-e5-base-v2-aida-blink-encoder"
pl_module = GoldenRetrieverPLModule.load_from_checkpoint(checkpoint_path)
pl_module.model.save_pretrained(retriever_folder, push_to_hub=push_to_hub, repo_id=repo_id)
```
With `push_to_hub=True`, the model is pushed to the 🤗 Hugging Face Hub, and `repo_id` names the repository it is pushed to.
The retriever needs an index to search for the documents. The index can be created using the `relik retriever build-index` command:
```bash
relik retriever build-index --help
Usage: relik retriever build-index [OPTIONS] QUESTION_ENCODER_NAME_OR_PATH
DOCUMENT_PATH OUTPUT_FOLDER
╭─ Arguments ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ * question_encoder_name_or_path TEXT [default: None] [required] │
│ * document_path TEXT [default: None] [required] │
│ * output_folder TEXT [default: None] [required] │
╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
╭─ Options ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ --document-file-type TEXT [default: jsonl] │
│ --passage-encoder-name-or-path TEXT [default: None] │
│ --indexer-class TEXT [default: relik.retriever.indexers.inmemory.InMemoryDocumentIndex] │
│ --batch-size INTEGER [default: 512] │
│ --num-workers INTEGER [default: 4] │
│ --passage-max-length INTEGER [default: 64] │
│ --device TEXT [default: cuda] │
│ --index-device TEXT [default: cpu] │
│ --precision TEXT [default: fp32] │
│ --push-to-hub --no-push-to-hub [default: no-push-to-hub] │
│ --repo-id TEXT [default: None] │
│ --help Show this message and exit. │
╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
```
With the encoder and the index, the retriever can be loaded from a repo id or a local path:
```python
from relik.retriever import GoldenRetriever
encoder_name_or_path = "sapienzanlp/relik-retriever-e5-base-v2-aida-blink-encoder"
index_name_or_path = "sapienzanlp/relik-retriever-e5-base-v2-aida-blink-wikipedia-index"
retriever = GoldenRetriever(
question_encoder=encoder_name_or_path,
document_index=index_name_or_path,
device="cuda", # or "cpu"
precision="16", # or "32", "bf16"
index_device="cuda", # or "cpu"
index_precision="16", # or "32", "bf16"
)
```
and then it can be used to retrieve documents:
```python
retriever.retrieve("Michael Jordan was one of the best players in the NBA.", top_k=100)
```
## Reader
The reader is responsible for extracting entities and relations from documents, given a set of candidates (e.g., possible entities or relations).
The reader can be trained for span extraction or triplet extraction.
The `RelikReaderForSpanExtraction` is used for span extraction, i.e. Entity Linking, while the `RelikReaderForTripletExtraction` is used for triplet extraction, i.e. Relation Extraction.
### Data Preparation
The reader requires the windowed dataset we created in section [Before You Start](#before-you-start), augmented with the candidates from the retriever.
The candidates can be added to the dataset using the `relik retriever add-candidates` command.
```bash
relik retriever add-candidates --help
Usage: relik retriever add-candidates [OPTIONS] QUESTION_ENCODER_NAME_OR_PATH
DOCUMENT_NAME_OR_PATH INPUT_PATH
OUTPUT_PATH
╭─ Arguments ─────────────────────────────────────────────────────────────────────────────────────────────────╮
│ * question_encoder_name_or_path TEXT [default: None] [required] │
│ * document_name_or_path TEXT [default: None] [required] │
│ * input_path TEXT [default: None] [required] │
│ * output_path TEXT [default: None] [required] │
╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
╭─ Options ───────────────────────────────────────────────────────────────────────────────────────────────────╮
│ --passage-encoder-name-or-path TEXT [default: None] │
│ --top-k INTEGER [default: 100] │
│ --batch-size INTEGER [default: 128] │
│ --num-workers INTEGER [default: 4] │
│ --device TEXT [default: cuda] │
│ --index-device TEXT [default: cpu] │
│ --precision TEXT [default: fp32] │
│ --use-doc-topics --no-use-doc-topics [default: no-use-doc-topics] │
│ --help Show this message and exit. │
╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
```
### Training the model
Similar to the retriever, the `relik reader train` command can be used to train the reader. It requires the following arguments:
- `config_path`: The path to the configuration file.
- `overrides`: A list of overrides to the configuration file, in the format `key=value`.
Examples of configuration files can be found in the `relik/reader/conf` folder.
#### Entity Linking
The configuration files in `relik/reader/conf` are `large.yaml` and `base.yaml`, which we used to train the large and base reader, respectively.
For instance, to train the large reader on the AIDA dataset run:
```console
relik reader train relik/reader/conf/large.yaml \
train_dataset_path=data/aida/processed/aida-train-relik-windowed-candidates.jsonl \
val_dataset_path=data/aida/processed/aida-dev-relik-windowed-candidates.jsonl \
test_dataset_path=data/aida/processed/aida-dev-relik-windowed-candidates.jsonl
```
#### Relation Extraction
TODO
### Inference
The reader can be saved from the checkpoint with the following command:
```python
from relik.reader.lightning_modules.relik_reader_pl_module import RelikReaderPLModule
checkpoint_path = "path/to/checkpoint"
reader_folder = "path/to/reader"
# If you want to push the model to the Hugging Face Hub set push_to_hub=True
push_to_hub = False
# If you want to push the model to the Hugging Face Hub set the repo_id
repo_id = "sapienzanlp/relik-reader-deberta-v3-large-aida"
# Load the Lightning module from the checkpoint path defined above
pl_model = RelikReaderPLModule.load_from_checkpoint(checkpoint_path)
# Save the underlying reader model to the local folder (and to the Hub if requested)
pl_model.relik_reader_core_model.save_pretrained(reader_folder, push_to_hub=push_to_hub, repo_id=repo_id)
```
With `push_to_hub=True`, the model is pushed to the 🤗 Hugging Face Hub, and `repo_id` names the repository it is pushed to.
The reader can be loaded from a repo id or a local path:
```python
from relik.reader import RelikReaderForSpanExtraction, RelikReaderForTripletExtraction
# the reader for span extraction
reader_span = RelikReaderForSpanExtraction(
"sapienzanlp/relik-reader-deberta-v3-large-aida"
)
# the reader for triplet extraction
reader_triplets = RelikReaderForTripletExtraction(
"sapienzanlp/relik-reader-deberta-v3-large-nyt"
)
```
and used to extract entities and relations:
```python
# an example of candidates for the reader
candidates = ["Michael Jordan", "NBA", "Chicago Bulls", "Basketball", "United States"]
reader_span.read("Michael Jordan was one of the best players in the NBA.", candidates=candidates)
```
## Performance
### Entity Linking
We evaluate the performance of ReLiK on Entity Linking using [GERBIL](http://gerbil-qa.aksw.org/gerbil/). The following table shows the results (InKB Micro F1) of ReLiK Large and Base:
| Model | AIDA-B | MSNBC | Der | K50 | R128 | R500 | OKE15 | OKE16 | AVG | AVG-OOD | Speed (ms) |
|-------|--------|-------|-----|-----|------|------|-------|-------|-----|---------|------------|
| Base | 85.25 | 72.27 | 55.59 | 68.02 | 48.13 | 41.61 | 62.53 | 52.25 | 60.71 | 57.2 | n |
| Large | 86.37 | 75.04 | 56.25 | 72.8 | 51.67 | 42.95 | 65.12 | 57.21 | 63.43 | 60.15 | n |
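The two average columns can be reproduced from the per-dataset scores. Judging from the numbers, AVG is the mean over all eight datasets and AVG-OOD is the mean over the seven datasets other than AIDA-B. This reading is inferred from the values in the table rather than stated in it.

```python
# Per-dataset InKB Micro F1 copied from the table, in column order:
# AIDA-B, MSNBC, Der, K50, R128, R500, OKE15, OKE16
scores = {
    "Base": [85.25, 72.27, 55.59, 68.02, 48.13, 41.61, 62.53, 52.25],
    "Large": [86.37, 75.04, 56.25, 72.8, 51.67, 42.95, 65.12, 57.21],
}

averages = {}
for model, values in scores.items():
    avg = sum(values) / len(values)
    out_of_domain = values[1:]  # assumption: everything except AIDA-B
    avg_ood = sum(out_of_domain) / len(out_of_domain)
    averages[model] = (avg, avg_ood)
    print(f"{model}: AVG={avg:.2f} AVG-OOD={avg_ood:.2f}")
```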
To evaluate ReLiK we use the following steps:
1. Download the GERBIL server from [here](LINK).
2. Start the GERBIL server:
```console
cd gerbil && ./start.sh
```
3. Start the following services:
```console
cd gerbil-SpotWrapNifWS4Test && mvn clean -Dmaven.tomcat.port=1235 tomcat:run
```
4. Start the ReLiK server for GERBIL, providing the model name as an argument (e.g. `sapienzanlp/relik-entity-linking-large`):
```console
python relik/reader/utils/gerbil_server.py --relik-model-name sapienzanlp/relik-entity-linking-large
```
5. Open the URL [http://localhost:1234/gerbil](http://localhost:1234/gerbil) and:
- Select A2KB as the experiment type
- Select "Ma - strong annotation match"
- In the Name field, write the name you want to give to the experiment
- In the URI field, write: [http://localhost:1235/gerbil-spotWrapNifWS4Test/myalgorithm](http://localhost:1235/gerbil-spotWrapNifWS4Test/myalgorithm)
- Select the datasets (We use AIDA-B, MSNBC, Der, K50, R128, R500, OKE15, OKE16)
- Finally, run experiment
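For intuition about the reported metric, the sketch below computes a micro-averaged F1 in which a prediction counts as correct only if its start, end and label all match a gold annotation exactly. This is a simplified stand-in for the strong annotation match setting, not GERBIL's implementation, and the example annotations are made up.

```python
def micro_f1(gold_by_doc, pred_by_doc):
    """Micro F1 over (start, end, label) annotations pooled across documents."""
    tp = fp = fn = 0
    for doc_id in set(gold_by_doc) | set(pred_by_doc):
        gold = set(gold_by_doc.get(doc_id, ()))
        pred = set(pred_by_doc.get(doc_id, ()))
        tp += len(gold & pred)  # exact matches
        fp += len(pred - gold)  # predicted but not in gold
        fn += len(gold - pred)  # gold but not predicted
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


gold = {0: [(0, 14, "Michael Jordan"), (50, 53, "National Basketball Association")]}
pred = {0: [(0, 14, "Michael Jordan"), (50, 53, "NBA draft")]}
score = micro_f1(gold, pred)
print(f"micro F1 = {score:.2f}")
```

Counts are pooled over all documents before precision and recall are computed, which is what distinguishes micro from macro averaging.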
### Relation Extraction
- TODO
## Cite this work
If you use any part of this work, please consider citing the paper as follows:
```bibtex
@inproceedings{orlando-etal-2024-relik,
title = "Retrieve, Read and LinK: Fast and Accurate Entity Linking and Relation Extraction on an Academic Budget",
author = "Orlando, Riccardo and Huguet Cabot, Pere-Llu{\'\i}s and Barba, Edoardo and Navigli, Roberto",
booktitle = "Findings of the Association for Computational Linguistics: ACL 2024",
month = aug,
year = "2024",
address = "Bangkok, Thailand",
publisher = "Association for Computational Linguistics",
}
```
## License
TODO
<!-- The data is licensed under [Creative Commons Attribution-ShareAlike 4.0](https://creativecommons.org/licenses/by-sa/4.0/). -->
\u2502\n\u2570\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256f\n```\n\nFor example:\n\n```bash\nrelik inference sapienzanlp/relik-entity-linking-large data.txt output.jsonl\n```\n\n## Before You Start\n\nIn the following sections, we provide a step-by-step guide on how to prepare the data, train the retriever and reader, and evaluate the model.\n\n### Entity Linking\n\nAll your data should have the following starting structure:\n\n```jsonl\n{\n \"doc_id\": int, # Unique identifier for the document\n \"doc_text\": txt, # Text of the document\n \"doc_annotations\": # Char level annotations\n [\n [start, end, label],\n [start, end, label],\n ...\n ]\n}\n```\n\nWe used BLINK (Wu et al., 2019) and AIDA (Hoffart et al, 2011) datasets for training and evaluation.\nMore specifically, we used the BLINK dataset for pre-training the retriever and the AIDA dataset for fine-tuning the retriever and training the reader.\n\nThe BLINK dataset can be downloaded from the [GENRE](https://github.com/facebookresearch/GENRE) repo from\n[here](https://github.com/facebookresearch/GENRE/blob/main/scripts_genre/download_all_datasets.sh).\nWe used `blink-train-kilt.jsonl` and `blink-dev-kilt.jsonl` as training and validation datasets.\nAssuming we have downloaded the two files in the `data/blink` folder, we converted the BLINK dataset to the ReLiK format using the following script:\n\n```console\n# Train\npython 
scripts/data/blink/preprocess_genre_blink.py \\\n data/blink/blink-train-kilt.jsonl \\\n data/blink/processed/blink-train-kilt-relik.jsonl\n\n# Dev\npython scripts/data/blink/preprocess_genre_blink.py \\\n data/blink/blink-dev-kilt.jsonl \\\n data/blink/processed/blink-dev-kilt-relik.jsonl\n```\n\nThe AIDA dataset is not publicly available, but we provide the file we used without the `text` field. You can find the file in ReLiK format in the `data/aida/processed` folder.\n\nThe Wikipedia index we used can be downloaded from [here](https://huggingface.co/sapienzanlp/relik-retriever-e5-base-v2-aida-blink-wikipedia-index/blob/main/documents.jsonl).\n\n### Relation Extraction\n\nTODO\n\n## Retriever\n\nWe perform a two-step training process for the retriever. First, we \"pre-train\" the retriever on the BLINK dataset (Wu et al., 2019), and then we \"fine-tune\" it on AIDA (Hoffart et al., 2011).\n\n### Data Preparation\n\nThe retriever requires a dataset in a format similar to [DPR](https://github.com/facebookresearch/DPR): a `jsonl` file where each line is a dictionary with the following keys:\n\n```json lines\n{\n \"question\": \"....\",\n \"positive_ctxs\": [{\n \"title\": \"...\",\n \"text\": \"....\"\n }],\n \"negative_ctxs\": [{\n \"title\": \"...\",\n \"text\": \"....\"\n }],\n \"hard_negative_ctxs\": [{\n \"title\": \"...\",\n \"text\": \"....\"\n }]\n}\n```\n\nThe retriever also needs an index to search for the documents. 
The documents to index can be either a jsonl file or a tsv file similar to\n[DPR](https://github.com/facebookresearch/DPR):\n\n- `jsonl`: each line is a JSON object with the following keys: `id`, `text`, `metadata`\n- `tsv`: each line is a tab-separated string with the `id` and `text` columns,\n followed by any other column that will be stored in the `metadata` field\n\n`jsonl` example:\n\n```json lines\n{\n \"id\": \"...\",\n \"text\": \"...\",\n \"metadata\": [\"{...}\"]\n},\n...\n```\n\n`tsv` example:\n\n```tsv\nid \\t text \\t any other column\n...\n```\n\n#### Entity Linking\n\n##### BLINK\n\nOnce you have the BLINK dataset in the ReLiK format, you can create the windows with the following script:\n\n```console\n# train\npython scripts/data/create_windows.py \\\n data/blink/processed/blink-train-kilt-relik.jsonl \\\n data/blink/processed/blink-train-kilt-relik-windowed.jsonl\n\n# dev\npython scripts/data/create_windows.py \\\n data/blink/processed/blink-dev-kilt-relik.jsonl \\\n data/blink/processed/blink-dev-kilt-relik-windowed.jsonl\n```\n\nand then convert it to the DPR format:\n\n```console\n# train\npython scripts/data/blink/convert_to_dpr.py \\\n data/blink/processed/blink-train-kilt-relik-windowed.jsonl \\\n data/blink/processed/blink-train-kilt-relik-windowed-dpr.jsonl\n\n# dev\npython scripts/data/blink/convert_to_dpr.py \\\n data/blink/processed/blink-dev-kilt-relik-windowed.jsonl \\\n data/blink/processed/blink-dev-kilt-relik-windowed-dpr.jsonl\n```\n\n##### AIDA\n\nSince the AIDA dataset is not publicly available, we provide only its annotations in the ReLiK format as an example.\nAssuming you have the full AIDA dataset in the `data/aida` folder, you can convert it to the ReLiK format and then create the windows with the following script:\n\n```console\npython scripts/data/create_windows.py \\\n data/aida/processed/aida-train-relik.jsonl \\\n data/aida/processed/aida-train-relik-windowed.jsonl\n```\n\nand then convert it to the DPR 
format:\n\n```console\npython scripts/data/convert_to_dpr.py \\\n data/aida/processed/aida-train-relik-windowed.jsonl \\\n data/aida/processed/aida-train-relik-windowed-dpr.jsonl\n```\n\n### Training the model\n\nThe `relik retriever train` command can be used to train the retriever. It requires the following arguments:\n\n- `config_path`: The path to the configuration file.\n- `overrides`: A list of overrides to the configuration file, in the format `key=value`.\n\nExamples of configuration files can be found in the `relik/retriever/conf` folder.\n\n#### Entity Linking\n\n<!-- You can find an example in `relik/retriever/conf/finetune_iterable_in_batch.yaml`. -->\nThe configuration files in `relik/retriever/conf` are `pretrain_iterable_in_batch.yaml` and `finetune_iterable_in_batch.yaml`, which we used to pre-train and fine-tune the retriever, respectively.\n\nFor instance, to train the retriever on the AIDA dataset, you can run the following command:\n\n```console\nrelik retriever train relik/retriever/conf/finetune_iterable_in_batch.yaml \\\n model.language_model=intfloat/e5-base-v2 \\\n train_dataset_path=data/aida/processed/aida-train-relik-windowed-dpr.jsonl \\\n val_dataset_path=data/aida/processed/aida-dev-relik-windowed-dpr.jsonl \\\n test_dataset_path=data/aida/processed/aida-test-relik-windowed-dpr.jsonl\n```\n\n#### Relation Extraction\n\nTODO\n\n### Inference\n\nBy passing `train.only_test=True` to the `relik retriever train` command, you can skip the training and only evaluate the model.\nIt also needs the path to the PyTorch Lightning checkpoint and the dataset to evaluate on.\n\n```console\nrelik retriever train relik/retriever/conf/finetune_iterable_in_batch.yaml \\\n train.only_test=True \\\n test_dataset_path=data/aida/processed/aida-test-relik-windowed-dpr.jsonl \\\n model.checkpoint_path=path/to/checkpoint\n```\n\nThe retriever encoder can be saved from the checkpoint with the following snippet:\n\n```python\nfrom 
relik.retriever.lightning_modules.pl_modules import GoldenRetrieverPLModule\n\ncheckpoint_path = \"path/to/checkpoint\"\nretriever_folder = \"path/to/retriever\"\n\n# If you want to push the model to the Hugging Face Hub set push_to_hub=True\npush_to_hub = False\n# If you want to push the model to the Hugging Face Hub set the repo_id\nrepo_id = \"sapienzanlp/relik-retriever-e5-base-v2-aida-blink-encoder\"\n\npl_module = GoldenRetrieverPLModule.load_from_checkpoint(checkpoint_path)\npl_module.model.save_pretrained(retriever_folder, push_to_hub=push_to_hub, repo_id=repo_id)\n```\n\nwith `push_to_hub=True` the model will be pushed to the \ud83e\udd17 Hugging Face Hub with `repo_id` the repository id where the model will be pushed.\n\nThe retriever needs a index to search for the documents. The index can be created using `relik retriever build-index` command\n\n```bash\nrelik retriever build-index --help \n\n Usage: relik retriever build-index [OPTIONS] QUESTION_ENCODER_NAME_OR_PATH \n DOCUMENT_PATH OUTPUT_FOLDER \n\u256d\u2500 Arguments \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256e\n\u2502 * question_encoder_name_or_path TEXT [default: None] [required] \u2502\n\u2502 * document_path TEXT [default: None] [required] \u2502\n\u2502 * output_folder TEXT 
[default: None] [required] \u2502\n\u2570\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256f\n\u256d\u2500 Options \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256e\n\u2502 --document-file-type TEXT [default: jsonl] \u2502\n\u2502 --passage-encoder-name-or-path TEXT [default: None] \u2502\n\u2502 --indexer-class TEXT [default: relik.retriever.indexers.inmemory.InMemoryDocumentIndex] \u2502\n\u2502 --batch-size INTEGER [default: 512] \u2502\n\u2502 --num-workers 
INTEGER [default: 4] \u2502\n\u2502 --passage-max-length INTEGER [default: 64] \u2502\n\u2502 --device TEXT [default: cuda] \u2502\n\u2502 --index-device TEXT [default: cpu] \u2502\n\u2502 --precision TEXT [default: fp32] \u2502\n\u2502 --push-to-hub --no-push-to-hub [default: no-push-to-hub] \u2502\n\u2502 --repo-id TEXT [default: None] \u2502\n\u2502 --help Show this message and exit. \u2502\n\u2570\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256f\n```\n\nWith the encoder and the index, the retriever can be loaded from a repo id or a local path:\n\n```python\nfrom relik.retriever import GoldenRetriever\n\nencoder_name_or_path = \"sapienzanlp/relik-retriever-e5-base-v2-aida-blink-encoder\"\nindex_name_or_path = \"sapienzanlp/relik-retriever-e5-base-v2-aida-blink-wikipedia-index\"\n\nretriever = GoldenRetriever(\n question_encoder=encoder_name_or_path,\n document_index=index_name_or_path,\n device=\"cuda\", # or \"cpu\"\n precision=\"16\", # or \"32\", \"bf16\"\n index_device=\"cuda\", # or \"cpu\"\n index_precision=\"16\", # or \"32\", \"bf16\"\n)\n```\n\nand then it can be used to retrieve documents:\n\n```python\nretriever.retrieve(\"Michael Jordan was one of the best players in 
the NBA.\", top_k=100)\n```\n\n## Reader\n\nThe reader is responsible for extracting entities and relations from documents from a set of candidates (e.g., possible entities or relations).\nThe reader can be trained for span extraction or triplet extraction.\nThe `RelikReaderForSpanExtraction` is used for span extraction, i.e. Entity Linking , while the `RelikReaderForTripletExtraction` is used for triplet extraction, i.e. Relation Extraction.\n\n### Data Preparation\n\nThe reader requires the windowized dataset we created in section [Before You Start](#before-you-start) augmented with the candidate from the retriever.\nThe candidate can be added to the dataset using the `relik retriever add-candidates` command.\n\n```bash\nrelik retriever add-candidates --help\n\n Usage: relik retriever add-candidates [OPTIONS] QUESTION_ENCODER_NAME_OR_PATH \n DOCUMENT_NAME_OR_PATH INPUT_PATH \n OUTPUT_PATH\n\n\u256d\u2500 Arguments \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256e\n\u2502 * question_encoder_name_or_path TEXT [default: None] [required] \u2502\n\u2502 * document_name_or_path TEXT [default: None] [required] \u2502\n\u2502 * input_path TEXT [default: None] [required] \u2502\n\u2502 * output_path TEXT [default: None] [required] 
\u2502\n\u2570\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256f\n\u256d\u2500 Options \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256e\n\u2502 --passage-encoder-name-or-path TEXT [default: None] \u2502\n\u2502 --top-k INTEGER [default: 100] \u2502\n\u2502 --batch-size INTEGER [default: 128] \u2502\n\u2502 --num-workers INTEGER [default: 4] \u2502\n\u2502 --device TEXT [default: cuda] \u2502\n\u2502 --index-device TEXT [default: cpu] \u2502\n\u2502 --precision TEXT [default: fp32] \u2502\n\u2502 --use-doc-topics --no-use-doc-topics [default: no-use-doc-topics] \u2502\n\u2502 --help Show this message and exit. 
\u2502\n\u2570\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256f\n```\n\n### Training the model\n\nSimilar to the retriever, the `relik reader train` command can be used to train the retriever. It requires the following arguments:\n\n- `config_path`: The path to the configuration file.\n- `overrides`: A list of overrides to the configuration file, in the format `key=value`.\n\nExamples of configuration files can be found in the `relik/reader/conf` folder.\n\n#### Entity Linking\n\nThe configuration files in `relik/reader/conf` are `large.yaml` and `base.yaml`, which we used to train the large and base reader, respectively.\nFor instance, to train the large reader on the AIDA dataset run:\n\n```console\nrelik reader train relik/reader/conf/large.yaml \\\n train_dataset_path=data/aida/processed/aida-train-relik-windowed-candidates.jsonl \\\n val_dataset_path=data/aida/processed/aida-dev-relik-windowed-candidates.jsonl \\\n test_dataset_path=data/aida/processed/aida-dev-relik-windowed-candidates.jsonl\n```\n\n#### Relation Extraction\n\nTODO\n\n### Inference\n\nThe reader can be saved from the checkpoint with the following command:\n\n```python\nfrom relik.reader.lightning_modules.relik_reader_pl_module import RelikReaderPLModule\n\ncheckpoint_path = \"path/to/checkpoint\"\nreader_folder = \"path/to/reader\"\n\n# If you want to push the model to the Hugging Face Hub set 
push_to_hub=True\npush_to_hub = False\n# If you want to push the model to the Hugging Face Hub set the repo_id\nrepo_id = \"sapienzanlp/relik-reader-deberta-v3-large-aida\"\n\npl_model = RelikReaderPLModule.load_from_checkpoint(checkpoint_path)\npl_model.relik_reader_core_model.save_pretrained(reader_folder, push_to_hub=push_to_hub, repo_id=repo_id)\n```\n\nWith `push_to_hub=True`, the model is pushed to the \ud83e\udd17 Hugging Face Hub, in the repository given by `repo_id`.\n\nThe reader can be loaded from a repo id or a local path:\n\n```python\nfrom relik.reader import RelikReaderForSpanExtraction, RelikReaderForTripletExtraction\n\n# the reader for span extraction\nreader_span = RelikReaderForSpanExtraction(\n \"sapienzanlp/relik-reader-deberta-v3-large-aida\"\n)\n# the reader for triplet extraction\nreader_triplets = RelikReaderForTripletExtraction(\n \"sapienzanlp/relik-reader-deberta-v3-large-nyt\"\n)\n```\n\nand used to extract entities and relations:\n\n```python\n# an example of candidates for the reader\ncandidates = [\"Michael Jordan\", \"NBA\", \"Chicago Bulls\", \"Basketball\", \"United States\"]\nreader_span.read(\"Michael Jordan was one of the best players in the NBA.\", candidates=candidates)\n```\n\n## Performance\n\n### Entity Linking\n\nWe evaluate the performance of ReLiK on Entity Linking using [GERBIL](http://gerbil-qa.aksw.org/gerbil/). The following table shows the results (InKB Micro F1) of ReLiK Large and Base:\n\n| Model | AIDA-B | MSNBC | Der | K50 | R128 | R500 | OKE15 | OKE16 | AVG | AVG-OOD | Speed (ms) |\n|-------|--------|-------|-----|-----|------|------|-------|-------|-----|---------|------------|\n| Base | 85.25 | 72.27 | 55.59 | 68.02 | 48.13 | 41.61 | 62.53 | 52.25 | 60.71 | 57.2 | n |\n| Large | 86.37 | 75.04 | 56.25 | 72.8 | 51.67 | 42.95 | 65.12 | 57.21 | 63.43 | 60.15 | n |\n\nTo evaluate ReLiK we use the following steps:\n\n1. 
Download the GERBIL server from [here](LINK).\n\n2. Start the GERBIL server:\n\n```console\ncd gerbil && ./start.sh\n```\n\n3. Start the following services:\n\n```console\ncd gerbil-SpotWrapNifWS4Test && mvn clean -Dmaven.tomcat.port=1235 tomcat:run\n```\n\n4. Start the ReLiK server for GERBIL, providing the model name as an argument (e.g. `sapienzanlp/relik-entity-linking-large`):\n\n```console\npython relik/reader/utils/gerbil_server.py --relik-model-name sapienzanlp/relik-entity-linking-large\n```\n\n5. Open the URL [http://localhost:1234/gerbil](http://localhost:1234/gerbil) and:\n - Select A2KB as the experiment type\n - Select \"Ma - strong annotation match\"\n - In the Name field, write the name you want to give to the experiment\n - In the URI field, write: [http://localhost:1235/gerbil-spotWrapNifWS4Test/myalgorithm](http://localhost:1235/gerbil-spotWrapNifWS4Test/myalgorithm)\n - Select the datasets (we use AIDA-B, MSNBC, Der, K50, R128, R500, OKE15, OKE16)\n - Finally, run the experiment\n\n### Relation Extraction\n\n- TODO\n\n## Cite this work\n\nIf you use any part of this work, please consider citing the paper as follows:\n\n```bibtex\n@inproceedings{orlando-etal-2024-relik,\n title = \"Retrieve, Read and LinK: Fast and Accurate Entity Linking and Relation Extraction on an Academic Budget\",\n author = \"Orlando, Riccardo and Huguet Cabot, Pere-Llu{\\'\\i}s and Barba, Edoardo and Navigli, Roberto\",\n booktitle = \"Findings of the Association for Computational Linguistics: ACL 2024\",\n month = aug,\n year = \"2024\",\n address = \"Bangkok, Thailand\",\n publisher = \"Association for Computational Linguistics\",\n}\n```\n\n## License\n\nTODO\n<!-- The data is licensed under [Creative Commons Attribution-ShareAlike 4.0](https://creativecommons.org/licenses/by-sa/4.0/). -->\n",
"bugtrack_url": null,
"license": "Apache",
"summary": "Fast and Accurate Entity Linking and Relation Extraction on an Academic Budget",
"version": "1.0.0.dev1",
"project_urls": {
"Homepage": "https://github.com/SapienzaNLP/relik"
},
"split_keywords": [
"nlp",
"sapienza",
"sapienzanlp",
"deep",
"learning",
"transformer",
"pytorch",
"retriever",
"entity",
"linking",
"relation",
"extraction",
"reader",
"budget"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "bfe9642003bb324d581b3df4b47bcad4a1d7d120d786f2a6ed580c184d876542",
"md5": "7c4aa9fcdbfa2e2892514d47302b53f4",
"sha256": "85da7b3e65cebd88fb4e826936e80839dc290abf9a4bda5583e57cc2c9564607"
},
"downloads": -1,
"filename": "relik-1.0.0.dev1-py3-none-any.whl",
"has_sig": false,
"md5_digest": "7c4aa9fcdbfa2e2892514d47302b53f4",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.10",
"size": 193911,
"upload_time": "2024-06-14T09:23:28",
"upload_time_iso_8601": "2024-06-14T09:23:28.129173Z",
"url": "https://files.pythonhosted.org/packages/bf/e9/642003bb324d581b3df4b47bcad4a1d7d120d786f2a6ed580c184d876542/relik-1.0.0.dev1-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "c18b1acff39c64808379f0be67a418ec701a07a0b8aefe9a7eaad2d66d29291b",
"md5": "9e95f9e740f0ac100261d76a74bc3bd4",
"sha256": "d2c46381a1077325d89eafbcb28e19ce1c98e1dd08a3e823f102c4549f66fa96"
},
"downloads": -1,
"filename": "relik-1.0.0.dev1.tar.gz",
"has_sig": false,
"md5_digest": "9e95f9e740f0ac100261d76a74bc3bd4",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.10",
"size": 165546,
"upload_time": "2024-06-14T09:23:31",
"upload_time_iso_8601": "2024-06-14T09:23:31.224033Z",
"url": "https://files.pythonhosted.org/packages/c1/8b/1acff39c64808379f0be67a418ec701a07a0b8aefe9a7eaad2d66d29291b/relik-1.0.0.dev1.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-06-14 09:23:31",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "SapienzaNLP",
"github_project": "relik",
"github_not_found": true,
"lcname": "relik"
}