<div align="center">
<h1>SparsEmbed - Splade</h1>
<p>Neural search</p>
</div>
This repository presents an unofficial replication of the Splade and SparseEmbed models, which are state-of-the-art models in information retrieval:
- *[SPLADE: Sparse Lexical and Expansion Model for First Stage Ranking](https://arxiv.org/abs/2107.05720)* authored by Thibault Formal, Benjamin Piwowarski, Stéphane Clinchant, SIGIR 2021.
- *[SPLADE v2: Sparse Lexical and Expansion Model for Information Retrieval](https://arxiv.org/abs/2109.10086)* authored by Thibault Formal, Carlos Lassance, Benjamin Piwowarski, Stéphane Clinchant, SIGIR 2022.
- *[SparseEmbed: Learning Sparse Lexical Representations with Contextual Embeddings for Retrieval](https://research.google/pubs/pub52289/)* authored by Weize Kong, Jeffrey M. Dudek, Cheng Li, Mingyang Zhang, and Mike Bendersky, SIGIR 2023.
**Note:** This project is currently a work in progress. The Splade model is ready to use, but SparseEmbed is still under development. 🔨🧹
## Installation
We can install sparsembed using:
```
pip install sparsembed
```
If we plan to evaluate our model during training, we should also install the evaluation dependencies:
```
pip install "sparsembed[eval]"
```
## Retriever
### Splade
We can initialize a Splade retriever directly from the `splade_v2_max` checkpoint available on Hugging Face. Retrievers are based on PyTorch sparse matrices that are stored in memory and accelerated on GPU. We can reduce the number of activated tokens via the `k_tokens` parameter to reduce the memory usage of those sparse matrices.
```python
from sparsembed import model, retrieve
from transformers import AutoModelForMaskedLM, AutoTokenizer
device = "cuda" # cpu
batch_size = 10
# List documents to index:
documents = [
{'id': 0,
'title': 'Paris',
'url': 'https://en.wikipedia.org/wiki/Paris',
'text': 'Paris is the capital and most populous city of France.'},
{'id': 1,
'title': 'Paris',
'url': 'https://en.wikipedia.org/wiki/Paris',
'text': "Since the 17th century, Paris has been one of Europe's major centres of science, and arts."},
{'id': 2,
'title': 'Paris',
'url': 'https://en.wikipedia.org/wiki/Paris',
'text': 'The City of Paris is the centre and seat of government of the region and province of Île-de-France.'
}]
model = model.Splade(
model=AutoModelForMaskedLM.from_pretrained("naver/splade_v2_max").to(device),
tokenizer=AutoTokenizer.from_pretrained("naver/splade_v2_max"),
device=device
)
retriever = retrieve.SpladeRetriever(
key="id", # Key identifier of each document.
on=["title", "text"], # Fields to search.
model=model # Splade model.
)
retriever = retriever.add(
documents=documents,
batch_size=batch_size,
k_tokens=256, # Number of activated tokens.
)
retriever(
["paris", "Toulouse"], # Queries
k_tokens=20, # Maximum number of activated tokens.
k=100, # Number of documents to retrieve.
batch_size=batch_size
)
```
```python
[[{'id': 0, 'similarity': 11.481657981872559},
{'id': 2, 'similarity': 11.294965744018555},
{'id': 1, 'similarity': 10.059721946716309}],
[{'id': 0, 'similarity': 0.7379149198532104},
{'id': 2, 'similarity': 0.6973429918289185},
{'id': 1, 'similarity': 0.5428210496902466}]]
```
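The retriever returns one ranked list per query, and each match carries the document key together with its similarity score. Continuing the example above, a small, hypothetical helper (not part of the library) to map those keys back to the indexed documents could look like this:
```python
# Build a lookup from document key to document, reusing the
# `documents`, `retriever` and `batch_size` defined above.
documents_index = {document["id"]: document for document in documents}

matches = retriever(["paris"], k_tokens=20, k=100, batch_size=batch_size)

# Ranked documents for the first (and only) query.
top_documents = [documents_index[match["id"]] for match in matches[0]]
```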
### SparsEmbed
We can also initialize a retriever dedicated to the SparseEmbed model. The checkpoint `naver/splade_v2_max` is not a trained SparseEmbed model, so we should train one before using it as a retriever.
```python
from sparsembed import model, retrieve
from transformers import AutoModelForMaskedLM, AutoTokenizer
device = "cuda" # cpu
batch_size = 10
# List documents to index:
documents = [
{'id': 0,
'title': 'Paris',
'url': 'https://en.wikipedia.org/wiki/Paris',
'text': 'Paris is the capital and most populous city of France.'},
{'id': 1,
'title': 'Paris',
'url': 'https://en.wikipedia.org/wiki/Paris',
'text': "Since the 17th century, Paris has been one of Europe's major centres of science, and arts."},
{'id': 2,
'title': 'Paris',
'url': 'https://en.wikipedia.org/wiki/Paris',
'text': 'The City of Paris is the centre and seat of government of the region and province of Île-de-France.'
}]
model = model.SparsEmbed(
model=AutoModelForMaskedLM.from_pretrained("naver/splade_v2_max").to(device),
tokenizer=AutoTokenizer.from_pretrained("naver/splade_v2_max"),
device=device
)
retriever = retrieve.SparsEmbedRetriever(
key="id", # Key identifier of each document.
on=["title", "text"], # Fields to search.
model=model # SparsEmbed model.
)
retriever = retriever.add(
documents=documents,
batch_size=batch_size,
k_tokens=256, # Number of activated tokens.
)
retriever(
["paris", "Toulouse"], # Queries
k_tokens=20, # Maximum number of activated tokens.
k=100, # Number of documents to retrieve.
batch_size=batch_size
)
```
## Training
Let's fine-tune Splade and SparsEmbed.
### Dataset
Your training dataset must consist of triples `(anchor, positive, negative)`, where the anchor is a query, the positive is a document relevant to the anchor, and the negative is a document that is not relevant to the anchor.
```python
X = [
("anchor 1", "positive 1", "negative 1"),
("anchor 2", "positive 2", "negative 2"),
("anchor 3", "positive 3", "negative 3"),
]
```
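In practice, such triples are usually mined from a retrieval dataset rather than written by hand. A minimal, hypothetical sketch that pairs each query with one relevant document and a randomly sampled negative (the data below are placeholders, not part of the library) might look like this:
```python
import random

# Hypothetical raw data: (query, relevant document) pairs.
pairs = [
    ("capital of France", "Paris is the capital and most populous city of France."),
    ("largest ocean", "The Pacific Ocean is the largest ocean on Earth."),
    ("python creator", "Python was created by Guido van Rossum."),
]

corpus = [positive for _, positive in pairs]

X = []
for anchor, positive in pairs:
    # Sample a random document other than the positive as the negative.
    negative = random.choice([document for document in corpus if document != positive])
    X.append((anchor, positive, negative))
```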
### Models
Both the Splade and SparseEmbed models can be initialized from pretrained `AutoModelForMaskedLM` checkpoints.
```python
from transformers import AutoModelForMaskedLM, AutoTokenizer
model = model.Splade(
model=AutoModelForMaskedLM.from_pretrained("naver/splade_v2_max").to(device),
tokenizer=AutoTokenizer.from_pretrained("naver/splade_v2_max"),
device=device,
)
```
```python
from transformers import AutoModelForMaskedLM, AutoTokenizer
model = model.SparsEmbed(
model=AutoModelForMaskedLM.from_pretrained("naver/splade_v2_max").to(device),
tokenizer=AutoTokenizer.from_pretrained("naver/splade_v2_max"),
embedding_size=64,
k_tokens=96,
device=device,
)
```
### Splade
The following PyTorch code snippet illustrates the training loop to fine-tune Splade:
```python
from transformers import AutoModelForMaskedLM, AutoTokenizer, optimization
from sparsembed import model, utils, train, retrieve, losses
import torch
device = "cuda" # cpu or cuda
batch_size = 8
epochs = 1 # Number of times the model will train over the whole dataset.
model = model.Splade(
model=AutoModelForMaskedLM.from_pretrained("naver/splade_v2_max").to(device),
tokenizer=AutoTokenizer.from_pretrained("naver/splade_v2_max"),
device=device
)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
scheduler = optimization.get_linear_schedule_with_warmup(
optimizer=optimizer,
num_warmup_steps=6000,
num_training_steps=4_000_000,
)
flops_scheduler = losses.FlopsScheduler(weight=1e-4, steps=50_000)
X = [
("anchor 1", "positive 1", "negative 1"),
("anchor 2", "positive 2", "negative 2"),
("anchor 3", "positive 3", "negative 3"),
]
for anchor, positive, negative in utils.iter(
X,
epochs=epochs,
batch_size=batch_size,
shuffle=True
):
loss = train.train_splade(
model=model,
optimizer=optimizer,
anchor=anchor,
positive=positive,
negative=negative,
flops_loss_weight=flops_scheduler.get(),
)
scheduler.step()
flops_scheduler.step()
# Save the model.
model.save_pretrained("checkpoint")
# Beir benchmark for evaluation.
documents, queries, qrels = utils.load_beir("scifact", split="test")
retriever = retrieve.SpladeRetriever(
key="id",
on=["title", "text"],
model=model
)
retriever = retriever.add(
documents=documents,
batch_size=batch_size,
k_tokens=96,
)
utils.evaluate(
retriever=retriever,
batch_size=batch_size,
qrels=qrels,
queries=queries,
k=100,
k_tokens=96,
metrics=["map", "ndcg@10", "recall@10", "hits@10"]
)
```
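The `flops_scheduler` above schedules the weight of the FLOPS regularization term, which pushes the learned representations toward sparsity. For reference, a minimal standalone sketch of the FLOPS regularizer used by SPLADE (not the library's internal implementation) is:
```python
import torch


def flops_regularizer(activations: torch.Tensor) -> torch.Tensor:
    # Squared mean activation of each vocabulary term, summed over the
    # vocabulary: sum_j (mean_i w_ij) ** 2.
    return (activations.mean(dim=0) ** 2).sum()


# Toy illustration on random non-negative activations:
# a batch of 8 documents over a BERT-sized vocabulary.
activations = torch.rand(8, 30522)
penalty = flops_regularizer(activations)
```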
After saving the model with `save_pretrained`, we can load the checkpoint with:
```python
from sparsembed import model
device = "cuda"
model = model.Splade(
model_name_or_path="checkpoint",
device=device,
)
```
### SparsEmbed
The following PyTorch code snippet illustrates the training loop to fine-tune SparseEmbed:
```python
from transformers import AutoModelForMaskedLM, AutoTokenizer, optimization
from sparsembed import model, utils, train, retrieve, losses
import torch
device = "cuda" # cpu or cuda
batch_size = 8
epochs = 1 # Number of times the model will train over the whole dataset.
model = model.SparsEmbed(
model=AutoModelForMaskedLM.from_pretrained("distilbert-base-uncased").to(device),
tokenizer=AutoTokenizer.from_pretrained("distilbert-base-uncased"),
device=device
)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
scheduler = optimization.get_linear_schedule_with_warmup(
optimizer=optimizer,
num_warmup_steps=6000, # Number of warmup steps.
num_training_steps=4_000_000 # Total number of training steps.
)
flops_scheduler = losses.FlopsScheduler(weight=1e-4, steps=50_000)
X = [
("anchor 1", "positive 1", "negative 1"),
("anchor 2", "positive 2", "negative 2"),
("anchor 3", "positive 3", "negative 3"),
]
for anchor, positive, negative in utils.iter(
X,
epochs=epochs,
batch_size=batch_size,
shuffle=True
):
loss = train.train_sparsembed(
model=model,
optimizer=optimizer,
k_tokens=96,
anchor=anchor,
positive=positive,
negative=negative,
flops_loss_weight=flops_scheduler.get(),
sparse_loss_weight=0.1,
)
scheduler.step()
flops_scheduler.step()
# Save the model.
model.save_pretrained("checkpoint")
# Beir benchmark for evaluation.
documents, queries, qrels = utils.load_beir("scifact", split="test")
retriever = retrieve.SparsEmbedRetriever(
key="id",
on=["title", "text"],
model=model
)
retriever = retriever.add(
documents=documents,
k_tokens=96,
batch_size=batch_size
)
utils.evaluate(
retriever=retriever,
batch_size=batch_size,
qrels=qrels,
queries=queries,
k=100,
k_tokens=96,
metrics=["map", "ndcg@10", "recall@10", "hits@10"]
)
```
After saving the model with `save_pretrained`, we can load the checkpoint with:
```python
from sparsembed import model
device = "cuda"
model = model.SparsEmbed(
model_name_or_path="checkpoint",
device=device,
)
```
## Utils
We can get the activated tokens / embeddings of a sentence with:
```python
model.encode(["deep learning, information retrieval, sparse models"])
```
We can evaluate similarities between pairs of queries and documents without the use of a retriever:
```python
model.scores(
queries=["Query A", "Query B"],
documents=["Document A", "Document B"],
batch_size=32,
)
```
```python
tensor([5.1449, 9.1194])
```
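`model.scores` returns one score per `(query, document)` pair. To rank several documents against a single query without building a retriever, we can repeat the query; a small illustrative example (the texts below are placeholders) is:
```python
query = "capital of France"
candidates = [
    "Paris is the capital and most populous city of France.",
    "Toulouse is a city in the south of France.",
]

scores = model.scores(
    queries=[query] * len(candidates),
    documents=candidates,
    batch_size=32,
)

# Sort candidates by decreasing score.
ranking = sorted(zip(candidates, scores.tolist()), key=lambda pair: pair[1], reverse=True)
```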
We can visualize activated tokens:
```python
model.decode(**model.encode(["deep learning, information retrieval, sparse models"]))
```
```python
['deep sparse model retrieval information models depth fuzzy learning dense poor memory recall processing reading lacy include remember knowledge training heavy retrieve guide vague type small learn data']
```
## Benchmarks
Work in progress.