<div align="center">
<h1>FreshStack</h1>
<p>A Repository for Constructing Realistic IR/RAG Benchmarks</p>
</div>
<p align="center"><img width=500 src="https://raw.githubusercontent.com/fresh-stack/freshstack/main/images/freshstack-logo-cropped.png"/></p>
<h4 align="center">
<p>
<a href="https://arxiv.org/abs/2504.13128">Paper</a> |
<a href="https://fresh-stack.github.io/">Website</a> |
<a href="https://fresh-stack.github.io/#leaderboard">Leaderboard</a> |
<a href="https://huggingface.co/freshstack">Dataset</a> |
<a href="https://colab.research.google.com/drive/1eaB_5cF62kW4E3g0bCXw9_4nC4xt-_tX?usp=sharing">Google Colab</a>
</p>
</h4>
**FreshStack** is a modular framework to **automatically build realistic IR/RAG benchmarks** from niche, community-sourced technical content (e.g., Stack Overflow questions + GitHub repositories). It supports:

* Scraping **human-asked queries** from Stack Overflow.
* Gathering **up-to-date corpora** by chunking any GitHub repository.
* **Retrieval evaluation** of any dense/multi-vector model on the FreshStack benchmark.
* Datasets released under **CC-BY-SA 4.0**, with code and scripts under the **Apache 2.0 license**.
## Installation
Install via pip (tested with Python 3.10+):

```bash
pip install freshstack
```
If you want to build from source, use:
```bash
git clone https://github.com/fresh-stack/freshstack.git
cd freshstack
pip install -e .
```
## 🚀 Quickstart: Load the FreshStack Dataset
```python
from freshstack.datasets import DataLoader
dataloader = DataLoader(
queries_repo="freshstack/queries-oct-2024",
corpus_repo="freshstack/corpus-oct-2024",
topic="langchain") # or "yolo", "angular", "laravel" or "godot"
# Loads the corpus, queries and nuggets in the BEIR format
corpus, queries, nuggets = dataloader.load(split="test")
# Loads the nugget-level qrels, the query-level qrels, and the query-to-nugget mapping
qrels_nuggets, qrels_query, query_to_nuggets = dataloader.load_qrels(split="test")
```
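If you are new to the BEIR format, the sketch below shows roughly what to expect from the loaded objects. The `title`/`text` fields follow the standard BEIR corpus convention; the exact nugget and qrel fields are an assumption here, so inspect a few entries yourself:

```python
# Corpus and queries follow the BEIR convention:
#   corpus:  {doc_id: {"title": ..., "text": ...}}
#   queries: {query_id: query_text}
# Assumption: nuggets map nugget_id -> nugget text, qrels map an id -> {doc_id: relevance},
# and query_to_nuggets maps query_id -> [nugget_id, ...]. Print a few entries to confirm.
doc_id, doc = next(iter(corpus.items()))
print(doc_id, doc.get("text", "")[:200])

query_id, query_text = next(iter(queries.items()))
print(query_id, query_text)
print("Nuggets for this query:", query_to_nuggets.get(query_id, []))
```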
## 🚀 Quickstart: Model Evaluation
### 1. Evaluate pre-computed retrieval results (run file)
```python
# Your run file can be stored as a whitespace-separated .txt in the format: [qid, Q0, docid, rank, score, run_name], e.g.,
# 76185522 Q0 angular/adev/src/content/tutorials/learn-angular/steps/14-routerLink/answer/src/app/app.component.ts_0_368 0 0.7353782057762146 your_model_name
from freshstack import util
from freshstack.retrieval.evaluation import EvaluateRetrieval
# retrieval_results: dict[str, dict[str, float]] with qid: {doc_id: score}
retrieval_results = util.load_runfile("<path_to_your_runfile>")
evaluator = EvaluateRetrieval(k_values=[10, 20, 50])
alpha_ndcg, coverage, recall = evaluator.evaluate(
qrels_nuggets=qrels_nuggets,
query_to_nuggets=query_to_nuggets,
qrels_query=qrels_query,
results=retrieval_results,
)
```
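If you compute `retrieval_results` yourself (e.g., with the dense or multi-vector pipelines below), you can also persist them in the same run-file format so they can be reloaded with `util.load_runfile`. A minimal sketch; the `save_runfile` helper is hypothetical and not part of the package:

```python
# Hypothetical helper: write {qid: {doc_id: score}} as a whitespace-separated run file
# using the format described above: qid Q0 docid rank score run_name.
def save_runfile(results: dict[str, dict[str, float]], path: str, run_name: str = "my_model") -> None:
    with open(path, "w") as f:
        for qid, doc_scores in results.items():
            ranked = sorted(doc_scores.items(), key=lambda item: item[1], reverse=True)
            for rank, (doc_id, score) in enumerate(ranked):
                f.write(f"{qid} Q0 {doc_id} {rank} {score} {run_name}\n")

save_runfile(retrieval_results, "my_model_run.txt")
```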
### 2. Evaluate any dense embedding model (e.g., Qwen3-Embedding-0.6B) using BEIR.
> Make sure you install the latest BEIR package from PyPI: `pip install beir`
```python
from beir.retrieval import models
from beir.retrieval.evaluation import EvaluateRetrieval as BEIREval
from beir.retrieval.search.dense import DenseRetrievalExactSearch as DRES
from freshstack.retrieval.evaluation import EvaluateRetrieval
# Custom query prompt for evaluating the Qwen3-0.6B model on Freshstack.
query_prompt = "Instruct: Given a technical question, retrieve relevant code snippets or technical documentation that best answer the question\nQuery: "
model = DRES(models.SentenceBERT(
"Qwen/Qwen3-Embedding-0.6B",
    max_length=2048,  # Important: keep max_length at least 2048 tokens for both queries and passages.
prompts={"query": query_prompt, "passage": ""},
model_kwargs={
"attn_implementation": "flash_attention_2",
"device_map": "auto",
"torch_dtype": "bfloat16"
},
tokenizer_kwargs={"padding_side": "left"},
), batch_size=32)
retriever = BEIREval(model, score_function="cos_sim")
retrieval_results = retriever.retrieve(corpus=corpus, queries=queries)
# Evaluate and compute retrieval score once you have results
evaluator = EvaluateRetrieval(k_values=[10, 20, 50])
alpha_ndcg, coverage, recall = evaluator.evaluate(
qrels_nuggets=qrels_nuggets,
query_to_nuggets=query_to_nuggets,
qrels_query=qrels_query,
results=retrieval_results,
)
```
### 3. Evaluate any multi-vector model (e.g., ColBERT) using PyLate.
> Make sure you install the latest PyLate package from PyPI: `pip install pylate`.
```python
from pylate import indexes, models, retrieve
from freshstack.retrieval.evaluation import EvaluateRetrieval
# Step 1: Load the ColBERT model
model = models.ColBERT(
model_name_or_path="lightonai/GTE-ModernColBERT-v1",
query_length=2048, document_length=2048
)
# Step 2: Initialize the Voyager index (or a PLAID index)
index = indexes.Voyager(
    index_folder="./langchain_index",
    index_name="index",
    override=False,  # Set to True to overwrite an existing index, if any.
)
# Step 3: Encode the documents
documents_ids = list(corpus.keys())
documents_embeddings = model.encode(
[doc["text"] for doc in corpus.values()],
batch_size=32,
is_query=False, # Ensure that it is set to False to indicate that these are documents, not queries
)
# Step 4: Add the document embeddings to the index
index.add_documents(
documents_ids=documents_ids,
documents_embeddings=documents_embeddings,
)
# Step 5: Compute query embeddings
query_ids = list(queries.keys())
queries_embeddings = model.encode(
list(queries.values()),
batch_size=32,
    is_query=True,  # Ensure that it is set to True to indicate that these are queries, not documents
)
# Step 6: Initialize the ColBERT retriever with the Voyager index & retrieve documents
retriever = retrieve.ColBERT(index=index)
scores = retriever.retrieve(
queries_embeddings=queries_embeddings,
k=50, # Retrieve top-k results based on the maximum k value specified
batch_size=1, # We have kept a batch size of 1 to avoid memory issues.
device="cpu", # Use CPU for inference, change to "cuda" if you have a GPU available.
)
# Step 7: Prepare the results in the required BEIR format.
# PyLate returns, for each query, a list of {"id": ..., "score": ...} dicts.
retrieval_results = {}
for query_id, doc_scores in zip(query_ids, scores):
    retrieval_results[query_id] = {doc["id"]: doc["score"] for doc in doc_scores}
# Step 8: Evaluate and compute retrieval score once you have results
evaluator = EvaluateRetrieval(k_values=[10, 20, 50])
alpha_ndcg, coverage, recall = evaluator.evaluate(
qrels_nuggets=qrels_nuggets,
query_to_nuggets=query_to_nuggets,
qrels_query=qrels_query,
results=retrieval_results,
)
```
---
## 📚 Raw FreshStack Datasets (Oct 2024)

The raw FreshStack datasets can be downloaded from the Hugging Face Hub:
* langchain, yolo, laravel, angular, godot (via [huggingface.co](https://huggingface.co/datasets/freshstack/queries-oct-2024))
```python
from datasets import load_dataset
queries = load_dataset("freshstack/queries-oct-2024", name="yolo")
corpus = load_dataset("freshstack/corpus-oct-2024", name="yolo")
```
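The objects returned by `load_dataset` are standard Hugging Face `DatasetDict`s. A quick way to inspect the available splits and columns (the split and column names are not guaranteed here, so check what actually prints):

```python
# Inspect the splits and features of the raw datasets.
print(queries)   # DatasetDict: shows splits, column names, and row counts
print(corpus)

# Peek at the first record of the first available split.
first_split = next(iter(queries.keys()))
print(queries[first_split][0])
```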
---
## 🧭 Project Structure
```
freshstack/
├─ examples/            # usage examples
│  ├─ chunking/         # examples for GitHub repo chunking
│  └─ evaluation/       # examples for model evaluation on FreshStack
├─ freshstack/          # core logic modules
│  ├─ retrieval/        # code for retrieval evaluation
│  ├─ datasets/         # code for the FreshStack dataloader
│  └─ chunking/         # code for GitHub repo chunking
└─ pyproject.toml
```
---
## FreshStack Leaderboard
The up-to-date leaderboard for FreshStack (October 2024 version) is available here: [https://fresh-stack.github.io/#leaderboard](https://fresh-stack.github.io/#leaderboard).

> NOTE: Below is a snapshot of the FreshStack leaderboard as of June 12, 2025.
| Model Name | Size | Date | AVERAGE α@10 | AVERAGE C\@20 | AVERAGE R\@50 | LANGCHAIN α@10 | LANGCHAIN C\@20 | LANGCHAIN R\@50 | YOLO α@10 | YOLO C\@20 | YOLO R\@50 | LARAVEL α@10 | LARAVEL C\@20 | LARAVEL R\@50 | ANGULAR α@10 | ANGULAR C\@20 | ANGULAR R\@50 | GODOT α@10 | GODOT C\@20 | GODOT R\@50 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Oracle: Fusion (BM25; ...) (Nuggets) | - | 2024-11-01 | 0.541 | 0.868 | 0.755 | 0.519 | 0.881 | 0.655 | 0.601 | 0.876 | 0.825 | 0.566 | 0.888 | 0.818 | 0.544 | 0.881 | 0.756 | 0.476 | 0.815 | 0.719 |
| Oracle: BM25 (Nuggets) | - | 2024-11-01 | 0.488 | 0.768 | 0.556 | 0.467 | 0.739 | 0.446 | 0.519 | 0.796 | 0.657 | 0.540 | 0.840 | 0.654 | 0.485 | 0.787 | 0.536 | 0.428 | 0.680 | 0.489 |
| Oracle: Voyage Large 2 (Nuggets) | - | 2024-11-01 | 0.404 | 0.769 | 0.586 | 0.419 | 0.763 | 0.508 | 0.430 | 0.845 | 0.675 | 0.409 | 0.791 | 0.624 | 0.406 | 0.733 | 0.533 | 0.353 | 0.715 | 0.590 |
| Oracle: BGE (Gemma-2) (Nuggets) | 9B | 2024-11-01 | 0.389 | 0.735 | 0.547 | 0.308 | 0.667 | 0.405 | 0.461 | 0.784 | 0.572 | 0.448 | 0.806 | 0.666 | 0.393 | 0.755 | 0.536 | 0.335 | 0.664 | 0.555 |
| Qwen3-8B (Emb) | 8B | 2025-06-05 | 0.365 | 0.689 | 0.525 | 0.331 | 0.694 | 0.423 | 0.393 | 0.728 | 0.567 | 0.421 | 0.748 | 0.615 | 0.373 | 0.700 | 0.502 | 0.307 | 0.576 | 0.521 |
| Qwen3-4B (Emb) | 4B | 2025-06-05 | 0.347 | 0.656 | 0.490 | 0.320 | 0.675 | 0.415 | 0.404 | 0.744 | 0.550 | 0.402 | 0.748 | 0.604 | 0.304 | 0.618 | 0.442 | 0.303 | 0.496 | 0.440 |
| Fusion (BM25; BGE; E5; Voyage) | - | 2024-11-01 | 0.343 | 0.669 | 0.539 | 0.337 | 0.700 | 0.477 | 0.304 | 0.627 | 0.534 | 0.425 | 0.748 | 0.646 | 0.385 | 0.719 | 0.532 | 0.265 | 0.550 | 0.505 |
| Oracle: E5 (Mistral-7B) (Nuggets) | 7B | 2024-11-01 | 0.337 | 0.664 | 0.496 | 0.323 | 0.684 | 0.432 | 0.437 | 0.737 | 0.554 | 0.287 | 0.631 | 0.532 | 0.346 | 0.670 | 0.470 | 0.292 | 0.596 | 0.494 |
| Stella-1.5B v5 | 1.5B | 2025-01-01 | 0.317 | 0.615 | 0.479 | 0.315 | 0.660 | 0.388 | 0.334 | 0.624 | 0.559 | 0.370 | 0.681 | 0.590 | 0.330 | 0.630 | 0.414 | 0.237 | 0.481 | 0.443 |
| Voyage Large 2 | - | 2024-11-01 | 0.289 | 0.589 | 0.438 | 0.246 | 0.528 | 0.308 | 0.270 | 0.570 | 0.453 | 0.345 | 0.701 | 0.543 | 0.304 | 0.625 | 0.427 | 0.282 | 0.522 | 0.458 |
| Stella-400M v5 | 400M | 2025-01-01 | 0.276 | 0.578 | 0.422 | 0.285 | 0.608 | 0.356 | 0.241 | 0.538 | 0.447 | 0.320 | 0.648 | 0.534 | 0.288 | 0.619 | 0.359 | 0.244 | 0.476 | 0.412 |
| BGE (Gemma-2) | 9B | 2024-11-01 | 0.269 | 0.569 | 0.427 | 0.216 | 0.548 | 0.337 | 0.258 | 0.547 | 0.430 | 0.348 | 0.699 | 0.574 | 0.323 | 0.571 | 0.378 | 0.200 | 0.479 | 0.419 |
| Qwen3-0.6B (Emb) | 596M | 2025-06-05 | 0.262 | 0.543 | 0.394 | 0.259 | 0.588 | 0.369 | 0.260 | 0.504 | 0.383 | 0.288 | 0.593 | 0.463 | 0.253 | 0.535 | 0.356 | 0.249 | 0.495 | 0.400 |
| E5 (Mistral-7B) | 7B | 2024-11-01 | 0.255 | 0.553 | 0.397 | 0.304 | 0.654 | 0.393 | 0.243 | 0.552 | 0.394 | 0.250 | 0.565 | 0.470 | 0.262 | 0.548 | 0.368 | 0.217 | 0.444 | 0.359 |
| GTE (large) v1.5 | 434M | 2024-01-09 | 0.226 | 0.494 | 0.318 | 0.206 | 0.470 | 0.252 | 0.195 | 0.445 | 0.271 | 0.318 | 0.626 | 0.482 | 0.284 | 0.578 | 0.343 | 0.127 | 0.348 | 0.240 |
| BM25 | - | 2024-11-01 | 0.218 | 0.448 | 0.316 | 0.230 | 0.475 | 0.261 | 0.137 | 0.342 | 0.337 | 0.319 | 0.602 | 0.441 | 0.259 | 0.551 | 0.340 | 0.144 | 0.268 | 0.200 |
| Nomic Embed (Code) | 7B | 2025-03-24 | 0.218 | 0.488 | 0.348 | 0.224 | 0.518 | 0.292 | 0.227 | 0.539 | 0.390 | 0.222 | 0.532 | 0.407 | 0.237 | 0.511 | 0.356 | 0.178 | 0.341 | 0.295 |
| CodeRankEmbed | 137M | 2024-11-03 | 0.104 | 0.279 | 0.162 | 0.099 | 0.271 | 0.128 | 0.075 | 0.215 | 0.128 | 0.108 | 0.324 | 0.225 | 0.146 | 0.363 | 0.167 | 0.091 | 0.224 | 0.160 |
### 👥 Contribute your model to the leaderboard
1. Fork the repo (https://github.com/fresh-stack/fresh-stack.github.io)
2. Create a new branch (`git checkout -b <your_branch>`)
3. Add your model scores to `leaderboard_data.json` in the following format (a sketch for computing the `average` entry follows this list):
```json
{
"leaderboardData": [
{
"info": {
"name": "BM25",
"size": "-",
"type": "open_source",
"date": "2024-11-01",
"link": "https://github.com/castorini/pyserini"
},
"datasets": {
"langchain": {"alpha_ndcg_10": 0.230, "coverage_20": 0.475, "recall_50": 0.261},
"yolo": {"alpha_ndcg_10": 0.137, "coverage_20": 0.342, "recall_50": 0.337},
"laravel": {"alpha_ndcg_10": 0.319, "coverage_20": 0.602, "recall_50": 0.441},
"angular": {"alpha_ndcg_10": 0.259, "coverage_20": 0.551, "recall_50": 0.340},
"godot": {"alpha_ndcg_10": 0.144, "coverage_20": 0.268, "recall_50": 0.200},
"average": {"alpha_ndcg_10": 0.218, "coverage_20": 0.448, "recall_50": 0.316},
}
},
...
]
}
```
4. Submit a pull request, ideally including:
   * The updated `leaderboard_data.json`
   * The script used to run your evaluation pipeline (for reference)
   * A brief evaluation summary (for reference)
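The `average` entry appears to be the unweighted mean of the five per-topic scores (it matches the BM25 row shown above). A small sketch to compute it for your own entry, assuming the same dict layout as in the JSON example:

```python
# Compute the "average" entry as the unweighted mean over the five topics.
topics = ["langchain", "yolo", "laravel", "angular", "godot"]
datasets = {
    "langchain": {"alpha_ndcg_10": 0.230, "coverage_20": 0.475, "recall_50": 0.261},
    "yolo":      {"alpha_ndcg_10": 0.137, "coverage_20": 0.342, "recall_50": 0.337},
    "laravel":   {"alpha_ndcg_10": 0.319, "coverage_20": 0.602, "recall_50": 0.441},
    "angular":   {"alpha_ndcg_10": 0.259, "coverage_20": 0.551, "recall_50": 0.340},
    "godot":     {"alpha_ndcg_10": 0.144, "coverage_20": 0.268, "recall_50": 0.200},
}
datasets["average"] = {
    metric: round(sum(datasets[t][metric] for t in topics) / len(topics), 3)
    for metric in ("alpha_ndcg_10", "coverage_20", "recall_50")
}
print(datasets["average"])  # -> {'alpha_ndcg_10': 0.218, 'coverage_20': 0.448, 'recall_50': 0.316}
```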
All contributions are welcome, especially new domain expansions, evaluation improvements, and retrieval baselines!
---
## 📄 Citation
If you use FreshStack in your work, please cite:
```bibtex
@article{thakur-freshstack:2025,
author = {Nandan Thakur and
Jimmy Lin and
Sam Havens and
Michael Carbin and
Omar Khattab and
Andrew Drozdov},
title = {FreshStack: Building Realistic Benchmarks for Evaluating Retrieval
on Technical Documents},
journal = {CoRR},
volume = {abs/2504.13128},
year = {2025},
url = {https://doi.org/10.48550/arXiv.2504.13128},
doi = {10.48550/ARXIV.2504.13128},
eprinttype = {arXiv},
eprint = {2504.13128},
timestamp = {Thu, 22 May 2025 21:00:35 +0200},
biburl = {https://dblp.org/rec/journals/corr/abs-2504-13128.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}
```
The main contributors of this repository are:
- [Nandan Thakur](https://github.com/thakur-nandan), Personal Website: [thakur-nandan.github.io](https://thakur-nandan.github.io)
Contact person: Nandan Thakur, [nandant@gmail.com](mailto:nandant@gmail.com)
Don't hesitate to send us an e-mail or open an issue if something is broken (and it shouldn't be) or if you have further questions.
> This repository contains experimental software and is published for the sole purpose of giving additional background details on the respective publication.
---
## Collaboration
This project is developed in collaboration with the following organizations:
| Organization | Logo |
| ---------------------- | ---------------------------------------------------------------------------------------------- |
| University of Waterloo | <img src="https://raw.githubusercontent.com/fresh-stack/freshstack/main/images/uwaterloo.png" alt="University of Waterloo logo" /> |
| Databricks | <img src="https://raw.githubusercontent.com/fresh-stack/freshstack/main/images/databricks-logo.png" alt="Databricks logo" /> |