freshstack

Name: freshstack
Version: 0.0.6
Summary: A framework to generate realistic IR & RAG Benchmarks.
Requires Python: >=3.9
License: Apache License 2.0
Keywords: benchmarking, evaluation framework, information retrieval, retrieval-augmented generation, transformer networks, large language models, PyTorch, RAG, IR, NLP, deep learning
Upload time: 2025-11-05 18:35:35

<div align="center">
  <h1>FreshStack</h1>
  <p>A Repository for Constructing Realistic IR/RAG Benchmarks</p>
</div>

<p align="center"><img width=500 src="https://raw.githubusercontent.com/fresh-stack/freshstack/main/images/freshstack-logo-cropped.png"/></p>

<h4 align="center">
    <p>
        <a href="https://arxiv.org/abs/2504.13128">Paper</a> |
        <a href="https://fresh-stack.github.io/">Website</a> |
        <a href="https://fresh-stack.github.io/#leaderboard">Leaderboard</a> |
        <a href="https://huggingface.co/freshstack">Dataset</a> |
        <a href="https://colab.research.google.com/drive/1eaB_5cF62kW4E3g0bCXw9_4nC4xt-_tX?usp=sharing">Google Colab</a>
    </p>
</h4>

**FreshStack** is a modular framework to **automatically build realistic IR/RAG benchmarks** from niche, community-sourced technical content (e.g., Stack Overflow + GitHub repositories). It supports:

* Scraping **human-asked queries** from Stack Overflow.
* Gathering **up-to-date corpora** by chunking any GitHub repository (a rough, library-agnostic sketch of the idea is shown below).
* **Retrieval evaluation** of any dense/multi-vector model on the FreshStack benchmark.
* Datasets released under **CC-BY-SA 4.0**; code and scripts under the **Apache 2.0 License**.
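
The chunking code itself lives under `freshstack/chunking/` and `examples/chunking/` (see the project structure below). As a rough illustration of the idea only, and **not** the freshstack API, a locally cloned repository can be split into overlapping character windows keyed by file path and offsets:

```python
# Illustrative sketch only (NOT the freshstack chunking API): split every
# readable text file of a locally cloned repository into overlapping
# character windows, keyed by relative path and character offsets.
from pathlib import Path

def chunk_repository(repo_dir: str, window: int = 1500, overlap: int = 200) -> dict:
    corpus = {}
    step = max(window - overlap, 1)
    for path in Path(repo_dir).rglob("*"):
        if not path.is_file():
            continue
        try:
            text = path.read_text(encoding="utf-8")
        except (UnicodeDecodeError, OSError):
            continue  # skip binary or unreadable files
        rel = path.relative_to(repo_dir).as_posix()
        for start in range(0, max(len(text), 1), step):
            chunk = text[start:start + window]
            if chunk.strip():
                corpus[f"{rel}_{start}_{start + len(chunk)}"] = {"title": rel, "text": chunk}
    return corpus
```

The resulting dict has the same nested shape as the BEIR-style corpus used in the quickstart below.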

## Installation

Install via pip, tested with Python 3.10+:

```bash
pip install freshstack
```

If you want to build from source, use:

```bash
git clone https://github.com/fresh-stack/freshstack.git
cd freshstack
pip install -e .
```

## 🚀 Quickstart: Load the FreshStack Dataset
```python
from freshstack.datasets import DataLoader

dataloader = DataLoader(
    queries_repo="freshstack/queries-oct-2024", 
    corpus_repo="freshstack/corpus-oct-2024",
    topic="langchain") # or "yolo", "angular", "laravel" or "godot"

# Loads the corpus, queries and nuggets in the BEIR format
corpus, queries, nuggets = dataloader.load(split="test")

# Loads the nugget-level qrels, the query-level qrels, and the query-to-nugget mapping
qrels_nuggets, qrels_query, query_to_nuggets = dataloader.load_qrels(split="test")
```
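
As a quick sanity check (assuming the BEIR convention of plain nested dictionaries, which is what the loader advertises), you can inspect what came back:

```python
# Sanity check: corpus, queries and nuggets are plain dicts in the BEIR style,
# and the qrels are nested dicts (query/nugget id -> {doc_id: relevance}).
print(f"{len(corpus)} documents, {len(queries)} queries, {len(nuggets)} nuggets")

qid, query_text = next(iter(queries.items()))
print("example query:", qid, "->", query_text[:80])

doc_id, doc = next(iter(corpus.items()))
print("example document:", doc_id, "->", str(doc)[:80])

# query_to_nuggets maps a query id to its associated nugget ids (assumed shape)
print("nuggets linked to the example query:", query_to_nuggets.get(qid))
```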

## 🚀 Quickstart: Model Evaluation

### 1. Evaluate only the retrieved results
```python
# Your runfile can be stored as a .txt file in the following whitespace-separated format: [qid, Q0, doc_id, rank, score, run_name], e.g.,
# 76185522 Q0 angular/adev/src/content/tutorials/learn-angular/steps/14-routerLink/answer/src/app/app.component.ts_0_368 0 0.7353782057762146 your_model_name

from freshstack import util
from freshstack.retrieval.evaluation import EvaluateRetrieval

# retrieval_results: dict[str, dict[str, float]] mapping qid -> {doc_id: score}
retrieval_results = util.load_runfile("<path_to_your_runfile>")
evaluator = EvaluateRetrieval(k_values=[10, 20, 50])
alpha_ndcg, coverage, recall = evaluator.evaluate(
    qrels_nuggets=qrels_nuggets,
    query_to_nuggets=query_to_nuggets,
    qrels_query=qrels_query,
    results=retrieval_results,
)
```
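
The exact return types of `evaluator.evaluate` are not shown above; assuming each returned value is a flat dict of metric name to score (as the variable names suggest), a generic print-out could look like this:

```python
# Hypothetical pretty-print, assuming each returned object is a flat dict
# mapping a metric name to a float score; falls back to printing the raw value.
for label, scores in (("alpha-nDCG", alpha_ndcg), ("Coverage", coverage), ("Recall", recall)):
    if isinstance(scores, dict):
        for metric, value in sorted(scores.items()):
            print(f"{label:>10}  {metric}: {value:.3f}")
    else:
        print(label, scores)
```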

### 2. Evaluate any dense embedding model (e.g., Qwen3-Embedding-0.6B) using BEIR
> Make sure you install the latest BEIR release from PyPI: `pip install beir`

```python
from beir.retrieval import models
from beir.retrieval.evaluation import EvaluateRetrieval as BEIREval
from beir.retrieval.search.dense import DenseRetrievalExactSearch as DRES
from freshstack.retrieval.evaluation import EvaluateRetrieval

# Custom query prompt for evaluating the Qwen3-0.6B model on Freshstack.
query_prompt = "Instruct: Given a technical question, retrieve relevant code snippets or technical documentation that best answer the question\nQuery: "

model = DRES(models.SentenceBERT(
    "Qwen/Qwen3-Embedding-0.6B",
    max_length=2048, # Important: keep max_length at least 2048 tokens for both queries and passages.
    prompts={"query": query_prompt, "passage": ""},
    model_kwargs={
        "attn_implementation": "flash_attention_2", 
        "device_map": "auto", 
        "torch_dtype": "bfloat16"
    },
    tokenizer_kwargs={"padding_side": "left"},
), batch_size=32)

retriever = BEIREval(model, score_function="cos_sim")
retrieval_results = retriever.retrieve(corpus=corpus, queries=queries)

# Evaluate and compute retrieval score once you have results
evaluator = EvaluateRetrieval(k_values=[10, 20, 50])
alpha_ndcg, coverage, recall = evaluator.evaluate(
    qrels_nuggets=qrels_nuggets,
    query_to_nuggets=query_to_nuggets,
    qrels_query=qrels_query,
    results=retrieval_results,
)
```
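
To share or re-evaluate these results later with approach 1, you can dump them to a runfile in the same whitespace-separated format described above (the file name and run name below are placeholder choices):

```python
# Persist the dense results as a TREC-style runfile: qid Q0 doc_id rank score run_name
with open("qwen3-0.6b.run.txt", "w") as f:
    for qid, doc_scores in retrieval_results.items():
        ranked = sorted(doc_scores.items(), key=lambda item: item[1], reverse=True)
        for rank, (doc_id, score) in enumerate(ranked):
            f.write(f"{qid} Q0 {doc_id} {rank} {score} qwen3-0.6b\n")
```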

### 3. Evaluate any multi-vector model (e.g., ColBERT) using PyLate
> Make sure you install the latest PyLate release from PyPI: `pip install pylate`

```python
from pylate import indexes, models, retrieve
from freshstack.retrieval.evaluation import EvaluateRetrieval

# Step 1: Load the ColBERT model
model = models.ColBERT(
    model_name_or_path="lightonai/GTE-ModernColBERT-v1",
    query_length=2048, document_length=2048
)

# Step 2: Initialize the Voyager index (or a PLAID index)
index = indexes.Voyager(
    index_folder="./langchain_index",
    index_name="index",
    override=False,  # Set to True to overwrite an existing index
)

# Step 3: Encode the documents
documents_ids = list(corpus.keys())
documents_embeddings = model.encode(
    [doc["text"] for doc in corpus.values()],
    batch_size=32,
    is_query=False,  # Ensure that it is set to False to indicate that these are documents, not queries
)

# Step 4: Add the document embeddings to the index
index.add_documents(
    documents_ids=documents_ids,
    documents_embeddings=documents_embeddings,
)

# Step 5: Compute query embeddings
query_ids = list(queries.keys())
queries_embeddings = model.encode(
    list(queries.values()),
    batch_size=32,
    is_query=True,  # Set to True to indicate that these are queries, not documents
)

# Step 6: Initialize the ColBERT retriever with the Voyager index & retrieve documents
retriever = retrieve.ColBERT(index=index)
scores = retriever.retrieve(
    queries_embeddings=queries_embeddings,
    k=50,  # Retrieve top-k results based on the maximum k value specified
    batch_size=1,  # We have kept a batch size of 1 to avoid memory issues.
    device="cpu",  # Use CPU for inference, change to "cuda" if you have a GPU available.
)

# Step 7: Prepare the results in the required BEIR format
retrieval_results = {}
for query_id, doc_scores in zip(query_ids, scores):
    retrieval_results[query_id] = {}
    for doc_id, score in doc_scores:
        retrieval_results[query_id][doc_id] = score

# Step 8: Evaluate and compute retrieval score once you have results
evaluator = EvaluateRetrieval(k_values=[10, 20, 50])
alpha_ndcg, coverage, recall = evaluator.evaluate(
    qrels_nuggets=qrels_nuggets,
    query_to_nuggets=query_to_nuggets,
    qrels_query=qrels_query,
    results=retrieval_results,
)
```
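
Since the multi-vector retrieval step is the slow part, it can be convenient (purely optional) to cache the results dict as JSON before experimenting with the evaluation:

```python
import json

# Cache the retrieval results so evaluation can be re-run without re-retrieving.
# default=float handles scores that may come back as numpy floating types.
with open("colbert_langchain_results.json", "w") as f:
    json.dump(retrieval_results, f, default=float)

# Later:
# with open("colbert_langchain_results.json") as f:
#     retrieval_results = json.load(f)
```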
---

## 📚 Raw FreshStack Datasets (Oct 2024)

The raw FreshStack datasets can be downloaded from Hugging Face:

* langchain, yolo, laravel, angular, godot (via [huggingface.co](https://huggingface.co/datasets/freshstack/queries-oct-2024))

```python
from datasets import load_dataset
queries = load_dataset("freshstack/queries-oct-2024", name="yolo")
corpus = load_dataset("freshstack/corpus-oct-2024", name="yolo")
```
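
Loaded this way (without a `split` argument), each object is a Hugging Face `DatasetDict`; a quick look at the splits and columns (names vary by release, so none are assumed here):

```python
# Print the available splits, row counts and column names for each dataset.
for split, ds in queries.items():
    print("queries:", split, ds.num_rows, ds.column_names)
for split, ds in corpus.items():
    print("corpus:", split, ds.num_rows, ds.column_names)
```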

---

## 🧭 Project Structure

```
freshstack/
├─ examples/            # contains examples
│   ├─ chunking/        # examples for github repo chunking
│   └─ evaluation/      # examples for model eval on freshstack
├─ freshstack/          # core logic modules
│   ├─ retrieval/       # code for retrieval evaluation
│   ├─ datasets/        # code for freshstack dataloader
│   └─ chunking/        # code for github repo chunking
└─ pyproject.toml
```

---
## FreshStack Leaderboard

The up-to-date leaderboard for FreshStack (version: oct-2024) is available here: [https://fresh-stack.github.io/#leaderboard](https://fresh-stack.github.io/#leaderboard).

> NOTE: Below is a snapshot of the FreshStack leaderboard as of June 12, 2025. Metric abbreviations: α@10 = alpha-nDCG@10, C@20 = Coverage@20, R@50 = Recall@50.

| Model Name                           | Size | Date       | AVERAGE α@10 | AVERAGE C@20 | AVERAGE R@50 | LANGCHAIN α@10 | LANGCHAIN C@20 | LANGCHAIN R@50 | YOLO α@10 | YOLO C@20 | YOLO R@50 | LARAVEL α@10 | LARAVEL C@20 | LARAVEL R@50 | ANGULAR α@10 | ANGULAR C@20 | ANGULAR R@50 | GODOT α@10 | GODOT C@20 | GODOT R@50 |
| ------------------------------------ | ---- | ---------- | ------------ | ------------- | ------------- | -------------- | --------------- | --------------- | --------- | ---------- | ---------- | ------------ | ------------- | ------------- | ------------ | ------------- | ------------- | ---------- | ----------- | ----------- |
| Oracle: Fusion (BM25; ...) (Nuggets) | -    | 2024‑11‑01 | 0.541        | 0.868         | 0.755         | 0.519          | 0.881           | 0.655           | 0.601     | 0.876      | 0.825      | 0.566        | 0.888         | 0.818         | 0.544        | 0.881         | 0.756         | 0.476      | 0.815       | 0.719       |
| Oracle: BM25 (Nuggets)               | -    | 2024‑11‑01 | 0.488        | 0.768         | 0.556         | 0.467          | 0.739           | 0.446           | 0.519     | 0.796      | 0.657      | 0.540        | 0.840         | 0.654         | 0.485        | 0.787         | 0.536         | 0.428      | 0.680       | 0.489       |
| Oracle: Voyage Large 2 (Nuggets)     | -    | 2024‑11‑01 | 0.404        | 0.769         | 0.586         | 0.419          | 0.763           | 0.508           | 0.430     | 0.845      | 0.675      | 0.409        | 0.791         | 0.624         | 0.406        | 0.733         | 0.533         | 0.353      | 0.715       | 0.590       |
| Oracle: BGE (Gemma-2) (Nuggets)      | 9B   | 2024‑11‑01 | 0.389        | 0.735         | 0.547         | 0.308          | 0.667           | 0.405           | 0.461     | 0.784      | 0.572      | 0.448        | 0.806         | 0.666         | 0.393        | 0.755         | 0.536         | 0.335      | 0.664       | 0.555       |
| Qwen3‑8B (Emb)                    | 8B   | 2025‑06‑05 | 0.365        | 0.689         | 0.525         | 0.331          | 0.694           | 0.423           | 0.393     | 0.728      | 0.567      | 0.421        | 0.748         | 0.615         | 0.373        | 0.700         | 0.502         | 0.307      | 0.576       | 0.521       |
| Qwen3‑4B (Emb)                    | 4B   | 2025‑06‑05 | 0.347        | 0.656         | 0.490         | 0.320          | 0.675           | 0.415           | 0.404     | 0.744      | 0.550      | 0.402        | 0.748         | 0.604         | 0.304        | 0.618         | 0.442         | 0.303      | 0.496       | 0.440       |
| Fusion (BM25; BGE; E5; Voyage)       | -    | 2024‑11‑01 | 0.343        | 0.669         | 0.539         | 0.337          | 0.700           | 0.477           | 0.304     | 0.627      | 0.534      | 0.425        | 0.748         | 0.646         | 0.385        | 0.719         | 0.532         | 0.265      | 0.550       | 0.505       |
| Oracle: E5 (Mistral-7B) (Nuggets)    | 7B   | 2024‑11‑01 | 0.337        | 0.664         | 0.496         | 0.323          | 0.684           | 0.432           | 0.437     | 0.737      | 0.554      | 0.287        | 0.631         | 0.532         | 0.346        | 0.670         | 0.470         | 0.292      | 0.596       | 0.494       |
| Stella‑1.5B v5                       | 1.5B | 2025‑01‑01 | 0.317        | 0.615         | 0.479         | 0.315          | 0.660           | 0.388           | 0.334     | 0.624      | 0.559      | 0.370        | 0.681         | 0.590         | 0.330        | 0.630         | 0.414         | 0.237      | 0.481       | 0.443       |
| Voyage Large 2                       | -    | 2024‑11‑01 | 0.289        | 0.589         | 0.438         | 0.246          | 0.528           | 0.308           | 0.270     | 0.570      | 0.453      | 0.345        | 0.701         | 0.543         | 0.304        | 0.625         | 0.427         | 0.282      | 0.522       | 0.458       |
| Stella‑400M v5                       | 400M | 2025‑01‑01 | 0.276        | 0.578         | 0.422         | 0.285          | 0.608           | 0.356           | 0.241     | 0.538      | 0.447      | 0.320        | 0.648         | 0.534         | 0.288        | 0.619         | 0.359         | 0.244      | 0.476       | 0.412       |
| BGE (Gemma-2)                        | 9B   | 2024‑11‑01 | 0.269        | 0.569         | 0.427         | 0.216          | 0.548           | 0.337           | 0.258     | 0.547      | 0.430      | 0.348        | 0.699         | 0.574         | 0.323        | 0.571         | 0.378         | 0.200      | 0.479       | 0.419       |
| Qwen3‑0.6B (Emb)                  | 596M | 2025‑06‑05 | 0.262        | 0.543         | 0.394         | 0.259          | 0.588           | 0.369           | 0.260     | 0.504      | 0.383      | 0.288        | 0.593         | 0.463         | 0.253        | 0.535         | 0.356         | 0.249      | 0.495       | 0.400       |
| E5 (Mistral-7B)                      | 7B   | 2024‑11‑01 | 0.255        | 0.553         | 0.397         | 0.304          | 0.654           | 0.393           | 0.243     | 0.552      | 0.394      | 0.250        | 0.565         | 0.470         | 0.262        | 0.548         | 0.368         | 0.217      | 0.444       | 0.359       |
| GTE (large) v1.5                     | 434M | 2024‑01‑09 | 0.226        | 0.494         | 0.318         | 0.206          | 0.470           | 0.252           | 0.195     | 0.445      | 0.271      | 0.318        | 0.626         | 0.482         | 0.284        | 0.578         | 0.343         | 0.127      | 0.348       | 0.240       |
| BM25                                 | -    | 2024‑11‑01 | 0.218        | 0.448         | 0.316         | 0.230          | 0.475           | 0.261           | 0.137     | 0.342      | 0.337      | 0.319        | 0.602         | 0.441         | 0.259        | 0.551         | 0.340         | 0.144      | 0.268       | 0.200       |
| Nomic Embed (Code)                | 7B   | 2025‑03‑24 | 0.218        | 0.488         | 0.348         | 0.224          | 0.518           | 0.292           | 0.227     | 0.539      | 0.390      | 0.222        | 0.532         | 0.407         | 0.237        | 0.511         | 0.356         | 0.178      | 0.341       | 0.295       |
| CodeRankEmbed                        | 137M | 2024‑11‑03 | 0.104        | 0.279         | 0.162         | 0.099          | 0.271           | 0.128           | 0.075     | 0.215      | 0.128      | 0.108        | 0.324         | 0.225         | 0.146        | 0.363         | 0.167         | 0.091      | 0.224       | 0.160       |


### 👥 Contribute your model to the leaderboard

1. Fork the repo (https://github.com/fresh-stack/fresh-stack.github.io)
2. Create a new branch (`git checkout -b <your_branch>`)
3. Add your model scores to `leaderboard_data.json` in the following format (a small sketch for computing the `average` entry appears at the end of this section):
```json
{
    "leaderboardData": [
        {
            "info": {
                "name": "BM25",
                "size": "-",
                "type": "open_source",
                "date": "2024-11-01",
                "link": "https://github.com/castorini/pyserini"
            },
            "datasets": {
                "langchain": {"alpha_ndcg_10": 0.230, "coverage_20": 0.475, "recall_50": 0.261},
                "yolo":      {"alpha_ndcg_10": 0.137, "coverage_20": 0.342, "recall_50": 0.337},
                "laravel":   {"alpha_ndcg_10": 0.319, "coverage_20": 0.602, "recall_50": 0.441},
                "angular":   {"alpha_ndcg_10": 0.259, "coverage_20": 0.551, "recall_50": 0.340},
                "godot":     {"alpha_ndcg_10": 0.144, "coverage_20": 0.268, "recall_50": 0.200},
                "average":   {"alpha_ndcg_10": 0.218, "coverage_20": 0.448, "recall_50": 0.316},
            }
        },
        ...
    ]
}
```

4. Submit a pull request, ideally including:

   * The updated `leaderboard_data.json`
   * Pipeline invocation script (for reference)
   * Brief evaluation summary (for reference)

All contributions are welcome, especially new domain expansions, evaluation improvements, and retrieval baselines!
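
For convenience, the `average` entry is the arithmetic mean of the five per-topic scores. The small sketch below (plain Python, not part of the freshstack package) reproduces the BM25 row used in the example above:

```python
import json

# Per-topic scores for the model being submitted (BM25 values shown as an example).
topics = {
    "langchain": {"alpha_ndcg_10": 0.230, "coverage_20": 0.475, "recall_50": 0.261},
    "yolo":      {"alpha_ndcg_10": 0.137, "coverage_20": 0.342, "recall_50": 0.337},
    "laravel":   {"alpha_ndcg_10": 0.319, "coverage_20": 0.602, "recall_50": 0.441},
    "angular":   {"alpha_ndcg_10": 0.259, "coverage_20": 0.551, "recall_50": 0.340},
    "godot":     {"alpha_ndcg_10": 0.144, "coverage_20": 0.268, "recall_50": 0.200},
}

# The "average" entry is the arithmetic mean across topics, rounded to 3 decimals.
average = {
    metric: round(sum(scores[metric] for scores in topics.values()) / len(topics), 3)
    for metric in ("alpha_ndcg_10", "coverage_20", "recall_50")
}
print(json.dumps({**topics, "average": average}, indent=4))
```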

---

## 📄 Citation

If you use FreshStack in your work, please cite:

```bib
@article{thakur-freshstack:2025,
  author       = {Nandan Thakur and
                  Jimmy Lin and
                  Sam Havens and
                  Michael Carbin and
                  Omar Khattab and
                  Andrew Drozdov},
  title        = {FreshStack: Building Realistic Benchmarks for Evaluating Retrieval
                  on Technical Documents},
  journal      = {CoRR},
  volume       = {abs/2504.13128},
  year         = {2025},
  url          = {https://doi.org/10.48550/arXiv.2504.13128},
  doi          = {10.48550/ARXIV.2504.13128},
  eprinttype    = {arXiv},
  eprint       = {2504.13128},
  timestamp    = {Thu, 22 May 2025 21:00:35 +0200},
  biburl       = {https://dblp.org/rec/journals/corr/abs-2504-13128.bib},
  bibsource    = {dblp computer science bibliography, https://dblp.org}
}
```
The main contributors of this repository are:
- [Nandan Thakur](https://github.com/thakur-nandan), Personal Website: [thakur-nandan.github.io](https://thakur-nandan.github.io)

Contact person: Nandan Thakur, [nandant@gmail.com](mailto:nandant@gmail.com)

Don't hesitate to send us an e-mail or open an issue if something is broken (and it shouldn't be) or if you have further questions.

> This repository contains experimental software and is published for the sole purpose of giving additional background details on the respective publication.

---

## Collaboration

This project is developed in collaboration with the following organizations:

| Organization           | Logo                                                                                           |
| ---------------------- | ---------------------------------------------------------------------------------------------- |
| University of Waterloo | <img src="https://raw.githubusercontent.com/fresh-stack/freshstack/main/images/uwaterloo.png" alt="University of Waterloo logo" /> |
| Databricks             | <img src="https://raw.githubusercontent.com/fresh-stack/freshstack/main/images/databricks-logo.png" alt="Databricks logo" />     |

            
