<div align="center">
<h1>Neural-Tree</h1>
<p>Neural Search</p>
</div>
<p align="center"><img width=500 src="docs/img/neural_tree.png"/></p>
<div align="center">
<!-- Documentation -->
<a href="https://raphaelsty.github.io/neural-tree/"><img src="https://img.shields.io/website?label=Documentation&style=flat-square&url=https%3A%2F%2Fraphaelsty.github.io/neural-tree/%2F" alt="documentation"></a>
<!-- License -->
<a href="https://opensource.org/licenses/MIT"><img src="https://img.shields.io/badge/License-MIT-blue.svg?style=flat-square" alt="license"></a>
</div>
<p></p>
Are tree-based indexes the counterpart of standard ANN algorithms for token-level embedding IR models? Neural-Tree replicates the SIGIR 2023 publication [Constructing Tree-based Index for Efficient and Effective Dense Retrieval](https://dl.acm.org/doi/10.1145/3539618.3591651) in order to accelerate ColBERT. As in the original paper, Neural-Tree is also compatible with Sentence Transformers and TfIdf models.
Neural-Tree creates a tree using hierarchical clustering of documents and then learns embeddings in each node of the tree using paired queries and documents. Alternatively, an existing tree structure in JSON format can be provided to build the index.
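The clustering step can be pictured with a small, self-contained sketch. It is not Neural-Tree's implementation: the tiny k-means routine, the `build_tree` and `collect_leaves` helpers, and the 2-D vectors are all assumptions made for illustration. The `leaf_size` and `branching` arguments play roles similar to the `leaf_balance_factor` and `branch_balance_factor` arguments used in the Quick Start.

```python
import random


def kmeans(points, k, iterations=10, seed=0):
    """Tiny k-means on lists of floats; returns one cluster index per point."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assign each point to its nearest centroid (squared Euclidean distance).
        for i, point in enumerate(points):
            assignment[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(point, centroids[c])),
            )
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [points[i] for i in range(len(points)) if assignment[i] == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return assignment


def build_tree(doc_ids, vectors, leaf_size, branching):
    """Split documents recursively until a leaf holds at most `leaf_size` ids."""
    if len(doc_ids) <= leaf_size:
        return doc_ids  # a leaf is a plain list of document ids
    k = min(branching, len(doc_ids))
    assignment = kmeans([vectors[d] for d in doc_ids], k)
    groups = [[d for d, a in zip(doc_ids, assignment) if a == c] for c in range(k)]
    groups = [group for group in groups if group]
    if len(groups) == 1:
        return doc_ids  # clustering could not split this set any further
    return {i: build_tree(g, vectors, leaf_size, branching) for i, g in enumerate(groups)}


def collect_leaves(tree):
    """Return every leaf (list of ids) of the nested-dict tree."""
    if isinstance(tree, list):
        return [tree]
    return [leaf for child in tree.values() for leaf in collect_leaves(child)]


# Hypothetical 2-D embeddings for six documents.
vectors = {
    0: [0.0, 0.1], 1: [0.1, 0.0], 2: [5.0, 5.1],
    3: [5.1, 5.0], 4: [0.2, 0.2], 5: [4.9, 5.2],
}
tree = build_tree(list(vectors), vectors, leaf_size=3, branching=2)
print(collect_leaves(tree))
```

Internal nodes are dictionaries and leaves are lists of document ids, so every document ends up in exactly one leaf before any duplication step.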
Neural-Tree optimizes the index to preserve the retrieval quality of the original model while significantly speeding up search. Neural-Tree does not modify the underlying model, so it is advisable to build the tree from a model that has already been fine-tuned. For the same reason, training the index is relatively quick.
## Installation
We can install neural-tree using:
```
pip install neural-tree
```
If we plan to evaluate our model while training, we can install the evaluation extras:
```
pip install "neural-tree[eval]"
```
## Documentation
The complete documentation is available [here](https://raphaelsty.github.io/neural-tree/).
## Quick Start
The following code shows how to train a tree index. Let's start by creating a fictional dataset:
```python
documents = [
    {"id": 0, "content": "paris"},
    {"id": 1, "content": "london"},
    {"id": 2, "content": "berlin"},
    {"id": 3, "content": "rome"},
    {"id": 4, "content": "bordeaux"},
    {"id": 5, "content": "milan"},
]

train_queries = [
    "paris is the capital of france",
    "london is the capital of england",
    "berlin is the capital of germany",
    "rome is the capital of italy",
]

train_documents = [
    {"id": 0, "content": "paris"},
    {"id": 1, "content": "london"},
    {"id": 2, "content": "berlin"},
    {"id": 3, "content": "rome"},
]

test_queries = [
    "bordeaux is the capital of france",
    "milan is the capital of italy",
]
```
Let's train the index using the `documents`, `train_queries` and `train_documents` we have gathered.
```python
import torch
from neural_cherche import models
from neural_tree import clustering, trees, utils

model = models.ColBERT(
    model_name_or_path="raphaelsty/neural-cherche-colbert",
    device="cuda" if torch.cuda.is_available() else "cpu",
)

tree = trees.ColBERT(
    key="id",
    on=["content"],
    model=model,
    documents=documents,
    leaf_balance_factor=100,  # Number of documents per leaf
    branch_balance_factor=5,  # Number of children per node
    n_jobs=-1,  # Set to 1 with Google Colab
)

optimizer = torch.optim.AdamW(lr=3e-3, params=list(tree.parameters()))

# Each batch pairs a query with the document it should retrieve.
for step, batch_queries, batch_documents in utils.iter(
    queries=train_queries,
    documents=train_documents,
    shuffle=True,
    epochs=50,
    batch_size=32,
):
    loss = tree.loss(
        queries=batch_queries,
        documents=batch_documents,
    )

    loss.backward()
    optimizer.step()
    optimizer.zero_grad(set_to_none=True)
```
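To make the shape of the training loop concrete, here is a minimal sketch of a batching helper that yields `(step, batch_queries, batch_documents)` triples. It is a hypothetical stand-in named `iter_batches`, not the code behind `utils.iter`, and it assumes that the i-th query is paired with the i-th document, as in the toy dataset above.

```python
import random


def iter_batches(queries, documents, epochs, batch_size, shuffle=True, seed=42):
    """Yield (step, batch_queries, batch_documents) over aligned pairs."""
    if len(queries) != len(documents):
        raise ValueError("queries and documents must be aligned pair by pair")
    rng = random.Random(seed)
    pairs = list(zip(queries, documents))
    step = 0
    for _ in range(epochs):
        if shuffle:
            # Shuffle pairs together so each query stays with its document.
            rng.shuffle(pairs)
        for start in range(0, len(pairs), batch_size):
            batch = pairs[start : start + batch_size]
            yield step, [q for q, _ in batch], [d for _, d in batch]
            step += 1


queries = [
    "paris is the capital of france",
    "london is the capital of england",
    "berlin is the capital of germany",
    "rome is the capital of italy",
]
documents = [
    {"id": 0, "content": "paris"},
    {"id": 1, "content": "london"},
    {"id": 2, "content": "berlin"},
    {"id": 3, "content": "rome"},
]

for step, batch_queries, batch_documents in iter_batches(
    queries, documents, epochs=2, batch_size=3
):
    print(step, len(batch_queries), [d["id"] for d in batch_documents])
```

With four pairs and a batch size of three, each epoch produces one batch of three pairs and one batch of one pair.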
Let's now duplicate some documents of the tree in order to increase accuracy.
```python
documents_to_leafs = clustering.optimize_leafs(
    tree=tree,
    queries=train_queries + test_queries,
    documents=documents,
)

tree = tree.add(
    documents=documents,
    documents_to_leafs=documents_to_leafs,
)
```
We are now ready to retrieve documents:
```python
scores = tree(
    queries=["bordeaux", "milan"],
    k_leafs=2,
    k=2,
)

print(scores["documents"])
```
```python
[
    [
        {"id": 4, "similarity": 5.28, "leaf": "12"},
        {"id": 0, "similarity": 3.17, "leaf": "12"},
    ],
    [
        {"id": 5, "similarity": 5.11, "leaf": "10"},
        {"id": 2, "similarity": 3.57, "leaf": "10"},
    ],
]
```
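The `k_leafs` and `k` arguments can be read as "how many leaves to visit" and "how many documents to return". The sketch below illustrates one common way to search a tree index, a beam search over node embeddings followed by ranking the documents of the selected leaves. It is an assumption for illustration only: the `search` function, the nested-dict tree layout, and the single-vector dot-product scoring are hypothetical simplifications (ColBERT scores token-level embeddings), and Neural-Tree's actual traversal may differ.

```python
def dot(u, v):
    """Dot product between two equally sized vectors."""
    return sum(a * b for a, b in zip(u, v))


def search(query, root, doc_vectors, k_leafs, k):
    """Beam search down the tree, then rank documents of the selected leaves."""

    def node_score(item):
        return dot(query, item[1]["embedding"])

    beam, leaves = [("root", root)], []
    while beam:
        children = []
        for name, node in beam:
            if "documents" in node:
                leaves.append((name, node))  # reached a leaf
            else:
                children.extend(node["children"].items())
        # Keep only the k_leafs children that best match the query.
        beam = sorted(children, key=node_score, reverse=True)[:k_leafs]

    # Leaves may sit at different depths, so trim once more.
    leaves = sorted(leaves, key=node_score, reverse=True)[:k_leafs]

    best = {}
    for name, leaf in leaves:
        for doc_id in leaf["documents"]:
            # A duplicated document may live in several leaves; keep one entry.
            if doc_id not in best:
                similarity = dot(query, doc_vectors[doc_id])
                best[doc_id] = {"id": doc_id, "similarity": similarity, "leaf": name}
    return sorted(best.values(), key=lambda d: d["similarity"], reverse=True)[:k]


# Hypothetical toy index: the root is an internal node with two leaves.
doc_vectors = {0: [1.0, 0.0], 4: [0.9, 0.1], 3: [0.0, 1.0], 5: [0.1, 0.9]}
root = {
    "children": {
        "10": {"embedding": [1.0, 0.0], "documents": [0, 4]},
        "11": {"embedding": [0.0, 1.0], "documents": [3, 5]},
    }
}

results = search([0.8, 0.2], root, doc_vectors, k_leafs=1, k=2)
print(results)
```

Only the documents stored in the visited leaves are scored, which is where the speed-up over exhaustive scoring comes from. Increasing `k_leafs` trades speed for recall.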
## Evaluation
We can evaluate the performance of the tree using the following code:
```python
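# `datasets` is assumed to be exposed by neural_tree alongside `utils`.
from neural_tree import datasets, utils

# `tree` is assumed to have been built and trained on these documents.
documents, queries_ids, test_queries, qrels = datasets.load_beir_test(
    dataset_name="scifact",
)

candidates = tree(
    queries=test_queries,
    k_leafs=2,
    k=10,
)

scores = utils.evaluate(
    scores=candidates["documents"],
    qrels=qrels,
    queries_ids=queries_ids,
)

print(scores)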
```
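For reference, the metrics reported in the benchmarks below (`hits@k` and `ndcg@10`) can be computed by hand for a single query. This sketch uses the textbook definitions with linear gains. The `hits_at_k` and `ndcg_at_k` helpers are illustrative names, and `utils.evaluate` may differ in details such as gain formulation or averaging.

```python
import math


def hits_at_k(ranked_ids, relevant, k):
    """1.0 if at least one relevant document appears in the top k, else 0.0."""
    return float(any(doc_id in relevant for doc_id in ranked_ids[:k]))


def ndcg_at_k(ranked_ids, relevant, k):
    """NDCG with linear gains; `relevant` maps document id to relevance grade."""
    dcg = sum(
        relevant.get(doc_id, 0) / math.log2(rank + 2)
        for rank, doc_id in enumerate(ranked_ids[:k])
    )
    ideal = sorted(relevant.values(), reverse=True)[:k]
    idcg = sum(gain / math.log2(rank + 2) for rank, gain in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0


# Candidates for one query, in the format returned by the tree.
candidates = [
    {"id": 0, "similarity": 3.17, "leaf": "12"},
    {"id": 4, "similarity": 2.90, "leaf": "12"},
]
ranked_ids = [doc["id"] for doc in candidates]
relevant = {4: 1}  # hypothetical relevance judgments for this query

print(hits_at_k(ranked_ids, relevant, k=1))
print(hits_at_k(ranked_ids, relevant, k=10))
print(ndcg_at_k(ranked_ids, relevant, k=10))
```

Averaging these per-query values over all test queries gives dataset-level scores.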
## Benchmarks
<table>
<thead>
<tr>
<th colspan="2" rowspan="2"></th>
<th colspan="9">Scifact Dataset</th>
</tr>
<tr>
<th colspan="4">Vanilla</th>
<th colspan="5">Neural-Tree </th>
</tr>
</thead>
<tbody>
<tr>
<td>model</td>
<td>HuggingFace Checkpoint</td>
<td>ndcg@10</td>
<td>hits@10</td>
<td>hits@1</td>
<td>queries / second</td>
<td>ndcg@10</td>
<td>hits@10</td>
<td>hits@1</td>
<td>queries / second</td>
<td>Acceleration</td>
</tr>
<tr>
<td>TfIdf<br>Cherche</td>
<td>-</td>
<td>0.61</td>
<td>0.85</td>
<td>0.47</td>
<td>760</td>
<td>0.56</td>
<td>0.82</td>
<td>0.42</td>
<td>1080</td>
<td>+42.11%</td>
</tr>
<tr>
<td>SentenceTransformer GPU<br>Faiss.IndexFlatL2 CPU</td>
<td>sentence-transformers/all-mpnet-base-v2</td>
<td>0.66</td>
<td>0.89</td>
<td>0.53</td>
<td>475</td>
<td>0.66</td>
<td>0.88</td>
<td>0.53</td>
<td>518</td>
<td>+9.05%</td>
</tr>
<tr>
<td>ColBERT<br>Neural-Cherche GPU</td>
<td>raphaelsty/neural-cherche-colbert</td>
<td>0.70</td>
<td>0.92</td>
<td>0.58</td>
<td>3</td>
<td>0.70</td>
<td>0.91</td>
<td>0.59</td>
<td>256</td>
<td>x85</td>
</tr>
</tbody>
</table>
Note that this benchmark does not implement [ColBERTV2](https://arxiv.org/abs/2112.01488) efficient retrieval but rather compares raw ColBERT retrieval with Neural-Tree. The vanilla SentenceTransformer baseline could also be accelerated by using an optimized Faiss index instead of `IndexFlatL2`.
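The Acceleration column follows directly from the two queries-per-second columns. As a quick check, this short sketch recomputes it from the throughput figures in the table. The `acceleration` helper is an illustrative name.

```python
def acceleration(vanilla_qps, tree_qps):
    """Return the speed-up ratio and the relative gain in percent."""
    ratio = tree_qps / vanilla_qps
    return ratio, (ratio - 1) * 100


# (vanilla queries / second, Neural-Tree queries / second) from the table.
benchmarks = {
    "TfIdf": (760, 1080),
    "SentenceTransformer": (475, 518),
    "ColBERT": (3, 256),
}

for name, (vanilla_qps, tree_qps) in benchmarks.items():
    ratio, percent = acceleration(vanilla_qps, tree_qps)
    print(f"{name}: x{ratio:.2f} ({percent:+.2f}%)")

# TfIdf: x1.42 (+42.11%)
# SentenceTransformer: x1.09 (+9.05%)
# ColBERT: x85.33 (+8433.33%)
```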
## Contributing
We welcome contributions to Neural-Tree. Areas of interest include enhancing tree visualization, modeling node topics, and leveraging the tree structure to speed up Large Language Model (LLM) searches. We also aim to refine the hierarchical clustering of ColBERT embeddings, which currently relies on TfIdf. A further opportunity is to optimize clustering so that ColBERT clusters can be built independently of TfIdf.
## License
This project is licensed under the terms of the MIT license.
## References
- [Constructing Tree-based Index for Efficient and Effective Dense Retrieval, Github](https://github.com/cshaitao/jtr)
- [ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT](https://arxiv.org/abs/2004.12832)
- [Myriade](https://github.com/MaxHalford/myriade)
Raw data
{
"_id": null,
"home_page": "https://github.com/raphaelsty/neural-tree",
"name": "neural-tree",
"maintainer": "",
"docs_url": null,
"requires_python": ">=3.6",
"maintainer_email": "",
"keywords": "tree search,neural search,information retrieval,semantic search,colbert,tree",
"author": "Raphael Sourty",
"author_email": "raphael.sourty@gmail.com",
"download_url": "https://files.pythonhosted.org/packages/83/77/f12a27ae5e142a74f4c4f31b5db9b3104ba8dad40d2fcba42cb0e9eba3c3/neural-tree-0.0.1.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "MIT",
"summary": "Neural-Tree",
"version": "0.0.1",
"project_urls": {
"Download": "https://github.com/user/neural-tree/archive/v_01.tar.gz",
"Homepage": "https://github.com/raphaelsty/neural-tree"
},
"split_keywords": [
"tree search",
"neural search",
"information retrieval",
"semantic search",
"colbert",
"tree"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "8377f12a27ae5e142a74f4c4f31b5db9b3104ba8dad40d2fcba42cb0e9eba3c3",
"md5": "b4f7aa7e3669e92d5d6bb628611f6fa8",
"sha256": "2aebe34e0242538fa13d2adc17923ea285156df9013d710537101299424bb857"
},
"downloads": -1,
"filename": "neural-tree-0.0.1.tar.gz",
"has_sig": false,
"md5_digest": "b4f7aa7e3669e92d5d6bb628611f6fa8",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.6",
"size": 27085,
"upload_time": "2024-02-28T02:04:09",
"upload_time_iso_8601": "2024-02-28T02:04:09.951637Z",
"url": "https://files.pythonhosted.org/packages/83/77/f12a27ae5e142a74f4c4f31b5db9b3104ba8dad40d2fcba42cb0e9eba3c3/neural-tree-0.0.1.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-02-28 02:04:09",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "raphaelsty",
"github_project": "neural-tree",
"travis_ci": false,
"coveralls": false,
"github_actions": false,
"lcname": "neural-tree"
}