pyseismic-lsr

Name	pyseismic-lsr
Version	0.1.1
home_page	None
Summary	Seismic: A high-performance data structure for fast retrieval over learned sparse embeddings.
upload_time	2025-03-03 16:17:33
maintainer	None
docs_url	None
author	None
requires_python	>=3.7
license	MIT
keywords	search indexing sparse retrieval
VCS
bugtrack_url
requirements	No requirements were recorded.

<h1 align="center">Seismic</h1>
<p align="center">
    <img width="200px" src="imgs/new_logo_seismic.webp" />
</p>

<p align="center">
    <a href="https://dl.acm.org/doi/pdf/10.1145/3626772.3657769"><img src="https://badgen.net/static/paper/SIGIR 2024/green" /></a>  
    <a href="https://dl.acm.org/doi/pdf/10.1145/3627673.3679977"><img src="https://badgen.net/static/paper/CIKM 2024/blue" /></a>
    <a href="https://arxiv.org/abs/2501.11628"><img src="https://badgen.net/static/paper/ECIR 2025/yellow" /></a>
    <a href="http://arxiv.org/abs/2404.18812"><img src="https://badgen.net/static/arXiv/2404.18812/red" /></a>
</p>

<p align="center">    
    <a href="https://crates.io/crates/seismic"><img src="https://badgen.infra.medigy.com/crates/v/seismic" /></a>
    <a href="https://crates.io/crates/seismic"><img src="https://badgen.infra.medigy.com/crates/d/seismic" /></a>
    <a href="LICENSE.md"><img src="https://badgen.net/static/license/MIT/blue" /></a>
</p>

Seismic is a highly efficient data structure for fast retrieval over *learned sparse embeddings*. Designed with scalability and performance in mind, Seismic makes querying sparse representations seamless.




### ⚡ Installation  


To install Seismic, simply run:


```bash
pip install pyseismic-lsr
```
For performance optimizations, check out the detailed installation guide in docs/Installation.md.


### 🚀 Quick Start  


Given a collection as a `jsonl` file (details [here](#data-format)), you can index it by running:
```python
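from seismic import SeismicIndex  # assumed import path (module `seismic`); adjust if your install differs

json_input_file = ""  # path to your data collection (jsonl)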

index = SeismicIndex.build(json_input_file)
print("Number of documents: ", index.len)
print("Avg number of non-zero components: ", index.nnz / index.len)
print("Dimensionality of the vectors: ", index.dim)

index.print_space_usage_byte()
```
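
As a sanity check on the numbers printed above, similar statistics can be computed directly from the `jsonl` file with the standard library. This is a minimal sketch, not part of Seismic: `collection_stats` is a hypothetical helper, and it assumes that the reported dimensionality corresponds to the number of distinct tokens, which may not match how the index defines `dim`.

```python
import json
from typing import Dict, Iterable


def collection_stats(lines: Iterable[str]) -> Dict[str, float]:
    """Compute simple statistics over JSONL records that carry a `vector` field."""
    num_docs = 0
    total_nnz = 0
    vocabulary = set()
    for line in lines:
        line = line.strip()
        if not line:  # skip blank lines
            continue
        vector = json.loads(line)["vector"]
        num_docs += 1
        # Count only entries whose score is actually non-zero.
        total_nnz += sum(1 for score in vector.values() if score != 0)
        vocabulary.update(vector.keys())
    return {
        "num_docs": num_docs,
        "avg_nnz": total_nnz / num_docs if num_docs else 0.0,
        # Assumption: distinct tokens stand in for the dimensionality.
        "distinct_tokens": len(vocabulary),
    }


# Two in-memory records; for a real file, pass an open file handle instead.
sample = [
    '{"id": 0, "vector": {"dog": 2.45, "cat": 0.3}}',
    '{"id": 1, "vector": {"dog": 1.1}}',
]
stats = collection_stats(sample)
print(stats)
```

For a collection on disk, `with open(json_input_file) as f: collection_stats(f)` streams the file line by line without loading it into memory.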

You can then use the index to retrieve the top-`k` results for a set of queries:

```python
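import numpy as np

# `index` is the SeismicIndex built in the previous snippet.
MAX_TOKEN_LEN = 30  # maximum token length, in characters
string_type = f'U{MAX_TOKEN_LEN}'  # fixed-width NumPy Unicode dtype

# A sparse query: token -> score
query = {"a": 3.5, "certain": 3.5, "query": 0.4}
queries_ids = np.array([0])
query_components = np.array(list(query.keys()), dtype=string_type)
query_values = np.array(list(query.values()), dtype=np.float32)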

results = index.batch_search(
    queries_ids=queries_ids,
    query_components=query_components,
    query_values=query_values,
    k=10
)
```
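
To make clear what such a search approximates, here is an exact brute-force baseline over sparse vectors stored as plain dictionaries. It is only a sketch under assumptions: it takes the relevance score to be the inner product between query and document vectors (the usual choice for learned sparse retrieval), and `inner_product` and `brute_force_top_k` are hypothetical helpers, not part of the Seismic API. It does not reflect how Seismic works internally or the shape of the results returned by `batch_search`.

```python
import heapq
from typing import Dict, List, Tuple

SparseVector = Dict[str, float]


def inner_product(query: SparseVector, doc: SparseVector) -> float:
    """Sum the products of scores over the tokens shared by query and document."""
    return sum(weight * doc[token] for token, weight in query.items() if token in doc)


def brute_force_top_k(
    query: SparseVector, docs: Dict[int, SparseVector], k: int
) -> List[Tuple[float, int]]:
    """Score every document and return the k best as (score, doc_id) pairs."""
    scored = ((inner_product(query, vector), doc_id) for doc_id, vector in docs.items())
    return heapq.nlargest(k, scored)


# A toy collection: doc_id -> sparse vector
docs = {
    0: {"a": 1.0, "query": 2.0},
    1: {"certain": 2.0},
    2: {"unrelated": 5.0},
}
query = {"a": 3.5, "certain": 3.5, "query": 0.4}

top = brute_force_top_k(query, docs, k=2)
for score, doc_id in top:
    print(doc_id, round(score, 3))
```

A baseline like this scans the whole collection for every query, so it is only practical for small collections or for measuring the accuracy of approximate results.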







### 📥 Download the Datasets  


The embeddings in `jsonl` format for several encoders and datasets can be downloaded from this HuggingFace [repository](https://huggingface.co/collections/tuskanny/seismic-datasets-6610108d39c0f2299f20fc9b), together with the query representations.

As an example, the Splade embeddings for MSMARCO can be downloaded and extracted by running the following commands.

```bash
wget "https://huggingface.co/datasets/tuskanny/seismic-msmarco-splade/resolve/main/documents.tar.gz?download=true" -O documents.tar.gz

tar -xvzf documents.tar.gz
```

Alternatively, use the HuggingFace dataset download [tool](https://huggingface.co/docs/hub/en/datasets-downloading).

### 📄 Data Format  


Documents and queries should have the following format. Each line should be a JSON-formatted string with the following fields:
- `id`: must represent the ID of the document as an integer.
- `content`: the original content of the document, as a string. This field is optional. 
- `vector`: a dictionary where each key represents a token, and its corresponding value is the score, e.g., `{"dog": 2.45}`.

This is the standard output format of several libraries to train sparse models, such as [`learned-sparse-retrieval`](https://github.com/thongnt99/learned-sparse-retrieval).
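
The fields listed above can be checked and serialized with the standard library alone. The following is a minimal sketch: `validate_record` and `to_jsonl_line` are hypothetical helpers written for illustration, not part of Seismic, and the checks only cover what this section states (integer `id`, optional string `content`, and a `vector` mapping tokens to numeric scores).

```python
import json
from typing import Any, Dict


def validate_record(record: Dict[str, Any]) -> None:
    """Raise ValueError if the record does not match the fields described above."""
    doc_id = record.get("id")
    # bool is a subclass of int in Python, so reject it explicitly.
    if not isinstance(doc_id, int) or isinstance(doc_id, bool):
        raise ValueError("`id` must be an integer")
    if "content" in record and not isinstance(record["content"], str):
        raise ValueError("`content`, when present, must be a string")
    vector = record.get("vector")
    if not isinstance(vector, dict):
        raise ValueError("`vector` must be a dictionary of token -> score")
    for token, score in vector.items():
        if (
            not isinstance(token, str)
            or isinstance(score, bool)
            or not isinstance(score, (int, float))
        ):
            raise ValueError(f"invalid entry in `vector`: {token!r}: {score!r}")


def to_jsonl_line(record: Dict[str, Any]) -> str:
    """Validate a record and serialize it as a single JSON line."""
    validate_record(record)
    return json.dumps(record, ensure_ascii=False)


records = [
    {"id": 0, "content": "A dog barks.", "vector": {"dog": 2.45, "bark": 1.2}},
    {"id": 1, "vector": {"cat": 1.9}},  # `content` is optional
]
lines = [to_jsonl_line(record) for record in records]
print("\n".join(lines))
```

Writing `"\n".join(lines)` to a file yields one JSON object per line, which is the layout this section describes.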

The script ```convert_json_to_inner_format.py``` allows converting files formatted accordingly into the ```seismic``` inner format.

```bash
python scripts/convert_json_to_inner_format.py --document-path /path/to/document.jsonl --queries-path /path/to/queries.jsonl --output-dir /path/to/output 
```
This will generate a ```data``` directory at the ```/path/to/output``` path, with ```documents.bin``` and ```queries.bin``` binary files inside.

If you download the NQ dataset from the HuggingFace repo, you need to specify ```--input-format nq``` as it uses a slightly different format. 


### Resources

Check out our `docs` folder for more detailed guides on how to use Seismic directly in Rust, replicate the results of our paper, or use Seismic with your custom collection.



### <a name="bib">📚 Bibliography</a>
1. Sebastian Bruch, Franco Maria Nardini, Cosimo Rulli, and Rossano Venturini. "Efficient Inverted Indexes for Approximate Retrieval over Learned Sparse Representations." In ACM SIGIR 2024.
2. Sebastian Bruch, Franco Maria Nardini, Cosimo Rulli, and Rossano Venturini. "Pairing Clustered Inverted Indexes with κ-NN Graphs for Fast Approximate Retrieval over Learned Sparse Representations." In ACM CIKM 2024.
3. Sebastian Bruch, Franco Maria Nardini, Cosimo Rulli, Rossano Venturini, and Leonardo Venuta. "Investigating the Scalability of Approximate Sparse Retrieval Algorithms to Massive Datasets." To appear in ECIR 2025.

### Citation License

The source code in this repository is subject to the following citation license:

By downloading and using this software, you agree to cite the under-noted paper in any kind of material you produce where it was used to conduct a search or experimentation, whether it be a research paper, dissertation, article, poster, presentation, or documentation. By using this software, you have agreed to the citation license.


SIGIR 2024
```bibtex
@inproceedings{Seismic,
  author    = {Sebastian Bruch and Franco Maria Nardini and Cosimo Rulli and Rossano Venturini},
  title     = {Efficient Inverted Indexes for Approximate Retrieval over Learned Sparse Representations},
  booktitle = {The 47th International {ACM} {SIGIR} {C}onference on Research and Development in Information Retrieval ({SIGIR})},
  pages     = {152--162},
  publisher = {{ACM}},
  year      = {2024},
  url       = {https://doi.org/10.1145/3626772.3657769},
  doi       = {10.1145/3626772.3657769},
}
```
CIKM 2024

```bibtex 
@inproceedings{bruch2024pairing,
  title={Pairing Clustered Inverted Indexes with $\kappa$-NN Graphs for Fast Approximate Retrieval over Learned Sparse Representations},
  author={Bruch, Sebastian and Nardini, Franco Maria and Rulli, Cosimo and Venturini, Rossano},
  booktitle={Proceedings of the 33rd ACM International Conference on Information and Knowledge Management},
  pages={3642--3646},
  year={2024}
}
```

Raw data

            {
    "_id": null,
    "home_page": null,
    "name": "pyseismic-lsr",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.7",
    "maintainer_email": null,
    "keywords": "search, indexing, sparse retrieval",
    "author": null,
    "author_email": "Sebastian Bruch <s.bruch@northeastern.edu>, Franco Maria Nardini <francomaria.nardini@isti.cnr.it>, Cosimo Rulli <cosimo.rulli@gmail.com>, Rossano Venturini <rossano.venturini@unipi.it>, Leonardo Venuta <l.venuta@studenti.unipi.it>",
    "download_url": "https://files.pythonhosted.org/packages/c0/96/cef8c4c9d913c84e4dc1978ab03acf52bb2ff701cb2a053bf3276ed20564/pyseismic_lsr-0.1.1.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "Seismic: A high-performance data structure for fast retrieval over learned sparse embeddings.",
    "version": "0.1.1",
    "project_urls": {
        "Source Code": "https://github.com/TusKANNy/seismic"
    },
    "split_keywords": [
        "search",
        " indexing",
        " sparse retrieval"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "eb29fbe2e9edb9e397a351babc709ecd806bf99d849928eddf29208cba8e346c",
                "md5": "21a6f6728786a279ddcabab76c6275ea",
                "sha256": "e475a01733a02a4a70d0c82be05f3f9e143cf1d16641e22246b1083f3daecc2a"
            },
            "downloads": -1,
            "filename": "pyseismic_lsr-0.1.1-cp310-cp310-macosx_10_12_x86_64.whl",
            "has_sig": false,
            "md5_digest": "21a6f6728786a279ddcabab76c6275ea",
            "packagetype": "bdist_wheel",
            "python_version": "cp310",
            "requires_python": ">=3.7",
            "size": 682206,
            "upload_time": "2025-03-03T16:17:31",
            "upload_time_iso_8601": "2025-03-03T16:17:31.563933Z",
            "url": "https://files.pythonhosted.org/packages/eb/29/fbe2e9edb9e397a351babc709ecd806bf99d849928eddf29208cba8e346c/pyseismic_lsr-0.1.1-cp310-cp310-macosx_10_12_x86_64.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "c096cef8c4c9d913c84e4dc1978ab03acf52bb2ff701cb2a053bf3276ed20564",
                "md5": "a6741528ebe77c50668ea8cb0c03f9f6",
                "sha256": "63a5d1d76f50e1a7c588f15c2b4a8d703c25dc3dd01981a71082e269cf41c397"
            },
            "downloads": -1,
            "filename": "pyseismic_lsr-0.1.1.tar.gz",
            "has_sig": false,
            "md5_digest": "a6741528ebe77c50668ea8cb0c03f9f6",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.7",
            "size": 1068262,
            "upload_time": "2025-03-03T16:17:33",
            "upload_time_iso_8601": "2025-03-03T16:17:33.640901Z",
            "url": "https://files.pythonhosted.org/packages/c0/96/cef8c4c9d913c84e4dc1978ab03acf52bb2ff701cb2a053bf3276ed20564/pyseismic_lsr-0.1.1.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-03-03 16:17:33",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "TusKANNy",
    "github_project": "seismic",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "lcname": "pyseismic-lsr"
}
