# indxr

- Name: indxr
- Version: 0.1.5
- Home page: https://github.com/AmenRa/indxr
- Summary: indxr: A Python utility for indexing long files.
- Upload time: 2023-10-10 12:08:58
- Author: Elias Bassani
- Requires Python: >=3.7
- Keywords: text index, file index, index, indexer, indexing, information retrieval, natural language processing

<p align="center">
  <!-- Python -->
  <a href="https://www.python.org" alt="Python">
      <img src="https://badges.aleen42.com/src/python.svg" />
  </a>
  <!-- Version -->
  <a href="https://badge.fury.io/py/indxr"><img src="https://badge.fury.io/py/indxr.svg" alt="PyPI version" height="18"></a>
  <!-- Black -->
  <a href="https://github.com/psf/black" alt="Code style: black">
      <img src="https://img.shields.io/badge/code%20style-black-000000.svg" />
  </a>
  <!-- License -->
  <a href="https://lbesson.mit-license.org/"><img src="https://img.shields.io/badge/License-MIT-blue.svg" alt="License: MIT"></a>
</p>


## ⚡️ Introduction

[indxr](https://github.com/AmenRa/indxr) is a Python utility for indexing long files. It lets you read specific lines on demand, quickly, without loading the whole file into RAM.

For example, given a 10M-line JSONl file and a 2018 MacBook Pro, reading any specific line takes less than 10 µs, reading 1k non-contiguous lines takes less than 10 ms, reading 1k contiguous lines takes less than 2 ms, and iterating over the entire file in batches of 32 lines takes less than 20 s (64 µs per batch). In other words, [indxr](https://github.com/AmenRa/indxr) allows you to use your disk as a RAM extension without noticeable slowdowns, especially with SSDs and NVMe drives.
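As a quick sanity check on the last figure, the per-batch time and the total iteration time quoted above are consistent with each other. The snippet only redoes the arithmetic from the numbers in the paragraph; it is not a benchmark.

```python
total_lines = 10_000_000   # 10M-line file, as quoted above
batch_size = 32            # lines per batch
seconds_per_batch = 64e-6  # 64 µs per batch

num_batches = total_lines // batch_size          # 312,500 batches
total_seconds = num_batches * seconds_per_batch  # roughly 20 seconds

print(f"{num_batches} batches, about {total_seconds:.0f} s in total")
```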

[indxr](https://github.com/AmenRa/indxr) can be particularly useful for dynamically loading data from large datasets with a low memory footprint and without slowing down downstream tasks, such as data processing and neural network training.
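To give an intuition for why reading a single line can be this cheap, here is a minimal sketch of the general byte-offset technique: scan the file once, remember where each line starts, then `seek` directly to the line you need. This is an illustration using only the standard library, not indxr's actual implementation; the helper names `build_line_offsets` and `read_line` are made up for this example.

```python
import os
import tempfile


def build_line_offsets(path):
    """Return the byte offset at which each line of the file starts."""
    offsets = []
    position = 0
    with open(path, "rb") as f:
        for line in f:
            offsets.append(position)
            position += len(line)
    return offsets


def read_line(path, offsets, i):
    """Read only line `i` by seeking straight to its start offset."""
    with open(path, "rb") as f:
        f.seek(offsets[i])
        return f.readline().rstrip(b"\n").decode("utf-8")


# Small demo on a temporary three-line file
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write(b"alpha\nbeta\ngamma\n")
    demo_path = tmp.name

demo_offsets = build_line_offsets(demo_path)
third_line = read_line(demo_path, demo_offsets, 2)
os.remove(demo_path)
```

Only the list of offsets stays in memory, which is why the footprint remains small even for very long files.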

For an overview, see the [Usage](#-usage) section.


## 🔌 Installation
```bash
pip install indxr
```

## 💡 Usage

- [txt](https://github.com/AmenRa/indxr#txt)
- [jsonl](https://github.com/AmenRa/indxr#jsonl)
- [csv / tsv](https://github.com/AmenRa/indxr#csv--tsv--custom)
- [callback](https://github.com/AmenRa/indxr#callback-works-with-every-file-type)
- [write / read](https://github.com/AmenRa/indxr#write--read-index)
- [PyTorch Dataset example](https://github.com/AmenRa/indxr#usage-example-with-pytorch-dataset)

### TXT
```python
from indxr import Indxr

index = Indxr("sample.txt")

# First line of sample.txt
index[0]

# List containing the second and third lines of sample.txt
index[1:3]

# First line of sample.txt
index.get("0")

# List containing the third and second lines of sample.txt
index.mget(["2", "1"])
```


### JSONl

```python
from indxr import Indxr

index = Indxr("sample.jsonl", key_id="id")  # key_id="id" is the default

# JSON object at line 43 as Python Dictionary
# Reads only the 43rd line
index[42]

# JSON objects at lines 43, 44, and 45 as Python Dictionaries
# Reads only the 43rd, 44th, and 45th lines
index[42:45]

# JSON object with id="id_123" as Python Dictionary,
# Reads only the line where the JSON object is located
index.get("id_123")

# Same as `get` but for multiple JSON objects
index.mget(["id_123", "id_321"])
```
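The key-based lookups above can be pictured as a dictionary that maps each object's identifier to the byte offset of its line. The following is a conceptual sketch under that assumption, written with the standard library only; it does not reproduce indxr's internals, and `build_key_index`, `get`, and `mget` here are standalone illustrative functions, not the library's methods.

```python
import json
import os
import tempfile


def build_key_index(path, key_id="id"):
    """Map each object's `key_id` value to the byte offset of its line."""
    key_to_offset = {}
    position = 0
    with open(path, "rb") as f:
        for line in f:
            key_to_offset[json.loads(line)[key_id]] = position
            position += len(line)
    return key_to_offset


def get(path, key_to_offset, key):
    """Seek to the line holding `key` and parse only that line."""
    with open(path, "rb") as f:
        f.seek(key_to_offset[key])
        return json.loads(f.readline())


def mget(path, key_to_offset, keys):
    """Same as `get` for several keys, preserving the requested order."""
    return [get(path, key_to_offset, key) for key in keys]


# Small demo on a temporary JSONl file with made-up records
records = [
    {"id": "id_123", "text": "first"},
    {"id": "id_321", "text": "second"},
]
with tempfile.NamedTemporaryFile("wb", suffix=".jsonl", delete=False) as tmp:
    for record in records:
        tmp.write((json.dumps(record) + "\n").encode("utf-8"))
    demo_path = tmp.name

key_index = build_key_index(demo_path)
single = get(demo_path, key_index, "id_321")
several = mget(demo_path, key_index, ["id_321", "id_123"])
os.remove(demo_path)
```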


### CSV / TSV / ...

```python
from indxr import Indxr

index = Indxr(
  "sample.csv",
  delimiter=",",    # Default value. Automatically switched to `\t` for `.tsv` files.
  fieldnames=None,  # Default value. List of fieldnames. Overrides header, if any.
  has_header=True,  # Default value. If `True`, treats first line as header.
  return_dict=True, # Default value. If `True`, returns Python Dictionary, string otherwise.
  key_id="id",      # Default value. Same as for JSONl. Ignored if return_dict is `False`.
)

# Line 43 as Python Dictionary
index[42]

# Lines 43, 44, and 45 as Python Dictionaries
index[42:45]

# Line with id="id_123" as Python Dictionary
index.get("id_123")

# Same as `get` but for multiple lines
index.mget(["id_123", "id_321"])
```
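To illustrate what `return_dict`, `fieldnames`, and `delimiter` mean in practice, the sketch below turns one raw delimited line into a dictionary using the standard `csv` module. It is a hedged illustration of the idea, not the code indxr runs; `parse_row` and the sample header are hypothetical.

```python
import csv


def parse_row(line, fieldnames, delimiter=","):
    """Split one delimited line and pair the values with the field names."""
    values = next(csv.reader([line], delimiter=delimiter))
    return dict(zip(fieldnames, values))


# Field names taken from a made-up header line
header = "id,title,year"
fieldnames = next(csv.reader([header]))

# Quoted fields may contain the delimiter itself
csv_row = parse_row('id_123,"Hello, world",2023', fieldnames)

# The same helper handles tab-separated lines
tsv_row = parse_row("id_321\tGoodbye\t1999", fieldnames, delimiter="\t")
```

Note that every value comes back as a string; converting `year` to an integer would be up to the caller.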

### Custom
```python
from indxr import Indxr

# The file must have multiple lines
index = Indxr("sample.something")

# First line of sample.something in bytes
index[0]

# List containing the second and third lines of sample.something in bytes
index[1:3]

# First line of sample.something in bytes
index.get("0")

# List containing the third and second lines of sample.something in bytes
index.mget(["2", "1"])
```

### Callback (works with every file-type)

```python
from indxr import Indxr

index = Indxr("sample.txt", callback=lambda x: x.split())

# First line of sample.txt split into a list
index.get("0")
```


### Write / Read Index
```python
from indxr import Indxr

index = Indxr("sample.txt", callback=lambda x: x.split())

path = "sample_index"  # Hypothetical location for the saved index
index.write(path)  # Write index to disk

# Read index from disk, callback must be re-defined
index = Indxr.read(path, callback=lambda x: x.split())
```
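One plausible reason the callback has to be passed again when reading is that only plain data can be written to disk, while a Python function such as a lambda cannot be stored the same way. The sketch below illustrates that idea with a toy offset index saved as JSON. The file layout and the `write_index` / `read_index` helpers are assumptions made for this example and say nothing about the format indxr actually uses.

```python
import json
import os
import tempfile


def write_index(offsets, index_path):
    """Persist the offsets as JSON; only plain data is stored."""
    with open(index_path, "w", encoding="utf-8") as f:
        json.dump({"offsets": offsets}, f)


def read_index(index_path, callback=None):
    """Reload the offsets and pair them with a freshly supplied callback."""
    with open(index_path, encoding="utf-8") as f:
        offsets = json.load(f)["offsets"]
    return offsets, callback


# Save a toy index, then reload it and re-attach the callback
index_path = os.path.join(tempfile.mkdtemp(), "toy_index.json")
write_index([0, 6, 11], index_path)

loaded_offsets, loaded_callback = read_index(
    index_path, callback=lambda x: x.split()
)
tokens = loaded_callback("lorem ipsum dolor")
```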


### Usage example with PyTorch Dataset

In this example, we build a PyTorch Dataset that returns a query and two documents, one positive and one negative, for training a neural retriever. The data is stored in two files: `queries.jsonl` contains the queries and `documents.jsonl` contains the documents. Each query has a list of associated positive and negative documents. With `Indxr`, we avoid loading the entire dataset into memory and load data dynamically, without slowing down the training process.

```python
import random

from indxr import Indxr
from torch.utils.data import DataLoader, Dataset

class CustomDataset(Dataset):
    def __init__(self):
        # key_id must match the identifier field used in each file
        self.queries = Indxr("queries.jsonl", key_id="q_id")
        self.documents = Indxr("documents.jsonl", key_id="doc_id")

    def __getitem__(self, index: int):
        # Get query ------------------------------------------------------------
        query = self.queries[index]

        # Sampling -------------------------------------------------------------
        pos_doc_id = random.choice(query["pos_doc_ids"])
        neg_doc_id = random.choice(query["neg_doc_ids"])

        # Get docs -------------------------------------------------------------
        pos_doc = self.documents.get(pos_doc_id)
        neg_doc = self.documents.get(neg_doc_id)

        # The outputs must be batched and transformed to
        # meaningful tensors using a DataLoader and
        # a custom collator function
        return query["text"], pos_doc["text"], neg_doc["text"]

    def __len__(self):
        return len(self.queries)


def collate_fn(batch):
    # Extract data -------------------------------------------------------------
    queries = [x[0] for x in batch]
    pos_docs = [x[1] for x in batch]
    neg_docs = [x[2] for x in batch]

    # Texts tokenization -------------------------------------------------------
    # `tokenizer` is assumed to be defined elsewhere; it is not part of indxr
    queries = tokenizer(queries)    # Returns PyTorch Tensor
    pos_docs = tokenizer(pos_docs)  # Returns PyTorch Tensor
    neg_docs = tokenizer(neg_docs)  # Returns PyTorch Tensor

    return queries, pos_docs, neg_docs


dataloader = DataLoader(
    dataset=CustomDataset(),
    collate_fn=collate_fn,
    batch_size=32,
    shuffle=True,
    num_workers=4,
)
```

Each line of `queries.jsonl` is a JSON object like the following (pretty-printed here; in the file, each object sits on a single line):
```json
{
  "q_id": "q321",
  "text": "lorem ipsum",
  "pos_doc_ids": ["d2789822", "d2558037", "d2594098"],
  "neg_doc_ids": ["d3931445", "d4652233", "d191393", "d3692918", "d3051731"]
}
```

Each line of `documents.jsonl` is a JSON object like the following (again pretty-printed here; one object per line in the file):
```json
{
  "doc_id": "d123",
  "text": "Lorem ipsum dolor sit amet, consectetuer adipiscing elit."
}
```
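If you want to try the example without a real dataset, a small script like the one below can generate toy files in the two layouts above, with one compact JSON object per line. The record values, the temporary directory, and the `write_jsonl` / `read_jsonl` helpers are made up for illustration.

```python
import json
import os
import tempfile

# Toy records following the two layouts shown above (values are made up)
queries = [
    {"q_id": "q1", "text": "lorem ipsum", "pos_doc_ids": ["d1"], "neg_doc_ids": ["d2"]},
    {"q_id": "q2", "text": "dolor sit", "pos_doc_ids": ["d2"], "neg_doc_ids": ["d1"]},
]
documents = [
    {"doc_id": "d1", "text": "Lorem ipsum dolor sit amet."},
    {"doc_id": "d2", "text": "Consectetuer adipiscing elit."},
]


def write_jsonl(records, path):
    """Write one compact JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")


def read_jsonl(path):
    """Parse every line of a JSONl file back into a dictionary."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f]


data_dir = tempfile.mkdtemp()
queries_path = os.path.join(data_dir, "queries.jsonl")
documents_path = os.path.join(data_dir, "documents.jsonl")

write_jsonl(queries, queries_path)
write_jsonl(documents, documents_path)
```

Every document id referenced in `pos_doc_ids` and `neg_doc_ids` exists in the documents file, which the Dataset above relies on when it looks documents up by id.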


## 🎁 Feature Requests
Would you like to see other features implemented? Please open a [feature request](https://github.com/AmenRa/indxr/issues/new?assignees=&labels=enhancement&template=feature_request.md&title=%5BFeature+Request%5D+title).


## 🤘 Want to contribute?
Would you like to contribute? Please drop me an [e-mail](mailto:elias.bssn@gmail.com?subject=[GitHub]%20indxr).


## 📄 License
[indxr](https://github.com/AmenRa/indxr) is open-source software licensed under the [MIT license](LICENSE).



            
