<!-- <div align="center">
<img src="https://repository-images.githubusercontent.com/268892956/750228ec-f3f2-465d-9c17-420c688ba2bc">
</div> -->
<p align="center">
<!-- Python -->
<a href="https://www.python.org" alt="Python">
<img src="https://badges.aleen42.com/src/python.svg" />
</a>
<!-- Version -->
<a href="https://badge.fury.io/py/indxr"><img src="https://badge.fury.io/py/indxr.svg" alt="PyPI version" height="18"></a>
<!-- Black -->
<a href="https://github.com/psf/black" alt="Code style: black">
<img src="https://img.shields.io/badge/code%20style-black-000000.svg" />
</a>
<!-- License -->
<a href="https://lbesson.mit-license.org/"><img src="https://img.shields.io/badge/License-MIT-blue.svg" alt="License: MIT"></a>
</p>
## ⚡️ Introduction
[indxr](https://github.com/AmenRa/indxr) is a Python utility for indexing long files, allowing you to quickly and dynamically read specific lines without hogging your RAM.

For example, given a 10M-line JSONl file and a 2018 MacBook Pro, reading any specific line takes less than 10 µs, reading 1k non-contiguous lines takes less than 10 ms, reading 1k contiguous lines takes less than 2 ms, and iterating over the entire file in batches of 32 lines takes less than 20 s (64 µs per batch). In other words, [indxr](https://github.com/AmenRa/indxr) lets you use your disk as a RAM extension without noticeable slowdowns, especially with SSDs and NVMe drives.

[indxr](https://github.com/AmenRa/indxr) can be particularly useful for dynamically loading data from large datasets with a low memory footprint and without slowing down downstream tasks, such as data processing and Neural Network training.

For an overview, see the [Usage](#-usage) section.
<!-- ## ✨ Features -->
## 🔌 Installation
```bash
pip install indxr
```
## 💡 Usage
- [txt](https://github.com/AmenRa/indxr#txt)
- [jsonl](https://github.com/AmenRa/indxr#jsonl)
- [csv / tsv](https://github.com/AmenRa/indxr#csv--tsv--custom)
- [callback](https://github.com/AmenRa/indxr#callback-works-with-every-file-type)
- [write / read](https://github.com/AmenRa/indxr#write--read-index)
- [PyTorch Dataset example](https://github.com/AmenRa/indxr#usage-example-with-pytorch-dataset)
### TXT
```python
from indxr import Indxr
index = Indxr("sample.txt")
# First line of sample.txt
index[0]
# List containing the second and third lines of sample.txt
index[1:3]
# First line of sample.txt
index.get("0")
# List containing the third and second lines of sample.txt
index.mget(["2", "1"])
```
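Since an `Indxr` object supports `len` and slicing, you can also iterate over a long file in batches, as mentioned in the introduction. A minimal sketch (the file name and batch size are just placeholders):

```python
from indxr import Indxr

index = Indxr("sample.txt")

batch_size = 32
for start in range(0, len(index), batch_size):
    # Each batch is a list of up to `batch_size` contiguous lines
    batch = index[start : start + batch_size]
```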
### JSONl
```python
from indxr import Indxr
index = Indxr("sample.jsonl", key_id="id") # key_id="id" is by default
# JSON object at line 43 as Python Dictionary
# Reads only the 43rd line
index[42]
# JSON objects at lines 43 to 46 as Python Dictionaries
# Reads only the 43rd to 46th lines
index[42:46]
# JSON object with id="id_123" as Python Dictionary
# Reads only the line where the JSON object is located
index.get("id_123")
# Same as `get` but for multiple JSON objects
index.mget(["id_123", "id_321"])
```
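For reference, `sample.jsonl` is expected to contain one JSON object per line, each carrying the field named by `key_id`. A minimal sketch that creates such a file (contents are made up for illustration) and indexes it:

```python
import json

from indxr import Indxr

# One JSON object per line; each object carries the field used as `key_id`
records = [
    {"id": "id_123", "text": "first document"},
    {"id": "id_321", "text": "second document"},
]
with open("sample.jsonl", "w") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")

index = Indxr("sample.jsonl")

index.get("id_123")  # {"id": "id_123", "text": "first document"}
```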
### CSV / TSV / ...
```python
from indxr import Indxr
index = Indxr(
    "sample.csv",
    delimiter=",",     # Default value. Automatically switched to `\t` for `.tsv` files.
    fieldnames=None,   # Default value. List of fieldnames. Overrides header, if any.
    has_header=True,   # Default value. If `True`, treats first line as header.
    return_dict=True,  # Default value. If `True`, returns Python Dictionary, string otherwise.
    key_id="id",       # Default value. Same as for JSONl. Ignored if `return_dict` is `False`.
)
# Line 43 as Python Dictionary
index[42]
# Lines 43 to 46 as Python Dictionaries
index[42:46]
# Line with id="id_123" as Python Dictionary
index.get("id_123")
# Same as `get` but for multiple lines
index.mget(["id_123", "id_321"])
```
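For reference, a `sample.csv` compatible with the defaults above could look like the one written below (made-up contents): the header row provides the field names, and the `id` column is the lookup key used by `get` / `mget`.

```python
from indxr import Indxr

# Hypothetical CSV with a header row; the "id" column matches the default `key_id`
with open("sample.csv", "w") as f:
    f.write("id,text\n")
    f.write("id_123,hello world\n")
    f.write("id_321,lorem ipsum\n")

index = Indxr("sample.csv")  # delimiter="," and has_header=True by default

index.get("id_123")  # {"id": "id_123", "text": "hello world"}
```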
### Custom
```python
from indxr import Indxr
# Works with any file type, as long as the file is organized in lines
index = Indxr("sample.something")
# First line of sample.something in bytes
index[0]
# List containing the second and third lines of sample.something in bytes
index[1:3]
# First line of sample.something in bytes
index.get("0")
# List containing the third and second lines of sample.something in bytes
index.mget(["2", "1"])
```
### Callback (works with every file-type)
```python
from indxr import Indxr
index = Indxr("sample.txt", callback=lambda x: x.split())
index.get("0")
>>> # First line of sample.txt split into a list
```
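Since unrecognized file types are returned as raw bytes (see the Custom section above), a callback is also a convenient place to decode them. A minimal sketch, assuming a hypothetical `sample.something` containing UTF-8 text and assuming the callback receives the same raw bytes the index would otherwise return:

```python
from indxr import Indxr

# Decode each requested line from bytes to string on the fly (assumes UTF-8 content)
index = Indxr("sample.something", callback=lambda raw: raw.decode("utf-8"))

index[0]  # First line of sample.something as a string instead of bytes
```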
### Write / Read Index
```python
from indxr import Indxr
index = Indxr("sample.txt", callback=lambda x: x.split())
index.write(path) # Write index to disk
# Read index from disk; the callback must be re-defined
index = Indxr.read(path, callback=lambda x: x.split())
```
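A minimal end-to-end sketch (the index path below is just a hypothetical example); since the callback is not stored on disk, it has to be passed again when reading the index back:

```python
from indxr import Indxr

index = Indxr("sample.txt", callback=lambda x: x.split())
index.write("sample.index")  # hypothetical path

# Later, or in another process
index = Indxr.read("sample.index", callback=lambda x: x.split())
index.get("0")  # First line of sample.txt split into a list
```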
### Usage example with PyTorch Dataset
In this example, we build a PyTorch Dataset that returns a query and two documents, one positive and one negative, for training a neural retriever. The data is stored in two files: `queries.jsonl`, containing the queries, and `documents.jsonl`, containing the documents. Each query has a list of associated positive and negative document IDs (see the file formats below). Using `Indxr`, we avoid loading the entire dataset into memory and load the data dynamically, without slowing down the training process.
```python
import random
from indxr import Indxr
from torch.utils.data import DataLoader, Dataset
class CustomDataset(Dataset):
    def __init__(self):
        # Use the ID fields of the two files as lookup keys (see the file formats below)
        self.queries = Indxr("queries.jsonl", key_id="q_id")
        self.documents = Indxr("documents.jsonl", key_id="doc_id")

    def __getitem__(self, index: int):
        # Get query ------------------------------------------------------------
        query = self.queries[index]

        # Sampling -------------------------------------------------------------
        pos_doc_id = random.choice(query["pos_doc_ids"])
        neg_doc_id = random.choice(query["neg_doc_ids"])

        # Get docs -------------------------------------------------------------
        pos_doc = self.documents.get(pos_doc_id)
        neg_doc = self.documents.get(neg_doc_id)

        # The outputs must be batched and transformed to
        # meaningful tensors using a DataLoader and
        # a custom collator function
        return query["text"], pos_doc["text"], neg_doc["text"]

    def __len__(self):
        return len(self.queries)


def collate_fn(batch):
    # Extract data -------------------------------------------------------------
    queries = [x[0] for x in batch]
    pos_docs = [x[1] for x in batch]
    neg_docs = [x[2] for x in batch]

    # Text tokenization --------------------------------------------------------
    # `tokenizer` is assumed to be defined elsewhere (e.g., a Hugging Face
    # tokenizer) and to return PyTorch Tensors
    queries = tokenizer(queries)
    pos_docs = tokenizer(pos_docs)
    neg_docs = tokenizer(neg_docs)

    return queries, pos_docs, neg_docs


dataloader = DataLoader(
    dataset=CustomDataset(),
    collate_fn=collate_fn,
    batch_size=32,
    shuffle=True,
    num_workers=4,
)
```
Each line of `queries.jsonl` is as follows:
```json
{
  "q_id": "q321",
  "text": "lorem ipsum",
  "pos_doc_ids": ["d2789822", "d2558037", "d2594098"],
  "neg_doc_ids": ["d3931445", "d4652233", "d191393", "d3692918", "d3051731"]
}
```
Each line of `documents.jsonl` is as follows:
```json
{
  "doc_id": "d123",
  "text": "Lorem ipsum dolor sit amet, consectetuer adipiscing elit."
}
```
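With `queries.jsonl` and `documents.jsonl` laid out as above, the dataset can be sanity-checked before wiring it into training; a small sketch using the `CustomDataset` defined earlier:

```python
dataset = CustomDataset()

# One training example: query text plus one positive and one negative document text
query_text, pos_text, neg_text = dataset[0]
print(query_text, pos_text, neg_text)
```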
## 🎁 Feature Requests
Would you like to see other features implemented? Please open a [feature request](https://github.com/AmenRa/indxr/issues/new?assignees=&labels=enhancement&template=feature_request.md&title=%5BFeature+Request%5D+title).
## 🤘 Want to contribute?
Would you like to contribute? Please drop me an [e-mail](mailto:elias.bssn@gmail.com?subject=[GitHub]%20indxr).
## 📄 License
[indxr](https://github.com/AmenRa/indxr) is open-source software licensed under the [MIT license](LICENSE).