npids

Name	npids
Version	0.1.2
home_page	None
Summary	None
upload_time	2025-02-03 16:51:25
maintainer	None
docs_url	None
author	Sean MacAvaney
requires_python	>=3.6
license	None
keywords
VCS
bugtrack_url
requirements	No requirements were recorded.

# npids

This package provides time- and space-efficient bi-directional lookups for identifiers.
Contents are mmap'd, eliminating most load times and allowing for efficient caching
through the file system.
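
To illustrate why memory-mapping avoids most load time, here is a minimal sketch that stores fixed-width records in a file and reads one record by offset through `mmap`. It is not the npids file format: the record width, the NUL padding, and the helper names `write_fixed` and `read_fixed` are assumptions made for illustration only.

```python
import mmap
import os
import tempfile

WIDTH = 8  # assumed fixed record width in bytes (hypothetical)


def write_fixed(ids, path):
    """Write each ID as a NUL-padded, fixed-width record."""
    with open(path, 'wb') as f:
        for doc_id in ids:
            raw = doc_id.encode('utf-8')
            if len(raw) > WIDTH:
                raise ValueError(f'{doc_id!r} exceeds {WIDTH} bytes')
            f.write(raw.ljust(WIDTH, b'\x00'))


def read_fixed(path, index):
    """Read record `index` by offset; only the touched pages are read."""
    with open(path, 'rb') as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            start = index * WIDTH
            return mm[start:start + WIDTH].rstrip(b'\x00').decode('utf-8')


path = os.path.join(tempfile.mkdtemp(), 'ids.bin')
write_fixed(['id1', 'id2', 'id3'], path)
print(read_fixed(path, 2))  # -> id3
```

Because the operating system pages the file in on demand and keeps recently used pages cached, opening the file costs almost nothing regardless of its size.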

## Motivation

It's often helpful to map an external string identifier to an integer index and vice versa.
Existing techniques for doing this in Python are either slow or require a lot of memory.

## Getting Started

Install via pip:

```bash
pip install npids
```

Build a lookup:

```python
from npids import Lookup
Lookup.build(['id1', 'id2', 'id3'], 'path/to/lookup.npids')
```

Perform forward lookups (index to ID):

```python
import numpy as np
from npids import Lookup

lookup = Lookup('path/to/lookup.npids')
# individual indices
lookup.fwd[0] # -> 'id1'
lookup.fwd[2] # -> 'id3'
# multiple indices
lookup.fwd[0,1] # -> ['id1', 'id2']
# works with numpy too
lookup.fwd[np.array([0,1])] # -> array(['id1', 'id2'], dtype='<U3')
```

Perform inverse lookups (ID to index):

```python
import numpy as np
from npids import Lookup

lookup = Lookup('path/to/lookup.npids')
# individual IDs
lookup.inv['id1'] # -> 0
lookup.inv['id3'] # -> 2
# multiple IDs
lookup.inv['id1', 'id3'] # -> [0, 2]
# works with numpy too
lookup.inv[np.array(['id1', 'id3'])] # -> array([0, 2])
```

That's about it!

## Codecs

The following codecs are currently supported for forward and inverse lookups. The file format is
flexible, allowing new codecs to be added in the future.

Forward:

 - `fixedbytes`: Every item is stored as a fixed number of bytes (with optional prefix). This
   serves as a fallback if other forward codecs do not work.
 - `intsequence`: A sequence of integers (e.g., 49, 50, 51) is identified (with optional prefix); only
   metadata about the sequence is stored.
 - `intstored`: Integers are identified (with optional prefix), but they are not in a periodic sequence
   (e.g., 49, 55, 21). The integer values are encoded and stored.
 - `uuid`: UUIDs are identified (with optional prefix). The byte values of the UUIDs are stored.
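
As a rough illustration of the `intsequence` idea, the sketch below detects a run of consecutive integers (with an optional prefix) and keeps only a prefix, a start value, and a count. The `IntSequence` class and the `detect` function are hypothetical names, and the real codec's detection rules and on-disk metadata may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IntSequence:
    """Metadata that stands in for the whole ID list (hypothetical layout)."""
    prefix: str
    start: int
    count: int

    def fwd(self, index):
        """Compute the ID at `index` instead of reading it from storage."""
        if not 0 <= index < self.count:
            raise IndexError(index)
        return f'{self.prefix}{self.start + index}'


def detect(ids, prefix=''):
    """Return an IntSequence if ids are prefix + consecutive integers, else None."""
    if not ids:
        return None
    values = []
    for doc_id in ids:
        digits = doc_id[len(prefix):]
        if not doc_id.startswith(prefix) or not digits.isdecimal():
            return None
        value = int(digits)
        if str(value) != digits:  # e.g. leading zeros would not round-trip
            return None
        values.append(value)
    start = values[0]
    if values != list(range(start, start + len(values))):
        return None
    return IntSequence(prefix, start, len(values))


seq = detect(['49', '50', '51'])
print(seq)         # -> IntSequence(prefix='', start=49, count=3)
print(seq.fwd(2))  # -> 51
print(detect(['49', '55', '21']))  # -> None (not consecutive)
```

Storing only this kind of metadata would keep the structure at a near-constant size no matter how many IDs it covers.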

Inverse:

 - `hash`: Hashes of every item are stored on disk, enabling O(1) lookups (but with extra storage).
   This serves as a fallback if other inverse codecs do not work.
 - `intsequence`: The values only consist of a single forward `intsequence` block; these values can be
   used to compute the indices.
 - `intstored`: The values consist of only `intstored` blocks with values in sorted order. These values
   can be deconstructed and looked up in the forward codec using a binary search.
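
The binary-search approach for sorted integer IDs can be sketched with the standard `bisect` module. The helper name `inv_sorted` and the plain Python list standing in for the stored values are assumptions for illustration; this is not the npids implementation.

```python
from bisect import bisect_left


def inv_sorted(values, doc_id, prefix=''):
    """Return the index of `doc_id` in a sorted list of integer values."""
    digits = doc_id[len(prefix):]
    if not doc_id.startswith(prefix) or not digits.isdecimal():
        raise KeyError(doc_id)
    target = int(digits)
    if str(target) != digits:  # reject forms such as leading zeros
        raise KeyError(doc_id)
    pos = bisect_left(values, target)  # O(log n) comparisons
    if pos == len(values) or values[pos] != target:
        raise KeyError(doc_id)
    return pos


values = [21, 49, 55]  # sorted, so the sort position is also the index
print(inv_sorted(values, 'D49', prefix='D'))  # -> 1
```

Since the values are already stored for forward lookups, this kind of inverse lookup needs no additional hash table on disk, at the cost of O(log n) rather than O(1) lookups.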

## Benchmarks

The following benchmarks test the speed of building, forward/inverse lookups (10k random lookups,
both "cold" and "hot"), and the size of the structure. Values marked with `*` include additional
overhead that is not directly related to ID lookups; namely, the full engines (Lucene and Terrier)
also index document content.

 - `npids`: This software
 - `inmem`: A simple Python lookup structure in memory (a list and a dict), backed by a plain text file
   that is read into memory
 - `Terrier`: Terrier engine accessed via the pyterrier package
 - `Lucene`: Apache Lucene accessed via the pyserini package
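
For reference, an `inmem`-style baseline might look like the following sketch: a plain text file with one ID per line, read fully into a list (index to ID) and a dict (ID to index). The benchmark's actual code is not shown here, so the function name `load_inmem` and the one-ID-per-line file layout are assumptions.

```python
import os
import tempfile


def load_inmem(path):
    """Read one ID per line into a list (index -> ID) and a dict (ID -> index)."""
    with open(path, encoding='utf-8') as f:
        fwd = [line.rstrip('\n') for line in f]
    inv = {doc_id: i for i, doc_id in enumerate(fwd)}
    return fwd, inv


# Write a tiny example file, then load it entirely into memory.
path = os.path.join(tempfile.mkdtemp(), 'ids.txt')
with open(path, 'w', encoding='utf-8') as f:
    f.write('id1\nid2\nid3\n')

fwd, inv = load_inmem(path)
print(fwd[2], inv['id1'])  # -> id3 0
```

Lookups in such a structure are fast, but every ID has to be read and hashed before the first lookup can be served.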

The benchmarks show that `npids` is a reasonable choice for performing ID lookups.
Although it is a bit slower than loading them all into memory, it avoids the
considerable upfront cost of doing so. Compared to other approaches for loading them
from disk (Lucene, Terrier), it consumes far less storage, is built faster, and
(usually) performs the lookups considerably faster.
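
A timing harness along the following lines could be used to reproduce this kind of measurement. It is a sketch only: the helper `time_lookups` is hypothetical, and "cold" is assumed here to mean the first pass over a batch and "hot" a repeat of the same batch. A truly cold measurement would also require an empty operating-system page cache, which this sketch does not attempt.

```python
import random
import time


def time_lookups(lookup, keys):
    """Return (cold_seconds, hot_seconds) for the same batch run twice."""
    start = time.perf_counter()
    for key in keys:
        lookup(key)
    cold = time.perf_counter() - start

    start = time.perf_counter()
    for key in keys:
        lookup(key)
    hot = time.perf_counter() - start
    return cold, hot


# Stand-in lookup structure: a plain dict from ID to index.
ids = [f'D{i}' for i in range(100_000)]
inv = {doc_id: i for i, doc_id in enumerate(ids)}

rng = random.Random(0)  # fixed seed so the batch is repeatable
keys = rng.choices(ids, k=10_000)  # 10k random lookups
cold, hot = time_lookups(inv.__getitem__, keys)
print(f'cold={cold * 1000:.1f}ms hot={hot * 1000:.1f}ms')
```

Any callable that accepts a key can be passed as `lookup`, so the same harness applies to forward and inverse lookups alike.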

[`msmarco-passage`](https://ir-datasets.com/msmarco-passage) (8.8M docnos: `0`, `1`, `2`, ...)

| System   | Build Time | Cold Fwd | Hot Fwd | Cold Inv | Hot Inv | File Size |
|----------|-----------:|---------:|--------:|---------:|--------:|----------:|
| inmem    |      5.95s |      4ms |     1ms |      6ms |     2ms |     1.3GB |
| `npids`  |     13.88s |      6ms |     6ms |      4ms |     2ms |      206B |
| Lucene   |   * 55.39s |    119ms |    53ms |    194ms |    60ms | * 130.3MB |
| Terrier  |    * 3m53s |    121ms |   107ms |    1.60s |   218ms | * 502.9MB |

[`msmarco-document`](https://ir-datasets.com/msmarco-document) (3.2M docnos: `D1555982`, `D301595`, `D1359209`, ...)

| System   | Build Time | Cold Fwd | Hot Fwd | Cold Inv | Hot Inv | File Size |
|----------|-----------:|---------:|--------:|---------:|--------:|----------:|
| inmem    |      1.44s |      3ms |     1ms |      5ms |     2ms |    27.9MB |
| `npids`  |     13.02s |      6ms |     5ms |      8ms |     8ms |    42.5MB |
| Lucene   |   * 25.57s |    142ms |    61ms |    162ms |    62ms |  * 67.6MB |
| Terrier  |    * 1m26s |    111ms |   103ms |    866ms |   197ms | * 195.0MB |

[`hc4/fa`](https://ir-datasets.com/hc4#hc4/fa) (486k docnos: `9064520f-bc4d-4118-a30e-7d99f5adc612`, `e34ce085-cc13-4a1f-90e4-81a7fbfd7f0d`, `fa2fc4eb-4f97-4bee-bf92-a7330a80c66f`, ...)

| System   | Build Time | Cold Fwd | Hot Fwd | Cold Inv | Hot Inv | File Size |
|----------|-----------:|---------:|--------:|---------:|--------:|----------:|
| inmem    |      0.14s |      2ms |     1ms |      5ms |     1ms |    18.0MB |
| `npids`  |      2.81s |     21ms |    20ms |     32ms |    31ms |    11.8MB |
| Lucene   |    * 4.26s |    145ms |    79ms |    163ms |    75ms |  * 49.4MB |
| Terrier  |   * 14.76s |    125ms |   107ms |    564ms |   187ms |  * 85.1MB |

Raw data

{
    "_id": null,
    "home_page": null,
    "name": "npids",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.6",
    "maintainer_email": null,
    "keywords": null,
    "author": "Sean MacAvaney",
    "author_email": "sean.macavaney@glasgow.ac.uk",
    "download_url": "https://files.pythonhosted.org/packages/55/a7/b76603059ae501c75ed80be365e34c828c10a5d1ba4628d8173209654b82/npids-0.1.2.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": null,
    "summary": null,
    "version": "0.1.2",
    "project_urls": null,
    "split_keywords": [],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "5f7c66bb9e8f03c335da75f3ee3566e8070a778f1f0e9125665978290276555b",
                "md5": "979cc41a6f9b81e916ceff40d3a16c1d",
                "sha256": "16136b196bbb411555e831f2a16062c121649049d17828c373aded37fe16ffba"
            },
            "downloads": -1,
            "filename": "npids-0.1.2-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "979cc41a6f9b81e916ceff40d3a16c1d",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.6",
            "size": 26379,
            "upload_time": "2025-02-03T16:51:20",
            "upload_time_iso_8601": "2025-02-03T16:51:20.403416Z",
            "url": "https://files.pythonhosted.org/packages/5f/7c/66bb9e8f03c335da75f3ee3566e8070a778f1f0e9125665978290276555b/npids-0.1.2-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "55a7b76603059ae501c75ed80be365e34c828c10a5d1ba4628d8173209654b82",
                "md5": "4ea49f572447798a59e199dd16153ba3",
                "sha256": "fb5de36f754bf37eb4dfbf2bceec7933f96d4b4c77dc1241bac8c78da5c1dde3"
            },
            "downloads": -1,
            "filename": "npids-0.1.2.tar.gz",
            "has_sig": false,
            "md5_digest": "4ea49f572447798a59e199dd16153ba3",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.6",
            "size": 21337,
            "upload_time": "2025-02-03T16:51:25",
            "upload_time_iso_8601": "2025-02-03T16:51:25.570638Z",
            "url": "https://files.pythonhosted.org/packages/55/a7/b76603059ae501c75ed80be365e34c828c10a5d1ba4628d8173209654b82/npids-0.1.2.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-02-03 16:51:25",
    "github": false,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "lcname": "npids"
}
