zindex-py


| Field | Value |
| --- | --- |
| Name | zindex-py |
| Version | 0.0.1 |
| Home page | https://github.com/hariharan-devarajan/zindex |
| Summary | Indexer for GZIP specially built for DLIO Profiler. |
| Upload time | 2023-11-18 20:15:06 |
| Author | Hariharan Devarajan (Hari) |
| Requires Python | >=3.7 |
| Keywords | profiler, deep learning, i/o, benchmark, npz, pytorch benchmark, tensorflow benchmark |
| Requirements | No requirements were recorded. |

### DISCLAIMER

This repo is a fork of the original repo located at https://github.com/mattgodbolt/zindex.
We modified it so that it works cohesively with DLIO Profiler (https://github.com/hariharan-devarajan/dlio-profiler).


`zindex` creates and queries an index on a compressed, line-based text file in a
time- and space-efficient way.

### The itch I had

I have many multi-gigabyte gzipped text log files and I'd like to be able to find data in them by an index.
There's a key on each line that a simple regex can pull out. However, finding a
particular record requires `zgrep`, which takes ages as it has to decompress and scan through
gigabytes of preceding data to get to each record.

Enter `zindex`, which builds an index and also stores decompression checkpoints along the way,
allowing lightning-fast random access. Pulling out single lines, either by
line number or by an index entry, is then almost instant, even for huge files. The indices
themselves are small too, typically ~10% of the compressed file size for a simple unique
numeric index.
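
To make the checkpoint idea concrete, here is a minimal Python sketch of random access into a gzip stream. It is not `zindex`'s implementation or on-disk format: it keeps in-memory snapshots of a `zlib` decompressor (via `Decompress.copy()`) where `zindex` persists checkpoints to its index file. The names `build_checkpoints`, `read_at`, and `CHUNK` are hypothetical.

```python
import bisect
import zlib

CHUNK = 1024  # compressed bytes fed to zlib per step (arbitrary choice)


def build_checkpoints(blob, spacing):
    """Scan a gzip byte string once, snapshotting the decompressor state
    roughly every `spacing` uncompressed bytes.

    Returns a list of (uncompressed_offset, compressed_offset, snapshot).
    """
    d = zlib.decompressobj(31)  # wbits=31 selects the gzip container
    points = [(0, 0, d.copy())]
    out_pos = 0
    for in_pos in range(0, len(blob), CHUNK):
        out_pos += len(d.decompress(blob[in_pos:in_pos + CHUNK]))
        if out_pos - points[-1][0] >= spacing and not d.eof:
            points.append((out_pos, in_pos + CHUNK, d.copy()))
    return points


def read_at(blob, points, offset, length):
    """Return `length` uncompressed bytes starting at `offset`, resuming
    from the nearest checkpoint at or before `offset`."""
    i = bisect.bisect_right([p[0] for p in points], offset) - 1
    out_pos, in_pos, snapshot = points[i]
    d = snapshot.copy()  # copy so the stored snapshot stays reusable
    skip = offset - out_pos
    buf = b""
    while len(buf) < skip + length and in_pos < len(blob):
        buf += d.decompress(blob[in_pos:in_pos + CHUNK])
        in_pos += CHUNK
    return buf[skip:skip + length]
```

With checkpoints in place, a read only decompresses the data between the nearest checkpoint and the requested offset, rather than everything from the start of the file.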

## Creating an index

`zindex` needs to be told what part of each line constitutes the index. This can be done by
a regular expression, by field, or by piping each line through an external program.

By default, `zindex` creates an index named `file.gz.zindex` when asked to index `file.gz`.

Example: create an index on lines matching a numeric regular expression. The capture group
indicates the part that is to be indexed, and the options declare that each line has a unique, numeric index.

```bash
$ zindex file.gz --regex 'id:([0-9]+)' --numeric --unique
```
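
To see which part of a line that capture group selects, the following Python sketch applies the same pattern to a few made-up log lines. It only illustrates the regex semantics and is not how `zindex` is implemented. The helper `extract_key` and the sample lines are hypothetical, and both the search-anywhere matching and the conversion to `int` (mirroring `--numeric`) are assumptions.

```python
import re

KEY = re.compile(r"id:([0-9]+)")  # same pattern as in the command above


def extract_key(line):
    """Return the captured key as an int, or None if the line has no key."""
    match = KEY.search(line)
    return int(match.group(1)) if match else None


# Hypothetical log lines; the second one has no key and is skipped.
sample = ["ts=1 id:1023 GET /a", "ts=2 no key here", "ts=3 id:554 GET /b"]

# Map each key to the 1-based number of the line it was found on.
keys = {}
for line_no, line in enumerate(sample, start=1):
    key = extract_key(line)
    if key is not None:
        keys[key] = line_no
```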

Example: create an index on the second field of a CSV file:

```bash
$ zindex file.gz --delimiter , --field 2
```
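
As a rough illustration of field-based keys, this Python sketch picks the second field out of a delimited line. It assumes fields are numbered from 1 (consistent with the `fieldNum` description in the configuration section) and uses a plain split, so it does not handle quoted CSV fields; whether `zindex` does is not stated here. The helper name `field_key` is hypothetical.

```python
def field_key(line, delimiter=",", field=2):
    """Return the 1-based `field` of `line`, or None if the line is too short."""
    parts = line.rstrip("\n").split(delimiter)
    return parts[field - 1] if len(parts) >= field else None
```

For example, `field_key("2023-11-18,order-7,shipped\n")` yields `"order-7"`.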

Example: create an index on a JSON field `orderId.id` in any of the items in the document root's `actions` array (requires [jq](http://stedolan.github.io/jq/)).
The `jq` query creates an array of all the `orderId.id`s, then `join`s them with a space to ensure each individual line piped to jq creates a single line of output,
with multiple matches separated by spaces (which is the default separator).

```bash
$ zindex file.gz --pipe "jq --raw-output --unbuffered '[.actions[].orderId.id] | join(\" \")'"
```
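
For readers less familiar with `jq`, here is a Python sketch of roughly what that query computes for one input line: collect every `orderId.id` under `actions` and join them with spaces. The function name `order_ids` is hypothetical, and error handling for lines without an `actions` array is left out.

```python
import json


def order_ids(line):
    """Return the space-separated `orderId.id` values of one JSON line."""
    doc = json.loads(line)
    return " ".join(str(action["orderId"]["id"]) for action in doc["actions"])
```

Each space-separated value in the result becomes a separate key pointing at the same line.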

Multiple indices are supported, as is configuring index creation through a JSON configuration file; see below.

## Querying the index

The `zq` program is used to query an index.  It's given the name of the compressed file and a list of queries. For example:

```bash
$ zq file.gz 1023 4443 554
```

It's also possible to output by line number, so to print lines 1 and 1000 from a file:

```bash
$ zq file.gz --line 1 1000
```
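
When driving `zq` from a script, it can help to build the argument list programmatically. The sketch below only assembles the two command shapes shown above and does not run anything. The helper `zq_command` is hypothetical, and passing its result to `subprocess.run` assumes `zq` is on your `PATH`.

```python
import shlex


def zq_command(gz_path, queries, by_line=False):
    """Build the argument list for the `zq` invocations shown above."""
    args = ["zq", gz_path]
    if by_line:
        args.append("--line")  # treat the queries as line numbers
    args.extend(str(q) for q in queries)
    return args


def as_shell(args):
    """Render an argument list as a shell-safe command string."""
    return " ".join(shlex.quote(a) for a in args)
```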

## Building from source

`zindex` uses CMake for its build (with a bootstrapping `Makefile` on top) and requires a C++11-compatible compiler (GCC 4.8 or above, or clang 3.4 or above). It also requires `zlib`. With a suitable compiler available, building ought to be as simple as:

```bash
$ git clone https://github.com/mattgodbolt/zindex.git
$ cd zindex
$ make
```

Binaries are left in `build/Release`.

Additionally a static binary can be built if you're happy to dip your toe into CMake:

```bash
$ cd path/to/build/directory
$ cmake path/to/zindex/checkout/dir -DStatic:BOOL=On -DCMAKE_BUILD_TYPE=Release
$ make
```

## Multiple indices

To support more than one index, or for easier configuration than all the command-line flags that might be
needed, there is a JSON configuration format. Pass the `--config <yourconfigfile>.json` option and put something like this in the configuration file:

    { 
        "indexes": [
            {
                "type": "field",
                "delimiter": "\t",
                "fieldNum": 1
            },
            {
                "name": "secondary",
                "type": "field",
                "delimiter": "\t",
                "fieldNum": 2
            }
        ]
    }

This creates two indices, one on the first field and one on the second field, as delimited by tabs. One can
then specify which index to query with the `-i <index>` option of `zq`.
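
If the configuration is generated rather than written by hand, a short Python sketch like the following produces the same structure. It uses only the keys shown in the example above; the helper `field_index` is hypothetical.

```python
import json


def field_index(field_num, delimiter="\t", name=None):
    """Build one field-based entry for the "indexes" array."""
    entry = {"type": "field", "delimiter": delimiter, "fieldNum": field_num}
    if name is not None:
        entry["name"] = name  # the first index in the example has no name
    return entry


config = {"indexes": [field_index(1), field_index(2, name="secondary")]}

# json.dumps escapes the tab character as \t, matching the example above.
config_text = json.dumps(config, indent=4)
```

Writing `config_text` to a file and passing that file with `--config` should then be equivalent to the hand-written example.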

### Issues and feature requests

See the [issue tracker](https://github.com/mattgodbolt/zindex/issues) for TODOs and known bugs. Please raise bugs there, and feel free to submit suggestions there also.

Feel free to [contact me](mailto:matt@godbolt.org) if you prefer email over bug trackers.



            

Raw data

{
    "_id": null,
    "home_page": "https://github.com/hariharan-devarajan/zindex",
    "name": "zindex-py",
    "maintainer": "",
    "docs_url": null,
    "requires_python": ">=3.7",
    "maintainer_email": "",
    "keywords": "profiler,deep learning,I/O,benchmark,NPZ,pytorch benchmark,tensorflow benchmark",
    "author": "Hariharan Devarajan (Hari)",
    "author_email": "",
    "download_url": "",
    "platform": null,
    "bugtrack_url": null,
    "license": "",
    "summary": "Indexer for GZIP specially built for DLIO Profiler.",
    "version": "0.0.1",
    "project_urls": {
        "Bug Reports": "https://github.com/hariharan-devarajan/zindex/issues",
        "Homepage": "https://github.com/hariharan-devarajan/zindex",
        "Source": "https://github.com/hariharan-devarajan/zindex"
    },
    "split_keywords": [
        "profiler",
        "deep learning",
        "i/o",
        "benchmark",
        "npz",
        "pytorch benchmark",
        "tensorflow benchmark"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "f15bbd5a84d07202f2693f6606626095a0007f412a0803143979aa5747e7edec",
                "md5": "2e155a7ccba5445cb3461d5793f9f09e",
                "sha256": "c7bc1371aba2456de0781952ab6ff77039764c6f5d8bef635ee6d600e3dd4574"
            },
            "downloads": -1,
            "filename": "zindex_py-0.0.1-cp310-cp310-manylinux_2_34_x86_64.whl",
            "has_sig": false,
            "md5_digest": "2e155a7ccba5445cb3461d5793f9f09e",
            "packagetype": "bdist_wheel",
            "python_version": "cp310",
            "requires_python": ">=3.7",
            "size": 1363922,
            "upload_time": "2023-11-18T20:15:06",
            "upload_time_iso_8601": "2023-11-18T20:15:06.833826Z",
            "url": "https://files.pythonhosted.org/packages/f1/5b/bd5a84d07202f2693f6606626095a0007f412a0803143979aa5747e7edec/zindex_py-0.0.1-cp310-cp310-manylinux_2_34_x86_64.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "31d9c8eb596a8f053c0e92b5391953bfa857285abec2a1a2445a6e4ce99e3a3c",
                "md5": "83f7b3c42742faf6289530dfd8a05481",
                "sha256": "353e2b9b43f385bb017df2f13d9151d277681f6a21e124f1ae1e30fdfe6f2373"
            },
            "downloads": -1,
            "filename": "zindex_py-0.0.1-cp37-cp37m-manylinux_2_34_x86_64.whl",
            "has_sig": false,
            "md5_digest": "83f7b3c42742faf6289530dfd8a05481",
            "packagetype": "bdist_wheel",
            "python_version": "cp37",
            "requires_python": ">=3.7",
            "size": 1371017,
            "upload_time": "2023-11-18T20:15:16",
            "upload_time_iso_8601": "2023-11-18T20:15:16.032636Z",
            "url": "https://files.pythonhosted.org/packages/31/d9/c8eb596a8f053c0e92b5391953bfa857285abec2a1a2445a6e4ce99e3a3c/zindex_py-0.0.1-cp37-cp37m-manylinux_2_34_x86_64.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "3347e7ddcfc699cba0cf5d4ad92562016ca298bbecd39f739e9fb8408299b9d2",
                "md5": "31cadad1413d05546f09e3154c3869f8",
                "sha256": "84eb5c2e704075317b9160bbb9b44b8c814b49e4bd177923507e2ce7180da011"
            },
            "downloads": -1,
            "filename": "zindex_py-0.0.1-cp38-cp38-manylinux_2_34_x86_64.whl",
            "has_sig": false,
            "md5_digest": "31cadad1413d05546f09e3154c3869f8",
            "packagetype": "bdist_wheel",
            "python_version": "cp38",
            "requires_python": ">=3.7",
            "size": 1363676,
            "upload_time": "2023-11-18T20:15:26",
            "upload_time_iso_8601": "2023-11-18T20:15:26.752782Z",
            "url": "https://files.pythonhosted.org/packages/33/47/e7ddcfc699cba0cf5d4ad92562016ca298bbecd39f739e9fb8408299b9d2/zindex_py-0.0.1-cp38-cp38-manylinux_2_34_x86_64.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "eedbf2d4bf1d531d9ed4415761508fbc15683ecaaf9245a8b670c760306ac1cf",
                "md5": "2f86a850089c228b88fb5caa5eb68796",
                "sha256": "be3938938c30624e8bfe9c7ee08f60b871f2c8d0ffba226d21ff212c33c0b5c2"
            },
            "downloads": -1,
            "filename": "zindex_py-0.0.1-cp39-cp39-manylinux_2_34_x86_64.whl",
            "has_sig": false,
            "md5_digest": "2f86a850089c228b88fb5caa5eb68796",
            "packagetype": "bdist_wheel",
            "python_version": "cp39",
            "requires_python": ">=3.7",
            "size": 1364265,
            "upload_time": "2023-11-18T20:12:25",
            "upload_time_iso_8601": "2023-11-18T20:12:25.965613Z",
            "url": "https://files.pythonhosted.org/packages/ee/db/f2d4bf1d531d9ed4415761508fbc15683ecaaf9245a8b670c760306ac1cf/zindex_py-0.0.1-cp39-cp39-manylinux_2_34_x86_64.whl",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2023-11-18 20:15:06",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "hariharan-devarajan",
    "github_project": "zindex",
    "travis_ci": true,
    "coveralls": false,
    "github_actions": true,
    "lcname": "zindex-py"
}
        