hypergrep


- Name: hypergrep
- Version: 3.2.0
- Summary: Utilities for rapid text file processing using Intel Hyperscan in Python
- Upload time: 2024-03-17 17:43:53
- Author: David Fritz
- Requires Python: >=3.10
- License: MIT
- Keywords: regex, logs, hyperscan
- Requirements: No requirements were recorded.

# HyperGrep

[![os: linux](https://img.shields.io/badge/os-linux-blue)](https://docs.python.org/3.10/)
[![python: 3.10+](https://img.shields.io/badge/python-3.10_|_3.11-blue)](https://devguide.python.org/versions)
[![python style: google](https://img.shields.io/badge/python%20style-google-blue)](https://google.github.io/styleguide/pyguide.html)
[![imports: isort](https://img.shields.io/badge/%20imports-isort-%231674b1?style=flat&labelColor=ef8336)](https://github.com/PyCQA/isort)
[![code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)
[![code style: pycodestyle](https://img.shields.io/badge/code%20style-pycodestyle-green)](https://github.com/PyCQA/pycodestyle)
[![doc style: pydocstyle](https://img.shields.io/badge/doc%20style-pydocstyle-green)](https://github.com/PyCQA/pydocstyle)
[![static typing: mypy](https://img.shields.io/badge/static_typing-mypy-green)](https://github.com/python/mypy)
[![linting: pylint](https://img.shields.io/badge/linting-pylint-yellowgreen)](https://github.com/PyCQA/pylint)
[![testing: pytest](https://img.shields.io/badge/testing-pytest-yellowgreen)](https://github.com/pytest-dev/pytest)
[![security: bandit](https://img.shields.io/badge/security-bandit-black)](https://github.com/PyCQA/bandit)
[![license: MIT](https://img.shields.io/badge/license-MIT-lightgrey)](LICENSE)
![maintenance: deprecated](https://img.shields.io/badge/Maintenance%20Status-Deprecated-yellow.svg)

> **Note**: This project has been replaced by [VectorGrep](https://github.com/pyranha-labs/vectorgrep).
No additional features or enhancements will be made to this library. Due to licensing changes in
Intel Hyperscan starting in 5.5, all future development besides bug fixes will be dedicated to maintaining
the [Vectorscan](https://github.com/VectorCamp/vectorscan/) version of this library. Vectorscan/VectorGrep
also provides more options for increasing portability and supporting multiple architectures.

HyperGrep is a fast (Hyperspeed) Global Regular Expression Processing library for Python. It uses Intel Hyperscan
to maximize performance, and can be used with multi-threaded or multi-processed applications. While a standard grep
is designed to print, this library is designed to allow full control over processing matches. The library supports scanning
plaintext, gzip, and zstd compressed files for regular expressions, and customizing the action to take on each match.

For full information on the amazing performance that can be obtained through Intel Hyperscan, refer to:  
[Hyperscan](https://github.com/intel/hyperscan)
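
The three input formats mentioned above can be told apart by their leading magic bytes. The following is a minimal sketch using only the standard library; `detect_format` and `detect_file_format` are hypothetical helper names, and this is not necessarily how HyperGrep itself selects a reader.

```python
# Magic numbers defined by the gzip (RFC 1952) and Zstandard frame formats.
GZIP_MAGIC = b"\x1f\x8b"
ZSTD_MAGIC = b"\x28\xb5\x2f\xfd"


def detect_format(header: bytes) -> str:
    """Classify a file from its first few bytes."""
    if header.startswith(GZIP_MAGIC):
        return "gzip"
    if header.startswith(ZSTD_MAGIC):
        return "zstd"
    # Anything else is treated as uncompressed text in this sketch.
    return "plaintext"


def detect_file_format(path: str) -> str:
    """Read only the first four bytes of a file and classify it."""
    with open(path, "rb") as handle:
        return detect_format(handle.read(4))
```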


## Table Of Contents

  * [Key Features](#key-features)
  * [Compatibility](#compatibility)
  * [Getting Started](#getting-started)
    * [Installation](#installation)
    * [Examples](#examples)
    * [Contributing](#contributing)
    * [Advanced Guides](#advanced-guides)
  * [FAQ](#faq)


## Key Features

- **Simplicity**
  - No experience with Hyperscan required. Provides "grep"-style interfaces.
  - No external dependencies, and no building required (on natively supported platforms).
  - Built-in support for compressed and uncompressed files.
- **Speed**
  - Uses Hyperscan, a high-performance multiple regex matching library.
  - Performs read and regex operations outside Python.
  - Batches results for Python, reducing overhead (customizable).
- **Parallelism**
  - Bypasses GIL (Global Interpreter Lock) during read and regex operations to allow proper multithreading.
  - Python consumer threads (callbacks) are able to handle many producer threads (readers).
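
The batching and producer/consumer ideas above can be illustrated with a small standard-library sketch. It is only a model of the shape of the design: Python's `re` stands in for Hyperscan (and, unlike the native backend, does not release the GIL), and `BATCH_SIZE`, `produce`, `consume`, and `run` are hypothetical names rather than part of HyperGrep.

```python
import queue
import re
import threading

BATCH_SIZE = 2  # Hypothetical batch size, kept small for the demo.
DONE = object()  # Sentinel marking the end of one producer's output.


def produce(lines: list, pattern: re.Pattern, out: queue.Queue) -> None:
    """Scan lines and hand matches to the consumer in batches."""
    batch = []
    for line in lines:
        if pattern.search(line):
            batch.append(line)
            if len(batch) >= BATCH_SIZE:
                out.put(batch)
                batch = []
    if batch:
        out.put(batch)  # Flush the final partial batch.
    out.put(DONE)


def consume(out: queue.Queue, producers: int) -> tuple:
    """Drain batches until every producer is done; return (matches, batches)."""
    matches = batches = 0
    while producers:
        item = out.get()
        if item is DONE:
            producers -= 1
            continue
        batches += 1
        matches += len(item)
    return matches, batches


def run(sources: list, regex: str) -> tuple:
    """Start one producer thread per source and consume on the calling thread."""
    pattern = re.compile(regex)
    out = queue.Queue()
    threads = [threading.Thread(target=produce, args=(src, pattern, out)) for src in sources]
    for thread in threads:
        thread.start()
    result = consume(out, len(threads))
    for thread in threads:
        thread.join()
    return result
```

Handing over one batch per queue operation, rather than one match, is what keeps the per-match overhead on the Python side low.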


## Compatibility

- Supports Python 3.10+
- Supports Linux systems with x86_64 architecture
  - Tested on Ubuntu Trusty (14.04) and above
  - Other Linux distros may work, but are not guaranteed
  - It may be possible to build manually on Windows/OSX
  - More platforms are planned to be supported (natively) in the future
- Some regex constructs are not supported by Hyperscan in order to guarantee stable performance
  - For more information refer to: [Unsupported Constructs](https://intel.github.io/hyperscan/dev-reference/compilation.html#unsupported-constructs)
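
Two well-known families of constructs that Hyperscan does not support are backreferences and lookaround assertions. The sketch below is a rough, text-based pre-check for those two only. `unsupported_hints` is a hypothetical helper, the check is heuristic (it can misfire inside character classes, for example), and the linked Hyperscan documentation remains the authoritative list.

```python
import re

# An unescaped "(?=", "(?!", "(?<=" or "(?<!" opens a lookaround assertion.
_LOOKAROUND = re.compile(r"(?<!\\)\(\?<?[=!]")
# A backslash followed by 1-9, where the backslash itself is not escaped.
_BACKREFERENCE = re.compile(r"(?<!\\)(?:\\\\)*\\[1-9]")


def unsupported_hints(pattern: str) -> list:
    """Return names of likely unsupported constructs found in a pattern."""
    hints = []
    if _BACKREFERENCE.search(pattern):
        hints.append("backreference")
    if _LOOKAROUND.search(pattern):
        hints.append("lookaround")
    return hints
```

A pattern that produces no hints may still be rejected at compile time, so this is best treated as an early warning rather than a validator.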


## Getting Started

### Installation

- Install HyperGrep via pip:
    ```shell
    pip install hypergrep
    ```

- Or via git clone:
    ```shell
    git clone <path to fork>
    cd hypergrep
    pip install .
    ```

- Or build and install from wheel:
    ```shell
    # Build locally.
    git clone <path to fork>
    cd hypergrep
    make wheel
    
    # Push dist/hypergrep*.tar.gz to environment where it will be installed.
    pip install dist/hypergrep*.tar.gz
    ```
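
Before installing, the natively supported targets listed under Compatibility (Python 3.10+, Linux, x86_64) can be checked from Python. This is a minimal sketch; `compatibility_problems` is a hypothetical helper, and a reported problem only means the prebuilt package is unlikely to work, not that a manual build is impossible.

```python
import platform
import sys


def compatibility_problems(python_version: tuple, system: str, machine: str) -> list:
    """List the ways an environment differs from the natively supported one."""
    problems = []
    if tuple(python_version) < (3, 10):
        problems.append("Python 3.10 or newer is required")
    if system != "Linux":
        problems.append(f"unsupported OS: {system}")
    if machine != "x86_64":
        problems.append(f"unsupported architecture: {machine}")
    return problems


if __name__ == "__main__":
    # Inspect the interpreter and machine that would run the install.
    for problem in compatibility_problems(
        sys.version_info[:2], platform.system(), platform.machine()
    ):
        print(problem)
```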

### Examples

- Read one file with the example single-threaded command:
    ```shell
    # hypergrep/scanner.py <regex> <file>
    hypergrep/scanner.py pattern ./hypergrep/scanner.py
    ```

- Read multiple files with the multithreaded command (a drop-in replacement for `grep` where patterns are compatible):
    ```shell
    # From install:
    # hypergrep <regex> <file(s)>
    hypergrep pattern ./hypergrep/scanner.py

    # From package:
    # hypergrep/multiscanner.py <regex> <file>
    hypergrep/multiscanner.py pattern ./hypergrep/scanner.py
    ```

- Collect all matches from a file, similar to grep, and perform a custom operation on results:
    ```python
    import hypergrep
    
    file = "./hypergrep/scanner.py"
    pattern = 'pattern'
    
    results, return_code = hypergrep.grep(file, [pattern])
    for index, line in results:
        print(f'{index}: {line}')
    ```
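
- Tally collected results by their first element. This is a minimal sketch that assumes only what the loop above shows, namely that `results` yields `(index, line)` pairs; `count_by_index` is a hypothetical helper and the sample data is made up:
    ```python
    from collections import Counter


    def count_by_index(results) -> Counter:
        """Count how many result pairs share each index value."""
        return Counter(index for index, _line in results)


    # Stand-in data shaped like the pairs returned by hypergrep.grep().
    sample = [(0, "first line"), (0, "second line"), (1, "third line")]
    counts = count_by_index(sample)
    ```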

- Manually scan a file and perform a custom operation on match:
    ```python
    import hypergrep
    
    file = "./hypergrep/scanner.py"
    pattern = 'pattern'

    def on_match(matches: list, count: int) -> None:
        for index in range(count):
            match = matches[index]
            line = match.line.decode(errors='ignore')
            print(f'Custom print: {line.rstrip()}')
    
    hypergrep.scan(file, [pattern], on_match)
    ```
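
- Accumulate matches with a stateful callback instead of printing them. This sketch relies only on the callback shape shown above (`matches`, `count`, and a `line` attribute holding bytes); `MatchCollector` is a hypothetical class, and the stand-in batch replaces a real scan so the example runs without the library. With HyperGrep installed, the collector would be passed as the third argument to `hypergrep.scan`:
    ```python
    from types import SimpleNamespace


    class MatchCollector:
        """Callable that stores decoded lines from each batch of matches."""

        def __init__(self) -> None:
            self.lines = []

        def __call__(self, matches: list, count: int) -> None:
            # Decode and copy each line, since it is not stated whether the
            # batch stays valid after the callback returns.
            for index in range(count):
                line = matches[index].line.decode(errors="ignore")
                self.lines.append(line.rstrip())


    # Stand-in batch with the same attribute the real match objects expose.
    fake_batch = [
        SimpleNamespace(line=b"first pattern\n"),
        SimpleNamespace(line=b"second pattern\n"),
    ]
    collector = MatchCollector()
    collector(fake_batch, len(fake_batch))
    ```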

- Override the `libhs` and/or `libzstd` libraries to use files outside the package.
This must be called before any other usage of `hypergrep`:
    ```python
    import hypergrep

    hypergrep.configure_libraries(
        libhs='/home/myuser/libhs.so.mybuild',
        libzstd='/home/myuser/libzstd.so.mybuild',
    )
    ```
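
Because the override has to happen before any other use of the library, it can help to validate the paths first. The following is a minimal sketch; `library_overrides` is a hypothetical helper, and passing only one of the two keyword arguments to `configure_libraries` is an assumption based on the "and/or" wording above.

```python
from pathlib import Path


def library_overrides(**candidates) -> dict:
    """Return only the overrides that were given, verifying each file exists."""
    overrides = {}
    for name, path in candidates.items():
        if path is None:
            continue  # No override requested for this library.
        resolved = Path(path).expanduser()
        if not resolved.is_file():
            raise FileNotFoundError(f"{name} override not found: {resolved}")
        overrides[name] = str(resolved)
    return overrides
```

The result could then be unpacked into the call, for example `hypergrep.configure_libraries(**library_overrides(libhs='/home/myuser/libhs.so.mybuild', libzstd=None))`.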

### Contributing

Refer to the [Contributing Guide](CONTRIBUTING.md) for information on how to contribute to this project.

### Advanced Guides

Refer to [How Tos](docs/HOW_TO.md) for more advanced topics, such as building the shared library objects.


## FAQ

#### Q: How does HyperGrep compare to other Hyperscan Python libraries?

**A:** HyperGrep has a specific goal: provide a high-performance "grep"-like interface in Python,
but with more control. It is not intended to be a full set of bindings to Hyperscan. If you need
full control over the low-level backend, there are other Python libraries intended for that use case. Here are
a few of the reasons for the focused goal of this library:

- Simplify developer integration.
  - No experience with Hyperscan required.
  - Familiarity with `grep` variants beneficial, but not required.
- Avoid messy subprocess chains common in "parallel grep" implementations.
  - Commands like `zgrep` are actually a `zcat` + `grep`. This can lead to 3+ processes per file read.
  - Subprocessing is messy in general, so it is best to minimize its use.
- Optimize performance.
  - Reduce callbacks to/from Python to reduce overhead.
  - Allow true multithreading during read and regex matching.
  - Provide the pattern matched in multi-regex searches, without having to repeat the search in Python.
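
To make the subprocess point concrete, a compressed file can be decompressed and matched inside a single process. The sketch below uses the standard library's `gzip` and `re` as stand-ins; it is not HyperGrep's implementation and will not approach its speed, and `count_matches_gzip` is a hypothetical helper.

```python
import gzip
import re


def count_matches_gzip(path: str, regex: str) -> int:
    """Count matching lines in a gzip file without spawning zcat or grep."""
    pattern = re.compile(regex.encode())
    count = 0
    # gzip.open streams decompressed bytes, so the file is never fully in memory.
    with gzip.open(path, "rb") as handle:
        for line in handle:
            if pattern.search(line):
                count += 1
    return count
```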

When it comes to performance, here is an example of the benefit of this design. Due to the performance of
Hyperscan, it is also often faster than native `grep` variants, even while using Python. Scenario setup:
- 2.10GHz Intel x86_64 Processor
- ~17M line file (~300M gzip compressed, ~3G uncompressed).
- ~800 PCRE patterns.
- Counting only, no extra processing of lines.
- Each job run 5 times and averaged (lower is better).

|   | Scenario (Uncompressed timings in parentheses) | HyperGrep     | Full bindings     | zgrep (grep)  |
|---|------------------------------------------------|---------------|-------------------|---------------|
| 1 | ~90K matches, 1 pattern                        | 8.2s (2.5s)   | 22.8s (15.5s)     | 12.5s (5.2s)  |
| 2 | ~900K matches, 10 patterns                     | 9.7s (3.8s)   | 25.7s (16.8s)     | 19.8s (17.3s) |
| 3 | ~15M matches, ~800 patterns                    | 44.2s (38.1s) | 73.5s (57.7s)     | *             |
| 4 | Scenario #3 (x4 files), 1 process (4 threads)  | 49.6s (46.8s) | 1432.6s (1302.2s) | *             |

\* GNU grep does not allow multiple PCRE patterns natively, and concatenation via "or" failed.
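
The relative speedups implied by the table can be computed directly from its compressed-file timings. This is a small arithmetic sketch; the numbers are copied from the table, while `TIMINGS` and `speedups` are hypothetical names.

```python
# Compressed-file timings in seconds, copied from the table above.
TIMINGS = {
    1: {"HyperGrep": 8.2, "Full bindings": 22.8, "zgrep": 12.5},
    2: {"HyperGrep": 9.7, "Full bindings": 25.7, "zgrep": 19.8},
    3: {"HyperGrep": 44.2, "Full bindings": 73.5},
    4: {"HyperGrep": 49.6, "Full bindings": 1432.6},
}


def speedups(timings: dict) -> dict:
    """Express every other tool's time as a multiple of HyperGrep's time."""
    result = {}
    for scenario, row in timings.items():
        base = row["HyperGrep"]
        result[scenario] = {
            name: round(seconds / base, 1)
            for name, seconds in row.items()
            if name != "HyperGrep"
        }
    return result


for scenario, ratios in speedups(TIMINGS).items():
    print(scenario, ratios)
```

In the four-thread scenario the full bindings take roughly 29 times as long as HyperGrep, which is where the multithreading design matters most.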

            

Raw data

{
    "_id": null,
    "home_page": "",
    "name": "hypergrep",
    "maintainer": "",
    "docs_url": null,
    "requires_python": ">=3.10",
    "maintainer_email": "",
    "keywords": "regex,logs,hyperscan",
    "author": "David Fritz",
    "author_email": "",
    "download_url": "https://files.pythonhosted.org/packages/c3/2b/737f83b33480e0378d897240e9285ed325c06e9aff409b228acc6efef72f/hypergrep-3.2.0.tar.gz",
    "platform": "Linux",
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "Utilities for rapid text file processing using Intel Hyperscan in Python",
    "version": "3.2.0",
    "project_urls": {
        "Changelog": "https://github.com/pyranha-labs/hypergrep/releases",
        "Home": "https://github.com/pyranha-labs/hypergrep"
    },
    "split_keywords": [
        "regex",
        "logs",
        "hyperscan"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "68f5f0dabce131280865ea5eb084daeac129403a0c61ad118326eddffad10a0d",
                "md5": "04d1ba03636afc99ef7361dc3d9e4fd4",
                "sha256": "c2dc14c5db37a899b6a1a9244d297f0a83795addf57ab494a05ca6bb6325d32d"
            },
            "downloads": -1,
            "filename": "hypergrep-3.2.0-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "04d1ba03636afc99ef7361dc3d9e4fd4",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.10",
            "size": 3664243,
            "upload_time": "2024-03-17T17:43:50",
            "upload_time_iso_8601": "2024-03-17T17:43:50.716431Z",
            "url": "https://files.pythonhosted.org/packages/68/f5/f0dabce131280865ea5eb084daeac129403a0c61ad118326eddffad10a0d/hypergrep-3.2.0-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "c32b737f83b33480e0378d897240e9285ed325c06e9aff409b228acc6efef72f",
                "md5": "1b72d26f9f2cd4e313eee966f30f524c",
                "sha256": "45251a60eab41547459a389dd29dee10aade1aa060db784ea11ba1bc4466f213"
            },
            "downloads": -1,
            "filename": "hypergrep-3.2.0.tar.gz",
            "has_sig": false,
            "md5_digest": "1b72d26f9f2cd4e313eee966f30f524c",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.10",
            "size": 3644047,
            "upload_time": "2024-03-17T17:43:53",
            "upload_time_iso_8601": "2024-03-17T17:43:53.044396Z",
            "url": "https://files.pythonhosted.org/packages/c3/2b/737f83b33480e0378d897240e9285ed325c06e9aff409b228acc6efef72f/hypergrep-3.2.0.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-03-17 17:43:53",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "pyranha-labs",
    "github_project": "hypergrep",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "lcname": "hypergrep"
}
        