boyermoore

Name	boyermoore JSON
Version	1.0.0 JSON
	download
home_page	http://github.com/eriknyquist/boyermoore
Summary	Boyer-moore in pure python, search for unicode strings in large files very quickly
upload_time	2022-12-17 04:53:03
maintainer
docs_url	None
author	Erik Nyquist
requires_python	>=3.7
license	Apache 2.0
keywords	boyermoore boyer-moore search filesearch fastsearch boyermorehorspool boyer-moore-horspool
VCS
bugtrack_url
requirements	No requirements were recorded.
Travis-CI	No Travis.
coveralls test coverage	No coveralls.

            
.. contents:: **Table Of Contents**

Boyer-Moore in pure python: search for unicode strings quickly in large files
*****************************************************************************

.. |tests_badge| image:: https://github.com/eriknyquist/boyermoore/actions/workflows/tests.yml/badge.svg
.. |cov_badge| image:: https://github.com/eriknyquist/boyermoore/actions/workflows/coverage.yml/badge.svg
.. |codeclimate_badge| image:: https://api.codeclimate.com/v1/badges/a5d499edc22f0a05c533/maintainability

|tests_badge| |cov_badge| |codeclimate_badge|


This is an implementation of the Boyer-Moore substring search algorithm in pure python.

It is a shameless copy-paste of the python reference code provided `here <https://en.wikipedia.org/wiki/Boyer%E2%80%93Moore_string-search_algorithm>`_,
with modifications to support the following additional features:

* Searching in files without reading the whole file into memory, allowing handling of large files
* Full unicode support

See the `API documentation <https://eriknyquist.github.io/boyermoore/>`_ for more details.

Installing
----------

Install from ``pip``.

::

    pip install boyermoore

Searching for all occurences of a substring in a file
-----------------------------------------------------

::

    >>> from boyermoore import search_file
    >>>
    >>> offsets = search_file("pattern!", "file.txt")                 # Find all occurrences of "pattern!" in file "file.txt"
    >>> offsets                                                       # Display found occurrences
    [12, 456, 10422]                                                  # Pattern occurs at byte offsets 12, 456, and 104222

Searching for the first occurence of a substring in a file
----------------------------------------------------------

::

    >>> from boyermoore import search_file
    >>>
    >>> offsets = search_file("pattern!", "file.txt", greedy=False)   # Find the first occurrence of "pattern!" in file "file.txt"
    >>> offsets                                                       # Display found occurrences
    [12]                                                              # First occurrence of pattern is at byte offset 12

Performance / Speed test
------------------------

The following section illustrates the average speed of the ``boyermoore.search_file``
function when searching for a unicode string in files of sizes ranging from 32MB to 4GB.

Test environment
################

The test was executed using Python 3.7.6 on a Windows 10 system with an Intel(R) Core(TM) i7-8700K CPU @ 3.70GHz
and 32 GB of RAM.

Test methodology
################

The test searches for all occurrences of a fixed unicode string in a series of test files.
The unicode string is:

::

    Hello नमस्ते Привет こんにちは

("Hello" in English, followed by the Hindi translation, followed by the Russian translation,
followed by the Japanese translation)

Each test file has 2 occurrences of the unicode string, one at the very beginning (byte offset of 0)
and one at the very end (byte offset of [file_length - pattern_length]).

Test results
############

The following table shows the times taken to search for all occurences of the unicode
string "Hello नमस्ते Привет こんにちは" inside test files of various sizes.

+-----------+----------------+
| File size | Time (seconds) |
+===========+================+
| 32 MB     | 0.31           |
+-----------+----------------+
| 64 MB     | 0.55           |
+-----------+----------------+
| 128 MB    | 1.07           |
+-----------+----------------+
| 256 MB    | 1.96           |
+-----------+----------------+
| 512 MB    | 3.87           |
+-----------+----------------+
| 1 GB      | 7.56           |
+-----------+----------------+
| 4 GB      | 31.01          |
+-----------+----------------+

Contributions
*************

Contributions are welcome, please open a pull request at `<https://github.com/eriknyquist/boyermoore>`_ and ensure that:

#. All existing unit tests pass (run tests via ``python setup.py test``)
#. New unit tests are added to cover any modified/new functionality (run ``python code_coverage.py``
   to ensure that coverage is above 98%)

You will need to install packages required for development, these are listed in ``dev_requirements.txt``:

::

    pip install -r dev_requirements.txt

If you have any questions about / need help with contributions or unit tests, please
contact Erik at eknyquist@gmail.com.

Raw data

            {
    "_id": null,
    "home_page": "http://github.com/eriknyquist/boyermoore",
    "name": "boyermoore",
    "maintainer": "",
    "docs_url": null,
    "requires_python": ">=3.7",
    "maintainer_email": "",
    "keywords": "boyermoore,boyer-moore,search,filesearch,fastsearch,boyermorehorspool,boyer-moore-horspool",
    "author": "Erik Nyquist",
    "author_email": "eknyquist@gmail.com",
    "download_url": "",
    "platform": null,
    "description": "\n.. contents:: **Table Of Contents**\n\nBoyer-Moore in pure python: search for unicode strings quickly in large files\n*****************************************************************************\n\n.. |tests_badge| image:: https://github.com/eriknyquist/boyermoore/actions/workflows/tests.yml/badge.svg\n.. |cov_badge| image:: https://github.com/eriknyquist/boyermoore/actions/workflows/coverage.yml/badge.svg\n.. |codeclimate_badge| image:: https://api.codeclimate.com/v1/badges/a5d499edc22f0a05c533/maintainability\n\n|tests_badge| |cov_badge| |codeclimate_badge|\n\n\nThis is an implementation of the Boyer-Moore substring search algorithm in pure python.\n\nIt is a shameless copy-paste of the python reference code provided `here <https://en.wikipedia.org/wiki/Boyer%E2%80%93Moore_string-search_algorithm>`_,\nwith modifications to support the following additional features:\n\n* Searching in files without reading the whole file into memory, allowing handling of large files\n* Full unicode support\n\nSee the `API documentation <https://eriknyquist.github.io/boyermoore/>`_ for more details.\n\nInstalling\n----------\n\nInstall from ``pip``.\n\n::\n\n    pip install boyermoore\n\nSearching for all occurences of a substring in a file\n-----------------------------------------------------\n\n::\n\n    >>> from boyermoore import search_file\n    >>>\n    >>> offsets = search_file(\"pattern!\", \"file.txt\")                 # Find all occurrences of \"pattern!\" in file \"file.txt\"\n    >>> offsets                                                       # Display found occurrences\n    [12, 456, 10422]                                                  # Pattern occurs at byte offsets 12, 456, and 104222\n\nSearching for the first occurence of a substring in a file\n----------------------------------------------------------\n\n::\n\n    >>> from boyermoore import search_file\n    >>>\n    >>> offsets = search_file(\"pattern!\", \"file.txt\", greedy=False)   # Find the first occurrence of \"pattern!\" in file \"file.txt\"\n    >>> offsets                                                       # Display found occurrences\n    [12]                                                              # First occurrence of pattern is at byte offset 12\n\nPerformance / Speed test\n------------------------\n\nThe following section illustrates the average speed of the ``boyermoore.search_file``\nfunction when searching for a unicode string in files of sizes ranging from 32MB to 4GB.\n\nTest environment\n################\n\nThe test was executed using Python 3.7.6 on a Windows 10 system with an Intel(R) Core(TM) i7-8700K CPU @ 3.70GHz\nand 32 GB of RAM.\n\nTest methodology\n################\n\nThe test searches for all occurrences of a fixed unicode string in a series of test files.\nThe unicode string is:\n\n::\n\n    Hello \u0928\u092e\u0938\u094d\u0924\u0947 \u041f\u0440\u0438\u0432\u0435\u0442 \u3053\u3093\u306b\u3061\u306f\n\n(\"Hello\" in English, followed by the Hindi translation, followed by the Russian translation,\nfollowed by the Japanese translation)\n\nEach test file has 2 occurrences of the unicode string, one at the very beginning (byte offset of 0)\nand one at the very end (byte offset of [file_length - pattern_length]).\n\nTest results\n############\n\nThe following table shows the times taken to search for all occurences of the unicode\nstring \"Hello \u0928\u092e\u0938\u094d\u0924\u0947 \u041f\u0440\u0438\u0432\u0435\u0442 \u3053\u3093\u306b\u3061\u306f\" inside test files of various sizes.\n\n+-----------+----------------+\n| File size | Time (seconds) |\n+===========+================+\n| 32 MB     | 0.31           |\n+-----------+----------------+\n| 64 MB     | 0.55           |\n+-----------+----------------+\n| 128 MB    | 1.07           |\n+-----------+----------------+\n| 256 MB    | 1.96           |\n+-----------+----------------+\n| 512 MB    | 3.87           |\n+-----------+----------------+\n| 1 GB      | 7.56           |\n+-----------+----------------+\n| 4 GB      | 31.01          |\n+-----------+----------------+\n\nContributions\n*************\n\nContributions are welcome, please open a pull request at `<https://github.com/eriknyquist/boyermoore>`_ and ensure that:\n\n#. All existing unit tests pass (run tests via ``python setup.py test``)\n#. New unit tests are added to cover any modified/new functionality (run ``python code_coverage.py``\n   to ensure that coverage is above 98%)\n\nYou will need to install packages required for development, these are listed in ``dev_requirements.txt``:\n\n::\n\n    pip install -r dev_requirements.txt\n\nIf you have any questions about / need help with contributions or unit tests, please\ncontact Erik at eknyquist@gmail.com.\n\n\n",
    "bugtrack_url": null,
    "license": "Apache 2.0",
    "summary": "Boyer-moore in pure python, search for unicode strings in large files very quickly",
    "version": "1.0.0",
    "split_keywords": [
        "boyermoore",
        "boyer-moore",
        "search",
        "filesearch",
        "fastsearch",
        "boyermorehorspool",
        "boyer-moore-horspool"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "md5": "ead0a2b5b11abc52f461c8dbbbe558b2",
                "sha256": "0dd4e7b29b8cabeb1114452a88241b2e5ad66bf914240151c132158dcfd87ddd"
            },
            "downloads": -1,
            "filename": "boyermoore-1.0.0-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "ead0a2b5b11abc52f461c8dbbbe558b2",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.7",
            "size": 10313,
            "upload_time": "2022-12-17T04:53:03",
            "upload_time_iso_8601": "2022-12-17T04:53:03.453284Z",
            "url": "https://files.pythonhosted.org/packages/ae/ea/efa5a23bdb04044b8931d9ac5b00d7cb5831cce0560c167b413ab14dc761/boyermoore-1.0.0-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2022-12-17 04:53:03",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "github_user": "eriknyquist",
    "github_project": "boyermoore",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "lcname": "boyermoore"
}

Erik Nyquist