.. contents:: **Table Of Contents**
Boyer-Moore in pure python: search for unicode strings quickly in large files
*****************************************************************************
.. |tests_badge| image:: https://github.com/eriknyquist/boyermoore/actions/workflows/tests.yml/badge.svg
.. |cov_badge| image:: https://github.com/eriknyquist/boyermoore/actions/workflows/coverage.yml/badge.svg
.. |codeclimate_badge| image:: https://api.codeclimate.com/v1/badges/a5d499edc22f0a05c533/maintainability
|tests_badge| |cov_badge| |codeclimate_badge|
This is an implementation of the Boyer-Moore substring search algorithm in pure python.
It is a shameless copy-paste of the python reference code provided `here <https://en.wikipedia.org/wiki/Boyer%E2%80%93Moore_string-search_algorithm>`_,
with modifications to support the following additional features:
* Searching in files without reading the whole file into memory, allowing handling of large files
* Full unicode support
See the `API documentation <https://eriknyquist.github.io/boyermoore/>`_ for more details.
Installing
----------
Install from ``pip``.
::
pip install boyermoore
Searching for all occurences of a substring in a file
-----------------------------------------------------
::
>>> from boyermoore import search_file
>>>
>>> offsets = search_file("pattern!", "file.txt") # Find all occurrences of "pattern!" in file "file.txt"
>>> offsets # Display found occurrences
[12, 456, 10422] # Pattern occurs at byte offsets 12, 456, and 104222
Searching for the first occurence of a substring in a file
----------------------------------------------------------
::
>>> from boyermoore import search_file
>>>
>>> offsets = search_file("pattern!", "file.txt", greedy=False) # Find the first occurrence of "pattern!" in file "file.txt"
>>> offsets # Display found occurrences
[12] # First occurrence of pattern is at byte offset 12
Performance / Speed test
------------------------
The following section illustrates the average speed of the ``boyermoore.search_file``
function when searching for a unicode string in files of sizes ranging from 32MB to 4GB.
Test environment
################
The test was executed using Python 3.7.6 on a Windows 10 system with an Intel(R) Core(TM) i7-8700K CPU @ 3.70GHz
and 32 GB of RAM.
Test methodology
################
The test searches for all occurrences of a fixed unicode string in a series of test files.
The unicode string is:
::
Hello नमस्ते Привет こんにちは
("Hello" in English, followed by the Hindi translation, followed by the Russian translation,
followed by the Japanese translation)
Each test file has 2 occurrences of the unicode string, one at the very beginning (byte offset of 0)
and one at the very end (byte offset of [file_length - pattern_length]).
Test results
############
The following table shows the times taken to search for all occurences of the unicode
string "Hello नमस्ते Привет こんにちは" inside test files of various sizes.
+-----------+----------------+
| File size | Time (seconds) |
+===========+================+
| 32 MB | 0.31 |
+-----------+----------------+
| 64 MB | 0.55 |
+-----------+----------------+
| 128 MB | 1.07 |
+-----------+----------------+
| 256 MB | 1.96 |
+-----------+----------------+
| 512 MB | 3.87 |
+-----------+----------------+
| 1 GB | 7.56 |
+-----------+----------------+
| 4 GB | 31.01 |
+-----------+----------------+
Contributions
*************
Contributions are welcome, please open a pull request at `<https://github.com/eriknyquist/boyermoore>`_ and ensure that:
#. All existing unit tests pass (run tests via ``python setup.py test``)
#. New unit tests are added to cover any modified/new functionality (run ``python code_coverage.py``
to ensure that coverage is above 98%)
You will need to install packages required for development, these are listed in ``dev_requirements.txt``:
::
pip install -r dev_requirements.txt
If you have any questions about / need help with contributions or unit tests, please
contact Erik at eknyquist@gmail.com.
Raw data
{
"_id": null,
"home_page": "http://github.com/eriknyquist/boyermoore",
"name": "boyermoore",
"maintainer": "",
"docs_url": null,
"requires_python": ">=3.7",
"maintainer_email": "",
"keywords": "boyermoore,boyer-moore,search,filesearch,fastsearch,boyermorehorspool,boyer-moore-horspool",
"author": "Erik Nyquist",
"author_email": "eknyquist@gmail.com",
"download_url": "",
"platform": null,
"description": "\n.. contents:: **Table Of Contents**\n\nBoyer-Moore in pure python: search for unicode strings quickly in large files\n*****************************************************************************\n\n.. |tests_badge| image:: https://github.com/eriknyquist/boyermoore/actions/workflows/tests.yml/badge.svg\n.. |cov_badge| image:: https://github.com/eriknyquist/boyermoore/actions/workflows/coverage.yml/badge.svg\n.. |codeclimate_badge| image:: https://api.codeclimate.com/v1/badges/a5d499edc22f0a05c533/maintainability\n\n|tests_badge| |cov_badge| |codeclimate_badge|\n\n\nThis is an implementation of the Boyer-Moore substring search algorithm in pure python.\n\nIt is a shameless copy-paste of the python reference code provided `here <https://en.wikipedia.org/wiki/Boyer%E2%80%93Moore_string-search_algorithm>`_,\nwith modifications to support the following additional features:\n\n* Searching in files without reading the whole file into memory, allowing handling of large files\n* Full unicode support\n\nSee the `API documentation <https://eriknyquist.github.io/boyermoore/>`_ for more details.\n\nInstalling\n----------\n\nInstall from ``pip``.\n\n::\n\n pip install boyermoore\n\nSearching for all occurences of a substring in a file\n-----------------------------------------------------\n\n::\n\n >>> from boyermoore import search_file\n >>>\n >>> offsets = search_file(\"pattern!\", \"file.txt\") # Find all occurrences of \"pattern!\" in file \"file.txt\"\n >>> offsets # Display found occurrences\n [12, 456, 10422] # Pattern occurs at byte offsets 12, 456, and 104222\n\nSearching for the first occurence of a substring in a file\n----------------------------------------------------------\n\n::\n\n >>> from boyermoore import search_file\n >>>\n >>> offsets = search_file(\"pattern!\", \"file.txt\", greedy=False) # Find the first occurrence of \"pattern!\" in file \"file.txt\"\n >>> offsets # Display found occurrences\n [12] # First occurrence of pattern is at byte offset 12\n\nPerformance / Speed test\n------------------------\n\nThe following section illustrates the average speed of the ``boyermoore.search_file``\nfunction when searching for a unicode string in files of sizes ranging from 32MB to 4GB.\n\nTest environment\n################\n\nThe test was executed using Python 3.7.6 on a Windows 10 system with an Intel(R) Core(TM) i7-8700K CPU @ 3.70GHz\nand 32 GB of RAM.\n\nTest methodology\n################\n\nThe test searches for all occurrences of a fixed unicode string in a series of test files.\nThe unicode string is:\n\n::\n\n Hello \u0928\u092e\u0938\u094d\u0924\u0947 \u041f\u0440\u0438\u0432\u0435\u0442 \u3053\u3093\u306b\u3061\u306f\n\n(\"Hello\" in English, followed by the Hindi translation, followed by the Russian translation,\nfollowed by the Japanese translation)\n\nEach test file has 2 occurrences of the unicode string, one at the very beginning (byte offset of 0)\nand one at the very end (byte offset of [file_length - pattern_length]).\n\nTest results\n############\n\nThe following table shows the times taken to search for all occurences of the unicode\nstring \"Hello \u0928\u092e\u0938\u094d\u0924\u0947 \u041f\u0440\u0438\u0432\u0435\u0442 \u3053\u3093\u306b\u3061\u306f\" inside test files of various sizes.\n\n+-----------+----------------+\n| File size | Time (seconds) |\n+===========+================+\n| 32 MB | 0.31 |\n+-----------+----------------+\n| 64 MB | 0.55 |\n+-----------+----------------+\n| 128 MB | 1.07 |\n+-----------+----------------+\n| 256 MB | 1.96 |\n+-----------+----------------+\n| 512 MB | 3.87 |\n+-----------+----------------+\n| 1 GB | 7.56 |\n+-----------+----------------+\n| 4 GB | 31.01 |\n+-----------+----------------+\n\nContributions\n*************\n\nContributions are welcome, please open a pull request at `<https://github.com/eriknyquist/boyermoore>`_ and ensure that:\n\n#. All existing unit tests pass (run tests via ``python setup.py test``)\n#. New unit tests are added to cover any modified/new functionality (run ``python code_coverage.py``\n to ensure that coverage is above 98%)\n\nYou will need to install packages required for development, these are listed in ``dev_requirements.txt``:\n\n::\n\n pip install -r dev_requirements.txt\n\nIf you have any questions about / need help with contributions or unit tests, please\ncontact Erik at eknyquist@gmail.com.\n\n\n",
"bugtrack_url": null,
"license": "Apache 2.0",
"summary": "Boyer-moore in pure python, search for unicode strings in large files very quickly",
"version": "1.0.0",
"split_keywords": [
"boyermoore",
"boyer-moore",
"search",
"filesearch",
"fastsearch",
"boyermorehorspool",
"boyer-moore-horspool"
],
"urls": [
{
"comment_text": "",
"digests": {
"md5": "ead0a2b5b11abc52f461c8dbbbe558b2",
"sha256": "0dd4e7b29b8cabeb1114452a88241b2e5ad66bf914240151c132158dcfd87ddd"
},
"downloads": -1,
"filename": "boyermoore-1.0.0-py3-none-any.whl",
"has_sig": false,
"md5_digest": "ead0a2b5b11abc52f461c8dbbbe558b2",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.7",
"size": 10313,
"upload_time": "2022-12-17T04:53:03",
"upload_time_iso_8601": "2022-12-17T04:53:03.453284Z",
"url": "https://files.pythonhosted.org/packages/ae/ea/efa5a23bdb04044b8931d9ac5b00d7cb5831cce0560c167b413ab14dc761/boyermoore-1.0.0-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2022-12-17 04:53:03",
"github": true,
"gitlab": false,
"bitbucket": false,
"github_user": "eriknyquist",
"github_project": "boyermoore",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"lcname": "boyermoore"
}