ahocorapy


- Name: ahocorapy
- Version: 1.6.2
- Home page: https://github.com/abusix/ahocorapy
- Summary: ahocorapy - Pure python ahocorasick implementation
- Upload time: 2022-11-22 09:00:03
- Author: abusix
- Requires Python: >=2.7
- License: MIT
- Keywords: keyword, search, purepython, aho-corasick, ahocorasick, abusix
- Requirements: No requirements were recorded.

[![Test](https://img.shields.io/github/workflow/status/abusix/ahocorapy/test/master)](https://github.com/abusix/ahocorapy/actions)
[![Test Coverage](https://img.shields.io/codecov/c/gh/abusix/ahocorapy/master)](https://codecov.io/gh/abusix/ahocorapy)
[![Downloads](https://pepy.tech/badge/ahocorapy)](https://pepy.tech/project/ahocorapy)
[![PyPi Version](https://img.shields.io/pypi/v/ahocorapy.svg)](https://pypi.python.org/pypi/ahocorapy)
[![PyPi License](https://img.shields.io/pypi/l/ahocorapy.svg)](https://pypi.python.org/pypi/ahocorapy)
[![PyPi Versions](https://img.shields.io/pypi/pyversions/ahocorapy.svg)](https://pypi.python.org/pypi/ahocorapy)
[![PyPi Wheel](https://img.shields.io/pypi/wheel/ahocorapy.svg)](https://pypi.python.org/pypi/ahocorapy)

# ahocorapy - Fast Many-Keyword Search in Pure Python

ahocorapy is a pure Python implementation of the Aho-Corasick algorithm.
Given a list of keywords, one can check in linear time whether at least one of the keywords exists in a given text.

## Comparison:

### Why another Aho-Corasick implementation?

We started working on this at the beginning of 2016. Our requirements included Unicode support combined with Python 2.7. That
was impossible with C-extension-based libraries (like [pyahocorasick](https://github.com/WojciechMula/pyahocorasick/)). Pure
Python libraries were very slow or unusable due to memory explosion. Since then, another pure Python library has been released:
[py-aho-corasick](https://github.com/JanFan/py-aho-corasick). The repository also contains some discussion about different
implementations.
There is also [acora](https://github.com/scoder/acora), but it includes the note ('current construction algorithm is not
suitable for really large sets of keywords'), which really was the case the last time I tested, because RAM ran out quickly.

### Differences

- Compared to [pyahocorasick](https://github.com/WojciechMula/pyahocorasick/), our library supports Unicode in Python 2.7, just like [py-aho-corasick](https://github.com/JanFan/py-aho-corasick).
  We don't use any C extension, so the library is not platform dependent.

- On top of the standard Aho-Corasick longest suffix search, we also perform a shortcutting routine at the end of the setup, so
  that our lookup is fast, while the setup takes longer. During setup we go through the states and directly add transitions that are
  "offered" by the longest suffix or their longest suffixes. This leads to faster lookup times, because in the end we only have to
  follow simple transitions and don't have to perform any additional suffix lookup. It also leads to a bigger memory footprint,
  because the number of transitions is higher: they are all included explicitly rather than implicitly hidden behind suffix pointers.

- We added a small tool that helps you visualize the resulting graph. This may help you understand the algorithm, if you'd like. See below.

- Fully pickleable (Python's built-in de-/serialization). ahocorapy uses a non-recursive custom implementation for de-/serialization so that even huge keyword trees can be pickled.
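
The shortcutting idea from the list above can be sketched with a toy automaton. This is a minimal illustration under stated assumptions, not ahocorapy's actual code: the names `build_automaton` and `find_all`, and the list-of-dicts layout, are made up for this example. During the breadth-first pass every state copies the transitions its suffix state offers, so the lookup needs only one dictionary access per symbol.

```python
from collections import deque


def build_automaton(keywords):
    """Build a toy Aho-Corasick automaton with suffix shortcuts baked in."""
    # State 0 is the root. Each state has explicit transitions, a suffix
    # link and the list of keywords that end in this state.
    transitions = [{}]
    suffix = [0]
    matches = [[]]

    # 1. Plain trie construction.
    for keyword in keywords:
        state = 0
        for symbol in keyword:
            if symbol not in transitions[state]:
                transitions.append({})
                suffix.append(0)
                matches.append([])
                transitions[state][symbol] = len(transitions) - 1
            state = transitions[state][symbol]
        matches[state].append(keyword)

    # 2. Breadth-first pass: set the suffix links of the children, then copy
    #    every transition "offered" by the suffix state that is still missing.
    queue = deque(transitions[0].values())
    while queue:
        state = queue.popleft()
        for symbol, child in transitions[state].items():
            # The suffix state is shallower and already shortcut, so a
            # single lookup replaces the usual walk along suffix links.
            suffix[child] = transitions[suffix[state]].get(symbol, 0)
            matches[child] = matches[child] + matches[suffix[child]]
            queue.append(child)
        for symbol, target in transitions[suffix[state]].items():
            transitions[state].setdefault(symbol, target)
    return transitions, matches


def find_all(text, transitions, matches):
    """Yield (keyword, start) pairs using one dict lookup per symbol."""
    state = 0
    for index, symbol in enumerate(text):
        # A missing transition means no suffix offers it: back to the root.
        state = transitions[state].get(symbol, 0)
        for keyword in matches[state]:
            yield keyword, index + 1 - len(keyword)


transitions, matches = build_automaton(['he', 'she', 'his', 'hers'])
print(list(find_all('ushers', transitions, matches)))
```

The copied transitions are what makes the memory footprint grow: a state ends up storing its own trie edges plus those inherited from its chain of suffixes.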

### Performance

I compared the two libraries mentioned above with ahocorapy. We used a list of 50,000 keywords and an input text of 34,199 characters.
The text contains only one keyword from the list.
The setup process was run once per library and the search process was run 100 times. The following results are in seconds (the lookup times are totals over the 100 runs, not averages).

You can perform this test yourself using `python tests/ahocorapy_performance_test.py`. (Except for the pyahocorasick_py results. These were taken by importing the
pure python version of the code of [pyahocorasick](https://github.com/WojciechMula/pyahocorasick/). It's not available through pypi
as stated in the code.)
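
The measurement procedure described above (one setup run, then a fixed number of search runs, reported as a total) could look roughly like the following. This is a hedged sketch, not the contents of `tests/ahocorapy_performance_test.py`; the `benchmark` helper and the naive stand-in search are assumptions made so that the example runs without any third-party library.

```python
import time


def benchmark(setup, search, text, repeats=100):
    """Time one setup call and `repeats` search calls, in seconds."""
    start = time.perf_counter()
    automaton = setup()
    setup_seconds = time.perf_counter() - start

    start = time.perf_counter()
    for _ in range(repeats):
        search(automaton, text)
    search_seconds = time.perf_counter() - start  # total, not averaged
    return setup_seconds, search_seconds


# Naive stand-in: a set of keywords and a substring test per keyword.
keywords = ['keyword%d' % i for i in range(1000)]
text = 'x' * 1000 + 'keyword42'
setup_s, search_s = benchmark(
    setup=lambda: set(keywords),
    search=lambda kws, t: any(k in t for k in kws),
    text=text,
    repeats=3,
)
print('setup: %.4fs, search (3x): %.4fs' % (setup_s, search_s))
```

To time a real library, `setup` would build and finalize the keyword tree and `search` would call its lookup method.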

I also added measurements for the pure Python libraries when run with pypy.

These are the results:

| Library (Variant)                                      | Setup (1x) | Search (100x) |
| ------------------------------------------------------ | ---------- | ------------- |
| ahocorapy\*                                            | 0.30s      | 0.29s         |
| ahocorapy (run with pypy)\*                            | 0.37s      | 0.10s         |
| pyahocorasick\*                                        | 0.04s      | 0.04s         |
| pyahocorasick (run with pypy)\*                        | 0.10s      | 0.05s         |
| pyahocorasick (pure python variant in github repo)\*\* | 0.50s      | 1.68s         |
| py_aho_corasick\*                                      | 0.72s      | 4.60s         |
| py_aho_corasick (run with pypy)\*                      | 0.83s      | 2.02s         |

As expected, the C extension far outperforms the pure Python implementations. Even though there is probably still room for optimization in
ahocorapy, we are not going to get to the mark that pyahocorasick sets. ahocorapy's lookups are faster than py_aho_corasick's.
When run with pypy, ahocorapy comes within a factor of about 2.5 of pyahocorasick (0.10s vs. 0.04s), at least when it comes to
searching. The setup overhead is higher due to the suffix shortcutting mechanism used.

\* Specs

- CPU: AMD Ryzen 2700X
- Linux Kernel: 6.0.6
- CPython: 3.11.0
- pypy: PyPy 7.3.9 (Python 3.9.12) with GCC 10.2.1 20210130
- Date tested: 2022-11-22

\*\* Old measurement with different specs

## Basic Usage:

### Installation

```
pip install ahocorapy
```

### Creation of the Search Tree

```python
from ahocorapy.keywordtree import KeywordTree
kwtree = KeywordTree(case_insensitive=True)
kwtree.add('malaga')
kwtree.add('lacrosse')
kwtree.add('mallorca')
kwtree.add('mallorca bella')
kwtree.add('orca')
kwtree.finalize()
```

### Searching

```python
result = kwtree.search('My favorite islands are malaga and sylt.')
print(result)
```

Prints:

```python
('malaga', 24)
```

The search_all method returns a generator for all keywords found, or None if there is none.

```python
results = kwtree.search_all('malheur on mallorca bellacrosse')
for result in results:
    print(result)
```

Prints:

```python
('mallorca', 11)
('orca', 15)
('mallorca bella', 11)
('lacrosse', 23)
```
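
As the output above shows, each result is a `(keyword, start)` tuple, where `start` is the zero-based offset of the match in the searched text. A small helper can turn such tuples into `(start, end, matched_text)` spans and double-check each offset. The `match_spans` function is a hypothetical helper written for this example, not part of ahocorapy, and it assumes that lowercasing does not change the length of the text.

```python
def match_spans(text, matches, case_insensitive=True):
    """Turn (keyword, start) tuples into (start, end, matched_text) spans."""
    haystack = text.lower() if case_insensitive else text
    spans = []
    for keyword, start in matches:
        end = start + len(keyword)  # exclusive end offset
        needle = keyword.lower() if case_insensitive else keyword
        if haystack[start:end] != needle:
            raise ValueError('%r does not occur at offset %d' % (keyword, start))
        # Slice the original text so the caller sees the original casing.
        spans.append((start, end, text[start:end]))
    return spans


text = 'malheur on mallorca bellacrosse'
found = [('mallorca', 11), ('orca', 15), ('mallorca bella', 11), ('lacrosse', 23)]
for span in match_spans(text, found):
    print(span)
```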

### Thread Safety

The construction of the tree is currently NOT thread safe. That means `add` shouldn't be called concurrently from multiple threads; the behavior in that case is undefined.

After `finalize` has been called, you can use the `search` functionality on the same tree from multiple threads at the same time. That part is thread safe.
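
One way to follow this rule is to build and finalize the tree in a single thread and only then hand it to worker threads. The sketch below assumes nothing about the tree beyond a `search(text)` method; `search_texts` is a hypothetical helper, and `concurrent.futures` is part of the standard library only on Python 3.

```python
from concurrent.futures import ThreadPoolExecutor


def search_texts(tree, texts, max_workers=4):
    """Run tree.search over many texts from a pool of threads."""
    # `tree` must already be finalized; it is only read from here on.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() returns the results in the same order as `texts`.
        return list(pool.map(tree.search, texts))
```

With the tree from the example above, `search_texts(kwtree, ['first text', 'second text'])` would return one `search` result per text.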

## Drawing Graph

You can draw the underlying graph with the Visualizer class.
This feature requires a working installation of the pygraphviz library.

```python
from ahocorapy_visualizer.visualizer import Visualizer
visualizer = Visualizer()
visualizer.draw('readme_example.png', kwtree)
```

The resulting .png of the graph looks like this:

![graph for kwtree](https://raw.githubusercontent.com/abusix/ahocorapy/master/img/readme_example.png "Keyword Tree")

            

Raw data

{
    "_id": null,
    "home_page": "https://github.com/abusix/ahocorapy",
    "name": "ahocorapy",
    "maintainer": "",
    "docs_url": null,
    "requires_python": ">=2.7",
    "maintainer_email": "",
    "keywords": "keyword,search,purepython,aho-corasick,ahocorasick,abusix",
    "author": "abusix",
    "author_email": "fp@abusix.com",
    "download_url": "https://files.pythonhosted.org/packages/38/d7/81b1d1533896d72186add63e3dd2fb70142aede6cc8a3cc48bf8a6d51002/ahocorapy-1.6.2.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "ahocorapy - Pure python ahocorasick implementation",
    "version": "1.6.2",
    "project_urls": {
        "Company": "https://www.abusix.com/",
        "Homepage": "https://github.com/abusix/ahocorapy",
        "Source": "https://github.com/abusix/ahocorapy"
    },
    "split_keywords": [
        "keyword",
        "search",
        "purepython",
        "aho-corasick",
        "ahocorasick",
        "abusix"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "e6faad9a6820a33c18e898b9f0a88aa38db34611f2ebbac46e7516d3a068bde1",
                "md5": "fdbd7f3e9ce7e03c07c92fe3187d96f4",
                "sha256": "4f8a7d8f8b074d72b0a0db1dff3ba54cb30ae72e2a62da6deaaea63c49ecf26c"
            },
            "downloads": -1,
            "filename": "ahocorapy-1.6.2-py2.py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "fdbd7f3e9ce7e03c07c92fe3187d96f4",
            "packagetype": "bdist_wheel",
            "python_version": "py2.py3",
            "requires_python": ">=2.7",
            "size": 8295,
            "upload_time": "2022-11-22T09:00:01",
            "upload_time_iso_8601": "2022-11-22T09:00:01.731957Z",
            "url": "https://files.pythonhosted.org/packages/e6/fa/ad9a6820a33c18e898b9f0a88aa38db34611f2ebbac46e7516d3a068bde1/ahocorapy-1.6.2-py2.py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "38d781b1d1533896d72186add63e3dd2fb70142aede6cc8a3cc48bf8a6d51002",
                "md5": "608db159f7ecd5fcdea1b5f7f8417922",
                "sha256": "67a01cfdb91bb3ee81ec3a2eeacab42f0887b606463877bc08c636e873538940"
            },
            "downloads": -1,
            "filename": "ahocorapy-1.6.2.tar.gz",
            "has_sig": false,
            "md5_digest": "608db159f7ecd5fcdea1b5f7f8417922",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=2.7",
            "size": 10721,
            "upload_time": "2022-11-22T09:00:03",
            "upload_time_iso_8601": "2022-11-22T09:00:03.995149Z",
            "url": "https://files.pythonhosted.org/packages/38/d7/81b1d1533896d72186add63e3dd2fb70142aede6cc8a3cc48bf8a6d51002/ahocorapy-1.6.2.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2022-11-22 09:00:03",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "abusix",
    "github_project": "ahocorapy",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "tox": true,
    "lcname": "ahocorapy"
}
        