# trieregex
[![pypi Version](https://img.shields.io/pypi/v/trieregex.svg?logo=pypi&logoColor=white)](https://pypi.org/project/trieregex/)
[![python](https://img.shields.io/pypi/pyversions/trieregex.svg?logo=python&logoColor=white)](https://pypi.org/project/trieregex/)
[**trieregex**](https://github.com/ermanh/trieregex/) creates efficient [regular expressions](https://en.wikipedia.org/wiki/Regular_expression) (regexes) by storing a list of words in a [trie](https://en.wikipedia.org/wiki/Trie) structure, and translating the trie into a more compact pattern.
The speed performance of a trie-based regex (e.g. `r'(?:under(?:sta(?:nd|te))|take|go)?)'`), compared to a flat regex union (i.e., `r'(?:understand|understate|undertake|undergo)'`, becomes obvious when using extremely large word lists, and especially when more specific or complicated contexts are specified at the boundaries. The processing time of using this package itself is also minimized with [memoization](https://en.wikipedia.org/wiki/Memoization).
## Installation
```shell
pip install trieregex
```
## Usage
```py
import re
from trieregex import TrieRegEx as TRE
words = ['lemon', 'lime', 'pomelo', 'orange', 'citron']
more_words = ['grapefruit', 'grape', 'tangerine', 'tangelo']
# Initialize class instance
tre = TRE()
# Add word(s)
tre = TRE(*words) # word(s) can be added upon instance creation, or after
tre.add('kumquat') # add one word
tre.add(*more_words) # add a list of words
# Remove word(s)
tre.remove('citron') # remove one word
tre.remove(*words) # remove a list of words
# Check if a word exists in the trie
tre.has('citron') # Returns: False
tre.has('tangerine') # Returns: True
# Create regex pattern from the trie
tre.regex() # Returns: '(?:tange(?:rine|lo)|grape(?:fruit)?|kumquat)'
# Add boundary context and compile for matching
pattern = re.compile(f'\\b{tre.regex()}\\b') # OR rf'\b{tre.regex()}\b'
pattern # Returns: re.compile('\\b(?:tange(?:rine|lo)|grape(?:fruit)?|kumquat)\\b')
pattern.findall("A kumquat is tastier than a loquat") # Returns: ['kumquat']
# Inspect unique initial characters in the trie
tre.initials() # Returns: ['g', 'k', 't']
# Inspect unique final characters in the trie
tre.finals() # Returns: ['e', 'o', 't']
```
The last two methods are intended for the user to check what boundary contexts may be appropriate to set in the final regex. More discussed below.
## Boundaries
**trieregex** does not include any default boundaries (such as `r'\b'`) in the pattern returned from its `TrieRegEx.regex()` method, so that the user can determine what is appropriate for their particular use case.
Consider a fictitious brand name `!Citrus` with an exclamation mark at the beginning, using `r'\b'` to define its boundaries in an attempt to catch it:
```py
string = 'I love !Citrus products!'
re.findall(r'\b(!Citrus)\b', string) # Returns: []
```
The `re.findall()` call returns an empty list because the first `r'\b'` is not matched. While `r'\b'` stands for the boundary between a word character and a non-word character, the exclamation mark and its preceding space character are both non-word characters.
An appropriate regex for catching `'!Citrus'` can be written as follows, where the context before the exclamation mark is either start-of-string (`r'^'`) or a non-word character (`r'[^\w]'`):
```py
re.findall(r'(?:^|[^\w])(!Citrus)\b', string) # Returns: ['!Citrus']
```
This package was designed to allow any pattern in its trie, not just normal words bound by space and punctuation, so that the user can define their own regex context, and have the option to avoid data normalization when it is undesirable.
## Python version comptability
Due to [f-strings](https://www.python.org/dev/peps/pep-0498/) and type hints, this package is only comptible with Python versions >=3.6.
For Python versions >=2.7, <3.6, backports such as [`future-fstrings`](https://pypi.org/project/future-fstrings/) and [`typing`](https://pypi.org/project/typing/) are available.
Raw data
{
"_id": null,
"home_page": "https://github.com/ermanh/trieregex",
"name": "trieregex",
"maintainer": "",
"docs_url": null,
"requires_python": ">=3.6",
"maintainer_email": "",
"keywords": "regular expressions,regex,pattern,trie",
"author": "Herman Leung",
"author_email": "leung.hm@gmail.com",
"download_url": "https://files.pythonhosted.org/packages/06/48/d22ed66693fe643e7a376b2178bc91bd32771de3d6730ffe288da6b95964/trieregex-1.0.0.tar.gz",
"platform": "",
"description": "# trieregex\n\n[![pypi Version](https://img.shields.io/pypi/v/trieregex.svg?logo=pypi&logoColor=white)](https://pypi.org/project/trieregex/)\n[![python](https://img.shields.io/pypi/pyversions/trieregex.svg?logo=python&logoColor=white)](https://pypi.org/project/trieregex/)\n\n[**trieregex**](https://github.com/ermanh/trieregex/) creates efficient [regular expressions](https://en.wikipedia.org/wiki/Regular_expression) (regexes) by storing a list of words in a [trie](https://en.wikipedia.org/wiki/Trie) structure, and translating the trie into a more compact pattern.\n\nThe speed performance of a trie-based regex (e.g. `r'(?:under(?:sta(?:nd|te))|take|go)?)'`), compared to a flat regex union (i.e., `r'(?:understand|understate|undertake|undergo)'`, becomes obvious when using extremely large word lists, and especially when more specific or complicated contexts are specified at the boundaries. The processing time of using this package itself is also minimized with [memoization](https://en.wikipedia.org/wiki/Memoization).\n\n## Installation\n\n```shell\npip install trieregex\n```\n\n## Usage\n\n```py\nimport re\nfrom trieregex import TrieRegEx as TRE\n\nwords = ['lemon', 'lime', 'pomelo', 'orange', 'citron']\nmore_words = ['grapefruit', 'grape', 'tangerine', 'tangelo']\n\n# Initialize class instance\ntre = TRE()\n\n# Add word(s)\ntre = TRE(*words) # word(s) can be added upon instance creation, or after\ntre.add('kumquat') # add one word\ntre.add(*more_words) # add a list of words \n\n# Remove word(s)\ntre.remove('citron') # remove one word\ntre.remove(*words) # remove a list of words\n\n# Check if a word exists in the trie\ntre.has('citron') # Returns: False\ntre.has('tangerine') # Returns: True\n\n# Create regex pattern from the trie\ntre.regex() # Returns: '(?:tange(?:rine|lo)|grape(?:fruit)?|kumquat)'\n\n# Add boundary context and compile for matching\npattern = re.compile(f'\\\\b{tre.regex()}\\\\b') # OR rf'\\b{tre.regex()}\\b'\npattern # Returns: re.compile('\\\\b(?:tange(?:rine|lo)|grape(?:fruit)?|kumquat)\\\\b')\npattern.findall(\"A kumquat is tastier than a loquat\") # Returns: ['kumquat']\n\n# Inspect unique initial characters in the trie\ntre.initials() # Returns: ['g', 'k', 't']\n\n# Inspect unique final characters in the trie\ntre.finals() # Returns: ['e', 'o', 't']\n```\n\nThe last two methods are intended for the user to check what boundary contexts may be appropriate to set in the final regex. More discussed below.\n\n## Boundaries\n\n**trieregex** does not include any default boundaries (such as `r'\\b'`) in the pattern returned from its `TrieRegEx.regex()` method, so that the user can determine what is appropriate for their particular use case. \n\nConsider a fictitious brand name `!Citrus` with an exclamation mark at the beginning, using `r'\\b'` to define its boundaries in an attempt to catch it:\n\n```py\nstring = 'I love !Citrus products!'\nre.findall(r'\\b(!Citrus)\\b', string) # Returns: []\n```\n\nThe `re.findall()` call returns an empty list because the first `r'\\b'` is not matched. While `r'\\b'` stands for the boundary between a word character and a non-word character, the exclamation mark and its preceding space character are both non-word characters. \n\nAn appropriate regex for catching `'!Citrus'` can be written as follows, where the context before the exclamation mark is either start-of-string (`r'^'`) or a non-word character (`r'[^\\w]'`): \n\n```py\nre.findall(r'(?:^|[^\\w])(!Citrus)\\b', string) # Returns: ['!Citrus']\n```\n\nThis package was designed to allow any pattern in its trie, not just normal words bound by space and punctuation, so that the user can define their own regex context, and have the option to avoid data normalization when it is undesirable.\n\n## Python version comptability\n\nDue to [f-strings](https://www.python.org/dev/peps/pep-0498/) and type hints, this package is only comptible with Python versions >=3.6. \n\nFor Python versions >=2.7, <3.6, backports such as [`future-fstrings`](https://pypi.org/project/future-fstrings/) and [`typing`](https://pypi.org/project/typing/) are available.",
"bugtrack_url": null,
"license": "MIT",
"summary": "Build trie-based regular expressions from large word lists",
"version": "1.0.0",
"project_urls": {
"Homepage": "https://github.com/ermanh/trieregex"
},
"split_keywords": [
"regular expressions",
"regex",
"pattern",
"trie"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "0648d22ed66693fe643e7a376b2178bc91bd32771de3d6730ffe288da6b95964",
"md5": "494ba619f756fd275d1b31f0532fe5d6",
"sha256": "a34dd31d04aa169e1989971a315fcbd524126330c7f2f9f16991b0a8c9084eaf"
},
"downloads": -1,
"filename": "trieregex-1.0.0.tar.gz",
"has_sig": false,
"md5_digest": "494ba619f756fd275d1b31f0532fe5d6",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.6",
"size": 11276,
"upload_time": "2020-06-23T11:42:10",
"upload_time_iso_8601": "2020-06-23T11:42:10.111758Z",
"url": "https://files.pythonhosted.org/packages/06/48/d22ed66693fe643e7a376b2178bc91bd32771de3d6730ffe288da6b95964/trieregex-1.0.0.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2020-06-23 11:42:10",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "ermanh",
"github_project": "trieregex",
"travis_ci": false,
"coveralls": false,
"github_actions": false,
"lcname": "trieregex"
}