retrie


| Field | Value |
| --- | --- |
| Name | retrie |
| Version | 0.3.0 |
| home_page | https://github.com/ddelange/retrie |
| Summary | Efficient Trie-based regex unions for blacklist/whitelist filtering and one-pass mapping-based string replacing |
| upload_time | 2024-02-22 07:32:43 |
| docs_url | None |
| author | ddelange |
| requires_python | `>=2.7, !=3.0.*, !=3.1.*, !=3.2.*, !=3.3.*, !=3.4.*` |
| license | MIT |
| keywords | pure-python regex trie regex-trie blacklist whitelist re search replace |
| requirements | No requirements were recorded. |
| Travis-CI | No Travis. |
| coveralls test coverage | No coveralls. |

# retrie

[![build](https://img.shields.io/github/actions/workflow/status/ddelange/retrie/main.yml?branch=master&logo=github&cacheSeconds=86400)](https://github.com/ddelange/retrie/actions?query=branch%3Amaster)
[![codecov](https://img.shields.io/codecov/c/github/ddelange/retrie/master?logo=codecov&logoColor=white)](https://codecov.io/gh/ddelange/retrie)
[![pypi Version](https://img.shields.io/pypi/v/retrie.svg?logo=pypi&logoColor=white)](https://pypi.org/project/retrie/)
[![python](https://img.shields.io/pypi/pyversions/retrie.svg?logo=python&logoColor=white)](https://pypi.org/project/retrie/)
[![downloads](https://static.pepy.tech/badge/retrie)](https://pypistats.org/packages/retrie)
[![black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/python/black)


[retrie](https://github.com/ddelange/retrie) offers fast methods to match and replace (sequences of) strings based on efficient Trie-based regex unions.

#### Trie

Instead of matching against a simple regex union, which becomes slow for large sets of words, a more efficient regex pattern can be compiled using a [Trie](https://en.wikipedia.org/wiki/Trie) structure:

```py
from retrie.trie import Trie


trie = Trie()

trie.add("abc", "foo", "abs")
assert trie.pattern() == "(?:ab[cs]|foo)"  # equivalent to but faster than "(?:abc|abs|foo)"

trie.add("absolute")
assert trie.pattern() == "(?:ab(?:c|s(?:olute)?)|foo)"

trie.add("abx")
assert trie.pattern() == "(?:ab(?:[cx]|s(?:olute)?)|foo)"

trie.add("abxy")
assert trie.pattern() == "(?:ab(?:c|s(?:olute)?|xy?)|foo)"
```
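
To make the idea concrete, here is a minimal standard-library sketch of how a trie can be collapsed into such a pattern. It is not retrie's actual implementation: `build_trie` and `trie_to_pattern` are hypothetical helpers, and the nested-dict layout with `""` as the end-of-word marker is an assumption made for illustration.

```python
import re


def build_trie(words):
    """Store words in a nested dict; the "" key marks the end of a word."""
    root = {}
    for word in words:
        node = root
        for char in word:
            node = node.setdefault(char, {})
        node[""] = True
    return root


def trie_to_pattern(node):
    """Turn a nested-dict trie into a regex fragment (None for a bare word end)."""
    if "" in node and len(node) == 1:
        return None

    singles = []   # characters that complete a word with nothing after them
    branches = []  # characters followed by a longer sub-pattern
    for char in sorted(key for key in node if key != ""):
        sub = trie_to_pattern(node[char])
        if sub is None:
            singles.append(re.escape(char))
        else:
            branches.append(re.escape(char) + sub)

    only_singles = not branches
    if len(singles) == 1:
        branches.insert(0, singles[0])
    elif singles:
        # several one-character endings collapse into a character class
        branches.insert(0, "[" + "".join(singles) + "]")

    result = branches[0] if len(branches) == 1 else "(?:" + "|".join(branches) + ")"

    if "" in node:  # a word also ends here, so the remainder is optional
        result = result + "?" if only_singles else "(?:" + result + ")?"
    return result


words = ["abc", "foo", "abs"]
pattern = trie_to_pattern(build_trie(words))
naive = "(?:" + "|".join(sorted(words)) + ")"
text = "abs of foo and abc"
# both patterns find the same matches; the trie form shares the common prefix "ab"
assert re.findall(pattern, text) == re.findall(naive, text)
```

Shared prefixes are emitted once, so the regex engine does not have to retry every alternative from the start of the word.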

A Trie may be populated with zero or more strings at instantiation or via `.add`, which returns the Trie so that calls can be chained. Two Tries can be merged with the `+` and `+=` operators, and they compare equal if their data dictionaries are equal.

```py
from retrie.trie import Trie

trie = Trie()
trie += Trie("abc")
assert (
    trie + Trie().add("foo")
    == Trie("abc", "foo")
    == Trie(*["abc", "foo"])
    == Trie().add(*["abc", "foo"])
    == Trie().add("abc", "foo")
    == Trie().add("abc").add("foo")
)
```
```
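
As a rough illustration of why merging and dictionary-based equality fit together, the sketch below merges two nested-dict tries recursively. The helpers `make_trie` and `merge_tries` and the `""` end-of-word marker are assumptions for this example, not retrie's internals.

```python
def make_trie(*words):
    """Build a nested-dict trie; the "" key marks the end of a word."""
    root = {}
    for word in words:
        node = root
        for char in word:
            node = node.setdefault(char, {})
        node[""] = True
    return root


def merge_tries(left, right):
    """Return a new trie containing every word of both inputs."""
    merged = {}
    for key in left.keys() | right.keys():
        if key == "":
            merged[key] = True  # a word ends here in at least one input
        else:
            # recurse into the subtrees, treating a missing side as empty
            merged[key] = merge_tries(left.get(key, {}), right.get(key, {}))
    return merged


# insertion order does not matter: equal word sets give equal dictionaries
assert merge_tries(make_trie("abc"), make_trie("foo")) == make_trie("foo", "abc")
```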


## Installation

This pure-Python, OS independent package is available on [PyPI](https://pypi.org/project/retrie):

```sh
$ pip install retrie
```


## Usage

[![readthedocs](https://readthedocs.org/projects/retrie/badge/?version=latest)](https://retrie.readthedocs.io)

For documentation, see [retrie.readthedocs.io](https://retrie.readthedocs.io/en/stable/_code_reference/retrie.html).

The following objects are all subclasses of `retrie.retrie.Retrie`, which handles filling the Trie and compiling the corresponding regex pattern.


#### Blacklist

The `Blacklist` object can be used to filter out bad occurrences in a text or a sequence of strings:
```py
from retrie.retrie import Blacklist

# check out docstrings and methods
help(Blacklist)

blacklist = Blacklist(["abc", "foo", "abs"], match_substrings=False)
blacklist.compiled
# re.compile(r'(?<=\b)(?:ab[cs]|foo)(?=\b)', re.IGNORECASE|re.UNICODE)
assert not blacklist.is_blacklisted("a foobar")
assert tuple(blacklist.filter(("good", "abc", "foobar"))) == ("good", "foobar")
assert blacklist.cleanse_text("good abc foobar") == "good  foobar"

blacklist = Blacklist(["abc", "foo", "abs"], match_substrings=True)
blacklist.compiled
# re.compile(r'(?:ab[cs]|foo)', re.IGNORECASE|re.UNICODE)
assert blacklist.is_blacklisted("a foobar")
assert tuple(blacklist.filter(("good", "abc", "foobar"))) == ("good",)
assert blacklist.cleanse_text("good abc foobar") == "good  bar"
```
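
The effect of `match_substrings` can be reproduced with the standard `re` module alone. The following is only a sketch of the behaviour shown above, not retrie's code: the helper names are hypothetical, and plain `\b` anchors stand in for the lookaround form printed in the comments.

```python
import re

PATTERN = r"(?:ab[cs]|foo)"  # the trie-based union from the examples above

substring_re = re.compile(PATTERN, re.IGNORECASE)
whole_word_re = re.compile(r"\b" + PATTERN + r"\b", re.IGNORECASE)


def is_listed(compiled, text):
    """True if any listed term occurs in the text."""
    return compiled.search(text) is not None


def drop_listed(compiled, items):
    """Keep only the items that contain no listed term."""
    return [item for item in items if not compiled.search(item)]


def remove_matches(compiled, text):
    """Delete every listed term from the text."""
    return compiled.sub("", text)


# "foo" inside "foobar" only counts when substrings are allowed
assert not is_listed(whole_word_re, "a foobar")
assert is_listed(substring_re, "a foobar")
```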


#### Whitelist

Similar methods are available for the `Whitelist` object:
```py
from retrie.retrie import Whitelist

# check out docstrings and methods
help(Whitelist)

whitelist = Whitelist(["abc", "foo", "abs"], match_substrings=False)
whitelist.compiled
# re.compile(r'(?<=\b)(?:ab[cs]|foo)(?=\b)', re.IGNORECASE|re.UNICODE)
assert not whitelist.is_whitelisted("a foobar")
assert tuple(whitelist.filter(("bad", "abc", "foobar"))) == ("abc",)
assert whitelist.cleanse_text("bad abc foobar") == "abc"

whitelist = Whitelist(["abc", "foo", "abs"], match_substrings=True)
whitelist.compiled
# re.compile(r'(?:ab[cs]|foo)', re.IGNORECASE|re.UNICODE)
assert whitelist.is_whitelisted("a foobar")
assert tuple(whitelist.filter(("bad", "abc", "foobar"))) == ("abc", "foobar")
assert whitelist.cleanse_text("bad abc foobar") == "abcfoo"
```
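
Whitelisting is the mirror image: instead of deleting matches, only the matches are kept. A minimal standard-library sketch of that behaviour, with hypothetical helper names and assuming plain `\b` anchors for the whole-word case, could look like this:

```python
import re

PATTERN = r"(?:ab[cs]|foo)"

substring_re = re.compile(PATTERN, re.IGNORECASE)
whole_word_re = re.compile(r"\b" + PATTERN + r"\b", re.IGNORECASE)


def keep_listed(compiled, items):
    """Keep only the items that contain a listed term."""
    return [item for item in items if compiled.search(item)]


def keep_matches(compiled, text):
    """Concatenate the matched terms and discard everything else."""
    return "".join(match.group() for match in compiled.finditer(text))


assert keep_matches(whole_word_re, "bad abc foobar") == "abc"
assert keep_matches(substring_re, "bad abc foobar") == "abcfoo"
```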


#### Replacer

The `Replacer` object does a fast single-pass search & replace for occurrences of `replacement_mapping.keys()` with corresponding values.
```py
from retrie.retrie import Replacer

# check out docstrings and methods
help(Replacer)

replacement_mapping = dict(zip(["abc", "foo", "abs"], ["new1", "new2", "new3"]))

replacer = Replacer(replacement_mapping, match_substrings=True)
replacer.compiled
# re.compile(r'(?:ab[cs]|foo)', re.IGNORECASE|re.UNICODE)
assert replacer.replace("ABS ...foo... foobar") == "new3 ...new2... new2bar"

replacer = Replacer(replacement_mapping, match_substrings=False)
replacer.compiled
# re.compile(r'\b(?:ab[cs]|foo)\b', re.IGNORECASE|re.UNICODE)
assert replacer.replace("ABS ...foo... foobar") == "new3 ...new2... foobar"

replacer = Replacer(replacement_mapping, match_substrings=False, re_flags=None)
replacer.compiled  # on py3, re.UNICODE is always enabled
# re.compile(r'\b(?:ab[cs]|foo)\b')
assert replacer.replace("ABS ...foo... foobar") == "ABS ...new2... foobar"

replacer = Replacer(replacement_mapping, match_substrings=False, word_boundary=" ")
replacer.compiled
# re.compile(r'(?<= )(?:ab[cs]|foo)(?= )', re.IGNORECASE|re.UNICODE)
assert replacer.replace(". ABS ...foo... foobar") == ". new3 ...foo... foobar"
```
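
The single-pass idea itself needs nothing beyond `re.sub` with a callback. The sketch below is an illustration under assumptions rather than retrie's implementation: `make_replacer` is a hypothetical helper, it uses a plain regex union instead of a trie-based one, and the lower-cased lookup assumes that no two keys differ only by case.

```python
import re


def make_replacer(mapping, match_substrings=True):
    """Return a function that replaces all keys of `mapping` in one pass."""
    # longest keys first, so a plain union prefers the longest alternative
    keys = sorted(mapping, key=len, reverse=True)
    union = "(?:" + "|".join(re.escape(key) for key in keys) + ")"
    if not match_substrings:
        union = r"\b" + union + r"\b"
    compiled = re.compile(union, re.IGNORECASE)
    # IGNORECASE matches may differ in case from the keys, so look up lower-cased
    lookup = {key.lower(): value for key, value in mapping.items()}
    return lambda text: compiled.sub(lambda m: lookup[m.group().lower()], text)


mapping = {"abc": "new1", "foo": "new2", "abs": "new3"}
assert make_replacer(mapping)("ABS ...foo... foobar") == "new3 ...new2... new2bar"

# Replaced text is never rescanned, so swapping two words works as intended.
# Chained str.replace calls would turn this input into "cat chases cat".
swap = make_replacer({"cat": "dog", "dog": "cat"})
assert swap("cat chases dog") == "dog chases cat"
```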


## Development

[![gitmoji](https://img.shields.io/badge/gitmoji-%20%F0%9F%98%9C%20%F0%9F%98%8D-ffdd67)](https://github.com/carloscuesta/gitmoji-cli)
[![pre-commit](https://img.shields.io/badge/pre--commit-enabled-brightgreen?logo=pre-commit&logoColor=white)](https://github.com/pre-commit/pre-commit)

Run `make help` for options like installing for development, linting and testing.

            

Raw data

{
    "_id": null,
    "home_page": "https://github.com/ddelange/retrie",
    "name": "retrie",
    "maintainer": "",
    "docs_url": null,
    "requires_python": ">=2.7, !=3.0.*, !=3.1.*, !=3.2.*, !=3.3.*, !=3.4.*",
    "maintainer_email": "",
    "keywords": "pure-Python regex trie regex-trie blacklist whitelist re search replace",
    "author": "ddelange",
    "author_email": "ddelange@delange.dev",
    "download_url": "https://files.pythonhosted.org/packages/ad/16/6d11b6db5d7c173f891452b6b8eba5c5e7cf3be62184b83ee254d2996f3f/retrie-0.3.0.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "Efficient Trie-based regex unions for blacklist/whitelist filtering and one-pass mapping-based string replacing",
    "version": "0.3.0",
    "project_urls": {
        "Homepage": "https://github.com/ddelange/retrie"
    },
    "split_keywords": [
        "pure-python",
        "regex",
        "trie",
        "regex-trie",
        "blacklist",
        "whitelist",
        "re",
        "search",
        "replace"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "411cabb28397ae9886e07776fcd11f826fac3d03aefb477da51540d9df9fc889",
                "md5": "c7dd6d79da0d537b80457e35deec68a2",
                "sha256": "81f475145ab91831a49a8c89d886f33cdc4d30d14fe5baeb05c74fa6bce9256f"
            },
            "downloads": -1,
            "filename": "retrie-0.3.0-py2.py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "c7dd6d79da0d537b80457e35deec68a2",
            "packagetype": "bdist_wheel",
            "python_version": "py2.py3",
            "requires_python": ">=2.7, !=3.0.*, !=3.1.*, !=3.2.*, !=3.3.*, !=3.4.*",
            "size": 9169,
            "upload_time": "2024-02-22T07:32:42",
            "upload_time_iso_8601": "2024-02-22T07:32:42.498207Z",
            "url": "https://files.pythonhosted.org/packages/41/1c/abb28397ae9886e07776fcd11f826fac3d03aefb477da51540d9df9fc889/retrie-0.3.0-py2.py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "ad166d11b6db5d7c173f891452b6b8eba5c5e7cf3be62184b83ee254d2996f3f",
                "md5": "559a8f63a8d18fd67713d3ea0b73851d",
                "sha256": "d103f20d57d782888478d31010cfd074edd3d318202f9683fe0b73c23657fac2"
            },
            "downloads": -1,
            "filename": "retrie-0.3.0.tar.gz",
            "has_sig": false,
            "md5_digest": "559a8f63a8d18fd67713d3ea0b73851d",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=2.7, !=3.0.*, !=3.1.*, !=3.2.*, !=3.3.*, !=3.4.*",
            "size": 10116,
            "upload_time": "2024-02-22T07:32:43",
            "upload_time_iso_8601": "2024-02-22T07:32:43.973925Z",
            "url": "https://files.pythonhosted.org/packages/ad/16/6d11b6db5d7c173f891452b6b8eba5c5e7cf3be62184b83ee254d2996f3f/retrie-0.3.0.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-02-22 07:32:43",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "ddelange",
    "github_project": "retrie",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "lcname": "retrie"
}
        