Approximate (aka "fuzzy") matching of sequences of tokens, implemented
on top of the `libsdcxx` library.
The library offers higher-level (and more user-friendly) interfaces to
`libsdcxx`:
* `approxism.Extractor`, a high-level, dictionary-based information extractor
* `approxism.Matcher`, a middle-ground, general-purpose approximate string matcher
The project uses the NLTK Punkt sentence splitter.
In order to speed up the matching (or for other purposes), one may use
the provided collections of stop words for almost 50 languages. As the
intended use is typically named-entity recognition, stop words are
probably safe to disregard. Note, though, that stop words are **not**
removed from the text altogether; matches are merely prevented from
beginning or ending with a stop word. Matches may still contain them
(and they do contribute to the matching score).
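To make that behaviour concrete, here is a brief sketch using the `Matcher` interface from the Usage section below (the text, pattern and threshold are made up for illustration):

``` Python
from approxism import Matcher

matcher = Matcher("english")  # stop-word stripping is enabled by default

# "of" is an English stop word: a reported match may *contain* it
# (e.g. "Bank of England"), but no match will begin or end with it.
for match in matcher.text("The Bank of England raised rates.").match("Bank of England", 0.8):
    print(match)
```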
Token lowercasing is also provided, with an option to keep short
acronyms in uppercase (in order to prevent matches with common short
words). Injection of custom token transforms is supported.
See <http://github.com/vencik/libsdcxx>
# Build and installation
Python v3.7 or newer is supported. For the Python package build, you
shall need `pip`, Python `distutils` and the `wheel` package. If you
wish to run the Python UTs (which is highly recommended), you shall also
need `pytest`.
E.g. on Debian-based (or similar, `apt`-using) systems, the following
should get you the required prerequisites, unless you wish to use
`pyenv`:
``` sh
# apt-get install git
# apt-get install python3-pip python3-distutils  # unless you use pyenv
$ pip install wheel pytest  # ditto, better do that in a pyenv sandbox
```
On Mac OS X, you’ll need the Xcode tools and Homebrew. Then, install the
required prerequisites:
``` sh
$ brew install git
```
If you do wish to use `pyenv` to create and manage the project sandbox
(which is advisable), see the short intro in the subsection below.
Anyhow, with all the prerequisites installed, clone the project:
``` sh
$ git clone https://github.com/vencik/approxism.git
```
Build the project, run UTs and build packages:
``` sh
$ cd approxism
$ ./build.sh -ugp
```
Note that the `build.sh` script has options; run `$ ./build.sh -h` to
see them.
If you wish, use `pip` to install the Python package:
``` sh
# pip install approxism-*.whl
```
Note that it’s recommended to use `pyenv`, especially for development
purposes.
## Managing project sandbox with `pyenv`
First, install `pyenv`. You may use either your OS package repo (Homebrew
on Mac OS X) or the web `pyenv` installer. Set up `pyenv` (set the
environment variables) as instructed.
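For reference, the usual shell setup looks something like the snippet below; the exact lines printed by your installer take precedence:

``` sh
# Typical pyenv shell setup; follow your installer's instructions if they differ
export PYENV_ROOT="$HOME/.pyenv"
export PATH="$PYENV_ROOT/bin:$PATH"
eval "$(pyenv init -)"
eval "$(pyenv virtualenv-init -)"  # required by the virtualenv plugin
```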
Then create the `approxism` project sandbox, thus:
``` sh
$ pyenv install 3.9.6                # your choice of the Python interpreter version, >= 3.7
$ pyenv virtualenv 3.9.6 approxism
```
Now, you can always (de)activate the virtual environment (switch to the
sandbox) by
``` sh
$ pyenv activate approxism
$ pyenv deactivate
```
In the sandbox, install Python packages using `pip` as usual:
``` sh
$ pip install wheel pytest
```
# Usage
## High-level, NER-like extractor
A very simple use case, assuming that you have collected the interesting
terms you need to search for in your texts in a JSON file with the
following structure:
**`my_dictionary.json`.**
``` JSON
{
    "multi-word term" : {
        "whatever" : "other fields you wish",
        "etc" : "really, whatever you like"
    },
    "another term..." : {
    }
}
```
The `_matching_threshold` field is optional; see
`approxism.Extractor.Dictionary.Record`.
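For illustration, a record with a per-term matching threshold might look like this (a hedged sketch; the field values are made up, and the exact semantics are documented on `approxism.Extractor.Dictionary.Record`):

``` JSON
{
    "multi-word term" : {
        "_matching_threshold" : 0.9,
        "whatever" : "other fields as above"
    }
}
```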
``` Python
from dataclasses import dataclass
import json

from approxism import Extractor
from approxism.transforms import Lowercase


@dataclass
class MyRecord(Extractor.Dictionary.Record):
    whatever: str  # or whatever type you wish, as long as it's JSON (de)serialisable
    etc: str       # ditto


with open("my_dictionary.json", "r", encoding="utf-8") as json_fd:
    dictionary = {
        term: MyRecord(**record)
        for term, record in json.load(json_fd).items()
    }

extractor = Extractor(
    dictionary,
    default_threshold=0.8,
    language="english",
    token_transform=[Lowercase(min_len=4, except_caps=True)],
)

for match in extractor.extract(my_text):  # my_text: your input text string
    print(match)
```
## Approximate string matching (generic) layer
``` Python
from approxism import Matcher

matcher = Matcher("english")  # that's the default

for match in matcher.text("Hello world!").match("worl", 0.8):  # score >= 0.8 is required
    print(match)

# Of course, one may like to store the (pre-processed) text and/or patterns:

txt1 = matcher.text("My text about Sørensen–Dice coefficient.")  # text preprocessing
txt2 = matcher.text("And one about correlation coefficient.")
bgr1 = matcher.sequence_bigrams("Sørensen–Dice")  # pattern preprocessing
bgr2 = matcher.sequence_bigrams("coefficient")

for text in (txt1, txt2):
    print(f"Searching in \"{text}\"")
    for bgr in (bgr1, bgr2):
        for match in text.match(bgr):  # pattern matching
            print(f"Found {bgr}: {match}")

# Pre-processing long texts produces relatively large data structures
# (space complexity is O(n^2) where n is the number of tokens in a sentence).
# Matcher.text splits the text into sentences.
# However, if you already have the text split, or you prefer to process it
# sentence-by-sentence (which is recommended), you may use Matcher.sentences to split it
# and pre-process each sentence using Matcher.sentence for matching:

for sentence in matcher.sentences(my_long_text):  # sentence string generator
    sentence = matcher.sentence(sentence)  # sentence preprocessing
    for bgr in (bgr1, bgr2):
        for match in sentence.match(bgr):  # pattern matching
            print(f"Found {bgr}: {match}")

# Should you like to lowercase tokens, simply pass the matcher token transform(s):
from approxism.transforms import Lowercase

matcher = Matcher(
    language="french",
    strip_stopwords=False,  # the default is True
    token_transform=[Lowercase()],  # lowercase tokens
)

# The Lowercase transformer supports keeping short acronyms in uppercase:
Lowercase(min_len=4, except_caps=True)  # this will lowercase tokens of at least 4 chars,
                                        # but will also lowercase shorter ones UNLESS
                                        # they are in all CAPS; e.g. "AMI" (AWS machine
                                        # image) shall be kept as is and therefore won't
                                        # get mistaken for a friend...

# You may add more transforms of yours; just implement the Matcher.TokenTransform interface.

# Lastly, when specifying language, note that not all languages may be available.
# The list of available tokeniser languages is obtained by calling:
from approxism import Tokeniser
Tokeniser.available()

# Similarly, the list of available stop words languages is obtained by calling:
from approxism import Stopwords
Stopwords.available()

# The matcher allows you to specify how to proceed if your language is not available.
# By default, an exception is thrown.
# However, passing the strict_language=False parameter suppresses it, using the default
# language for tokenisation (and no stop words, if they are not available).
Matcher(language="martian", strict_language=False)

# The above shan't throw; instead, Matcher.default_language shall be used
# for tokenisation (and no stop words shall be used, unless somebody collects Martian
# stop words any time soon... ;-))
```
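Building on the note above about custom transforms, here is a minimal sketch of one; the class, its diacritics-stripping logic, and the assumption that `Matcher.TokenTransform` reduces to a per-token string mapping via `__call__` are all illustrative, not library facts:

``` Python
import unicodedata

from approxism import Matcher


class AsciiFold(Matcher.TokenTransform):
    """Hypothetical transform stripping diacritics, so that e.g. "café" ~ "cafe".

    Assumption: Matcher.TokenTransform is implemented by providing a
    token -> token mapping via __call__; check the interface in the
    sources before relying on this sketch.
    """

    def __call__(self, token: str) -> str:
        # Decompose accented characters, then drop the combining marks
        return unicodedata.normalize("NFKD", token) \
            .encode("ascii", "ignore").decode("ascii")


matcher = Matcher(language="english", token_transform=[AsciiFold()])
```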
For more elaborate examples of use, check out the matcher unit
tests in `src/approxism/unit_test/test_matcher.py`.
# License
The software is available open-source under the terms of the 3-clause BSD
license.
# Author
Václav Krpec <vencik@razdva.cz>