This library may be used to compare strings by a similarity measure,
defined as the Sørensen–Dice coefficient calculated on multi-sets of the
strings' "bigrams" (all pairs of adjacent characters).

A bigram (multi-)set is relatively easy to generate (*O(n)* time
complexity lower bound in terms of the string length).

Also, compared to just using a (multi-)set of characters, bigrams
retain a certain level of word structure. E.g. words spelled backwards
are not deemed similar (unlike when a single-character set is used).

Just note that single-character words are all perfectly similar under
this metric, as their bigram sets are empty---and therefore the same. It
may be a good idea to augment such words with an additional character
(like a whitespace) to mitigate that problem.
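The measure is easy to reproduce. The following pure-Python sketch (an illustration of the definition above, not the library's API; the helper names are made up) builds the bigram multi-set with `collections.Counter` and computes the coefficient, including the degenerate empty-bigram case:

``` Python
from collections import Counter

def bigram_multiset(s: str) -> Counter:
    """Multi-set of all pairs of adjacent characters of s."""
    return Counter(s[i:i + 2] for i in range(len(s) - 1))

def sorensen_dice(a: str, b: str) -> float:
    """Sørensen–Dice coefficient of the two strings' bigram multi-sets."""
    bgrms_a, bgrms_b = bigram_multiset(a), bigram_multiset(b)
    total = sum(bgrms_a.values()) + sum(bgrms_b.values())
    if total == 0:       # e.g. two single-character words: both sets empty,
        return 1.0       # hence "perfectly similar" (the caveat above)
    isect = sum((bgrms_a & bgrms_b).values())  # multi-set intersection size
    return 2.0 * isect / total

print(sorensen_dice("Hello world!", "Hell or woes."))  # moderately similar
print(sorensen_dice("stop", "pots"))                   # reversed word: 0.0
print(sorensen_dice("a", "b"))                         # empty bigram sets: 1.0
```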
Using the bigram multiset implementation above, a sequence matcher
implementation is also available. Given a sequence of text tokens, the
matcher allows "fuzzy" matching of strings (again using their bigrams)
to sub-sequences of the text tokens.
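To illustrate the idea only (this is a naive quadratic sketch, not the library's optimised algorithm, and it omits the "strip"-token boundary restriction shown in the Usage section): score every contiguous sub-sequence of tokens against the query's bigrams and report those above a threshold.

``` Python
from collections import Counter

def bigrams(s: str) -> Counter:
    """Multi-set of adjacent character pairs."""
    return Counter(s[i:i + 2] for i in range(len(s) - 1))

def match(tokens, query, threshold):
    """Score every contiguous token sub-sequence against the query string's
    bigrams; yield (begin, end, score) for scores above the threshold."""
    query_bgrms = bigrams(query)
    query_card = sum(query_bgrms.values())
    for beg in range(len(tokens)):
        acc = Counter()                      # union of the sub-sequence's bigrams
        for end in range(beg + 1, len(tokens) + 1):
            acc += bigrams(tokens[end - 1])  # extend sub-sequence by one token
            total = query_card + sum(acc.values())
            isect = sum((query_bgrms & acc).values())
            score = 2.0 * isect / total if total else 1.0
            if score >= threshold:
                yield beg, end, score

for beg, end, score in match(["Hello", " ", "world"], "Hello world", 0.8):
    print(beg, end, round(score, 3))  # only (0, 3) passes the 0.8 threshold
```

Note that the bigrams of the token sub-sequence are a union over per-token bigrams, so they differ from the bigrams of the joined text (no bigrams span token boundaries).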
See detailed documentation in the `doc` directory. Rendered documentation:
- <https://htmlpreview.github.io/?https://github.com/vencik/libsdcxx/blob/master/doc/sequence_matcher.html>
# List of features
- Bigram multi-set objects implemented as sorted lists of character
tuples with count (*O(n log n)* creation time complexity in terms of
the string length)
- Union operation has *O(m+n)* time complexity (sum of multi-sets'
cardinalities at most)
- Intersection doesn’t produce objects; only its size is calculated in
*O(m+n)* time
- Template implementation, allowing for both ASCII/ANSI characters and
UNICODE characters
- Implementations using `std::multiset` and `std::unordered_multiset`
also available
- Performance tests show that, long story short, the "custom"
implementation is the best (notably faster unions, intersection size
computation in similar or better time)
- Sequence matcher (using the best performing bigrams) with several
optimisations
- Python v3 binding is provided (as `pysdcxx` module, packaged)
- Python `multiset` based implementation also compared---and is
expectedly much slower
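The sorted-list representation makes both the union and the intersection size single merge-style passes over the two lists. A pure-Python sketch of the idea (illustrative only, with made-up helper names; the library implements this in C++):

``` Python
def union_sorted(a, b):
    """Union of two sorted (bigram, count) lists; one merge pass, O(m+n)."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i][0] < b[j][0]:
            out.append(a[i]); i += 1
        elif a[i][0] > b[j][0]:
            out.append(b[j]); j += 1
        else:  # same bigram in both multi-sets: counts add up
            out.append((a[i][0], a[i][1] + b[j][1])); i += 1; j += 1
    out.extend(a[i:]); out.extend(b[j:])
    return out

def intersect_size(a, b):
    """Intersection cardinality only; no intersection object is built."""
    size, i, j = 0, 0, 0
    while i < len(a) and j < len(b):
        if a[i][0] < b[j][0]:
            i += 1
        elif a[i][0] > b[j][0]:
            j += 1
        else:  # shared bigram contributes the smaller count
            size += min(a[i][1], b[j][1]); i += 1; j += 1
    return size

print(union_sorted([("ab", 1), ("bc", 2)], [("bc", 1), ("cd", 1)]))
# [('ab', 1), ('bc', 3), ('cd', 1)]
print(intersect_size([("ab", 1), ("bc", 2)], [("bc", 1), ("cd", 1)]))  # 1
```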
# Build and installation
You will need a C++ compiler with support for at least C++17. A recent
enough gcc (v8.3 or newer) is fine (older versions might work as well).
You will also need `cmake` and `make`.

Python v3.7 or newer is supported. For the Python package build, you
will need `pip`, Python `distutils` and the `wheel` package. If you
wish to run the Python UTs (which is highly recommended), you will also
need `pytest`.
E.g. on Debian-based (or similar, `apt`-using) systems, the following
should get you the required prerequisites (unless you wish to use
`pyenv`):
``` sh
# apt-get install g++ cmake make git
# apt-get install python3-pip python3-distutils  # unless you use pyenv
$ pip install wheel pytest  # ditto, better do that in pyenv sandbox
```
On Mac OS X, you’ll need Xcode tools and Homebrew. Then, install the
required prerequisites by
``` sh
$ brew install coreutils cmake make git
```
If you do wish to use `pyenv` to create and manage a project sandbox
(which is advisable), see the short intro in the subsection below.
Anyhow, with all the prerequisites installed, clone the project:
``` sh
$ git clone https://github.com/vencik/libsdcxx.git
```
Build the project, run UTs and build Python packages:
``` sh
$ cd libsdcxx
$ ./build.sh -ugp
```
Note that the `build.sh` script has options; run `$ ./build.sh -h` to
see them.
Most importantly, run with the `-s` or `--enable-pt` option to perform
performance tests. The performance tests compare computation times of
object construction, union operations and intersection size computations
across the implementations. If you install the `multiset` Python package
(using `pip`), the perf. tests shall also produce results for the
pure-Python implementation using `multiset.Multiset` (this is optional).
<div class="note">
The perf. tests are written on the Python level, so the measured times
also contain some (non-trivial) Python code runtime. I’d expect
results in native code to be notably better; but the test code is
identical, so the measurements should be meaningful for comparison.
</div>
If you wish, use `pip` to install the Python package:
``` sh
# pip install pysdcxx-*.whl
```
Note that it’s recommended to use `pyenv`, especially for development
purposes.
## Managing project sandbox with `pyenv`
First install `pyenv`. You may use either your OS package repo (Homebrew
on Mac OS X) or web `pyenv` installer. Setup `pyenv` (set environment
variables) as instructed.
Then, create `pysdcxx` project sandbox, thus:
``` sh
$ pyenv install 3.9.6  # your choice of the Python interpreter version, >= 3.7
$ pyenv virtualenv 3.9.6 pysdcxx
```
Now, you can always (de)activate the virtual environment (switch to the
sandbox) by
``` sh
$ pyenv activate pysdcxx
$ pyenv deactivate
```
In the sandbox, install Python packages using `pip` as usual.
``` sh
$ pip install wheel pytest
```
# Usage
## C++
### Using `bigrams`
``` C++
#include <libsdcxx/bigrams.hxx>
using bigrams = libsdcxx::bigrams; // wbigrams for UNICODE
const auto bgrms_empty = bigrams(); // empty bigrams set
const auto bgrms1 = bigrams("Hello world!"); // construct from string
size_t cnt = bgrms1.size(); // number of bigrams
std::cout << bgrms1; // serialisation
for (const auto & bigram_cnt: bgrms1) { // tuple of (bigram, count)
    const auto & bigram = std::get<0>(bigram_cnt);

    std::cout << "Bigram : " << std::get<0>(bigram) << std::get<1>(bigram) << std::endl;
    std::cout << "Count : " << std::get<1>(bigram_cnt) << std::endl;
}
// (Const.) iterators are supported via cbegin, cend and begin, end method calls
const auto bgrms2 = bigrams("Hell or woes.");
size_t isect_size = bigrams::intersect_size(bgrms1, bgrms2); // intersection cardinality
double sdc = bigrams::sorensen_dice_coef(bgrms1, bgrms2); // similarity, in [0,1]
auto uni0n = bgrms1 + bgrms2; // 2 bigrams union
auto uni0n2 = bigrams::unite(bgrms1, bgrms2 /* , ... */); // variadic union
uni0n += bigrams("more stuff"); // objects are mutable
```
### Using `sequence_matcher`
``` C++
#include <libsdcxx/sequence_matcher.hxx>
#include <libsdcxx/bigrams.hxx>
using sequence_matcher = libsdcxx::sequence_matcher; // wsequence_matcher for UNICODE
using bigrams = libsdcxx::bigrams; // wbigrams for UNICODE
auto matcher = sequence_matcher();
matcher.reserve(10); // reserve space for bigrams matrix for text of 10 tokens
// (reservation is not strictly necessary, but advisable)
const auto bgrms_hello = bigrams("Hello");
const auto bgrms_world = bigrams("world");
matcher.emplace_back("Prologue"); // create token bigrams in-place
matcher.emplace_back(" .", true); // it's a good idea to pad single-char strings...
matcher.emplace_back(" "); // to 2 characters (so that they produce a bigram)
matcher.push_back(bgrms_hello); // push existing bigrams back
matcher.emplace_back(" ", true); // true here stands for "strip" token; matched...
matcher.push_back(bgrms_world); // sub-sequences are restricted not to begin/end...
matcher.emplace_back(" !", true); // with such tokens
matcher.emplace_back(" ", true);
matcher.emplace_back("Epilogue");
matcher.emplace_back(" .", true);
const auto bgrms_helo_wordl = bigrams::unite( // note that bigrams of the whole
bigrams("Helo"), bigrams(" "), bigrams("wordl")); // sentence would differ
auto match = matcher.begin(bgrms_helo_wordl, 0.7); // match with threshold 0.7
for (; match != matcher.end(); ++match) {
std::cout << match << std::endl; // simple string form of match info, try it
std::cout
<< "Match bigrams: " << *match << std::endl // sub-sequence bigrams
<< "Match at index: " << match.begin() << std::endl // beginning token index
<< "Match end: " << match.end() << std::endl // index just past the end
<< "Match size: " << match.size() << std::endl // number of tokens
<< "Match score: " << match.sorensen_dice_coef() << std::endl;
}
// ... you may of course continue matching other sequences...
```
## Python v3
### Using `Bigrams`
``` Python
from pysdcxx import Bigrams # Python Bigrams are implemented by wbigrams, support UNICODE
bgrms_empty = Bigrams() # empty bigrams set
bgrms1 = Bigrams("Hello world!") # construct from string
cnt = len(bgrms1) # number of bigrams
print(str(bgrms1), f"{bgrms1}") # string serialisation
for bigram, cnt in bgrms1: # Bigrams are tuple[str, int] generators
assert len(bigram) == 2
bgrms2 = Bigrams("Hell or woes.")
isect_size = Bigrams.intersect_size(bgrms1, bgrms2) # intersection cardinality
sdc = Bigrams.sorensen_dice_coef(bgrms1, bgrms2) # similarity, in [0,1]
union = bgrms1 + bgrms2 # 2 bigrams union
union += Bigrams("more stuff") # objects are mutable
```
### Using `SequenceMatcher`
``` Python
from pysdcxx import SequenceMatcher, Bigrams
matcher = SequenceMatcher() # empty matcher
matcher = SequenceMatcher(reserve=4) # empty matcher, reserved space for 4 tokens
# (not necessary, but speeds up token addition)
matcher.append("Hello") # append token (Bigrams are constructed)
matcher.append(Bigrams(" ")) # append token bigrams directly
matcher.append("world", strip=False) # "strip" token means that no match may begin...
matcher.append(" !", strip=True) # nor end by that token
# Alternatively, you may pass `Iterable` of tokens directly to the constructor.
# If the `Iterable` length can be taken, reservation of space is done; otherwise,
# you may still use the `reserve` constructor parameter if you know how many
# tokens there shall be... Again, if you don't, the constructor will handle it anyway
# (construction may just take a bit longer).
strip = True
matcher = SequenceMatcher([
"This", (" ",strip), "uses", (" ",strip),
"Sørensen", (" -",strip), "Dice", (" ",strip),
"coefficient", (" .",strip),
])
for match in matcher.match(["Sørenson", "and", "Dice"], 0.65): # matching
print(f"Match begin : {match.begin}") # 4 (index of the 1st match token)
print(f"Match end : {match.end}") # 7 (1 past the last match token)
print(f"Match score : {match.score}") # >0.65, <1.0 as it's not a perfect match
# You may continue matching other sequences
# Note that this is only a quick summary; see `SequenceMatcher` docstrings for more...
```
# License
The software is available open-source under the terms of 3-clause BSD
license.
# Author
Václav Krpec \<<vencik@razdva.cz>\>