# fuzzy-lightning
Fuzzy lightning is a fast and customizable package for finding the closest matches in a list of target strings (documents) using fuzzy string matching. It is particularly effective for short string matching against large document sets, and includes the fastest implementation of the Damerau-Levenshtein and longest common substring algorithms in its class.
## Introduction
Fuzzy lightning works by:
1. Converting strings to embedding vectors using an sklearn TF-IDF vectorizer on character n-grams.
2. Generating a shortlist of match candidates from the top N nearest neighbours (by cosine similarity).
3. Pruning the candidate list to select the best match, using the ratio of longest common substring to string length.
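Under simplifying assumptions, steps 1 and 2 can be sketched in plain Python — raw character n-gram counts stand in for TF-IDF weights, and brute-force scoring stands in for nearest-neighbour search. The function names here are illustrative, not part of the package's API:

```python
from collections import Counter
from math import sqrt

def char_ngrams(s: str, n: int = 3) -> Counter:
    # Character n-gram counts: the features the TF-IDF vectorizer operates on.
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))

def cosine(u: Counter, v: Counter) -> float:
    # Cosine similarity between two sparse count vectors.
    dot = sum(u[k] * v[k] for k in u)
    norm = sqrt(sum(c * c for c in u.values())) * sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0

def shortlist(query: str, documents: list[str], top_n: int = 2) -> list[str]:
    # Rank documents by n-gram cosine similarity and keep the top N candidates.
    q = char_ngrams(query)
    ranked = sorted(documents, key=lambda d: cosine(q, char_ngrams(d)), reverse=True)
    return ranked[:top_n]
```

For the query `'SMARTPIGGIE'` against the documents above, `'SMARTPIG'` shares six of its trigrams with the query and ranks first, which is the shortlist that step 3 then prunes.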
### Quick Start
#### Installation
`pip install fuzzy-lightning`
Finding the closest matches in a list of documents for a list of input strings:
```
from fuzzy_lightning import FuzzyMatch
documents = ["SMARTEST ENERGY", "SMARTPIG"]
fuzzy_matcher = FuzzyMatch(documents=documents)
strings = ['SMARTPIGGIE', 'THE SMARTEST ENERGY']
matches = fuzzy_matcher.get_document_matches(strings=strings)
print(matches)
>>> [
DocumentMatch(match='SMARTPIG', confidence=1.0),
DocumentMatch(match='SMARTEST ENERGY', confidence=1.0)
]
```
The output is a list of DocumentMatch objects, each with a match attribute containing the closest matching document and a confidence attribute representing the confidence of the match (a value between 0 and 1).
If you want to find the closest match for a single string, use the get_document_match method:
```
match = fuzzy_matcher.get_document_match('SMARTPIGGIE')
print(match)
>>> DocumentMatch(match='SMARTPIG', confidence=1.0)
```
### Configuration
The FuzzyMatch class has a number of configurable parameters that you can set using the `FuzzyMatchConfig` class.
- **n_gram_range** (Tuple[int, int]): Range of lengths of n-grams to use with the TF-IDF vectorizer. For example,
n_gram_range = (2, 3) will use bi-grams and tri-grams.
- **min_document_freq** (int, optional): Minimum number of documents a term must appear in to be considered.
Defaults to 1.
- **tfidf_similarity_threshold** (float, optional): Minimum cosine similarity to a viable candidate for LCS.
Defaults to 0.1.
- **n_top_candidates** (int, optional): Maximum number of candidates to return that exceed the
similarity_threshold. Defaults to 40.
- **lcs_min_characters** (int): Minimum length of the string to qualify for matching to the target strings.
- **lcs_min_length_ratio** (float): Minimum ratio of string length to target string length for a string <> target
string match to qualify.
- **lcs_similarity_threshold** (float, optional): Minimum LCS match ratio to accept classification.
- **use_threads** (bool): Whether to use threads to parallelise the work of finding the n_top_candidates for each
string.
- **n_threads** (int): Number of threads to use when finding the n_top_candidates. Increasing the number of threads
reduces the run time, but there is a trade-off in production, where 'thread congestion' can occur.
- **string_preprocessor** (Optional[Callable[[str], str]]): A callable that takes in a string and returns a processed
string. This can be used to perform any preprocessing steps on the input strings before they are compared.
For example, to change the range of n-grams used by the TF-IDF vectorizer, and to add some string preprocessing prior
to the fuzzy matching you can do the following:
```
from fuzzy_lightning import FuzzyMatch, FuzzyMatchConfig
def preprocessor(string: str) -> str:
return string.lower().replace(" ", "")
config = FuzzyMatchConfig(n_gram_range=(1, 2), string_preprocessor=preprocessor)
fuzzy_matcher = FuzzyMatch(documents=['abc', 'def'], config=config)
```
## Longest Common Substring
Finds the longest substring that is common to two strings; this is used to calculate the confidence of the fuzzy match.
```
from fuzzy_lightning import lcs
lcs.longest_common_substring_length('beersteinbeer', 'stein')
>>> 5
```
In terms of latency, a single longest common substring length call takes under 0.5μs:
```
from fuzzy_lightning import lcs
from timeit import timeit
n_iter = 1000000
timeit("lcs.longest_common_substring_length('beersteinbeer', 'stein')", globals=globals(), number=n_iter)/n_iter
>>> 3.182171669322997e-07
```
## Edit Distance Algorithms
### Damerau-Levenshtein
The Damerau-Levenshtein algorithm is a string edit distance algorithm that calculates the minimum number of operations (insertions, deletions, substitutions, and transpositions) required to transform one string into another. It is essentially Levenshtein distance with the addition of transpositions.
```
from fuzzy_lightning import edit_distance as ed
dist = ed.damerau_levenshtein('my nam is spetl wrrong', 'my name is spelt wrong')
print(dist)
>>> 3
```
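For reference, the distance in the example above can be reproduced with a plain-Python dynamic-programming sketch of the optimal string alignment variant of Damerau-Levenshtein (the package's implementation is in C++ and far faster; this is only to illustrate the algorithm):

```python
def damerau_levenshtein(a: str, b: str) -> int:
    """Optimal string alignment: Levenshtein plus adjacent transpositions."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i  # cost of deleting i characters
    for j in range(n + 1):
        d[0][j] = j  # cost of inserting j characters
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,        # deletion
                d[i][j - 1] + 1,        # insertion
                d[i - 1][j - 1] + cost, # substitution (or match)
            )
            # Transposition of two adjacent characters.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[m][n]
```

On the example strings the three operations are: insert 'e' (nam → name), transpose 'tl' (spetl → spelt), and delete one 'r' (wrrong → wrong).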
## Appendix
### Why is this super fast?
1. C++
2. Dynamic Programming
3. Cache locality benefits of using a 1D array to mimic the behaviour of a 2D array
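Point 3 can be illustrated with a sketch of the longest-common-substring DP that keeps only a single row of the table, updating it in place by iterating right-to-left so each cell still reads the previous row's value. In C++ the same layout keeps the working set small and contiguous in cache; the function name here is illustrative, not the package's API:

```python
def lcs_length(a: str, b: str) -> int:
    # row[j] holds the length of the common substring ending at the
    # current character of `a` and at b[j-1].
    row = [0] * (len(b) + 1)
    best = 0
    for ch in a:
        # Right-to-left, so row[j-1] is still the previous row's value.
        for j in range(len(b), 0, -1):
            if ch == b[j - 1]:
                row[j] = row[j - 1] + 1
                best = max(best, row[j])
            else:
                row[j] = 0
    return best
```

This uses O(len(b)) memory instead of O(len(a) * len(b)) while computing the same answer as the 2D table.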