fuzzy-lightning


Namefuzzy-lightning JSON
Version 0.1.6 PyPI version JSON
download
home_pagehttps://github.com/tomukmatthews/fuzzy-lightning
SummaryPerform fast fuzzy string lookups.
upload_time2023-01-21 00:53:53
maintainerTom Matthews
docs_urlNone
authorTom Matthews
requires_python>=3.8.1,<4.0.0
license
keywords fuzzy string lookup search match similarity
VCS
bugtrack_url
requirements No requirements were recorded.
Travis-CI No Travis.
coveralls test coverage No coveralls.
            # fuzzy-lightning

Fuzzy lightning is a fast and customizable package for finding the closest matches in a list of target strings (documents) using fuzzy string matching. It is particularly effective for short string matching against large document sets, and includes the fastest implementation of the Damerau-Levenshtein and longest common substring algorithms in its class.

## Introduction
Fuzzy lightning works by:
1. Converts strings to embedding vectors using an sklearn TF-IDF vectorizer on character n-grams.
2. Generates a shortlist of match candidates from the top N nearest neighbours (using cosine similarity).
3. This list of candidates is then pruned to select the best match using the longest common substring to length ratio.

### Quick Start

#### Installation

`pip install fuzzy-lightning`

Finding the closest matches in a list of documents for a list of input strings:

```
from fuzzy_lightning import FuzzyMatch

documents = ["SMARTEST ENERGY", "SMARTPIG"]
fuzzy_matcher = FuzzyMatch(documents=documents)
strings = ['SMARTPIGGIE', 'THE SMARTEST ENERGY']
matches = fuzzy_matcher.get_document_matches(strings=strings)
print(matches)
>>> [
    DocumentMatch(match='SMARTPIG', confidence=1.0),
    DocumentMatch(match='SMARTEST ENERGY', confidence=1.0)
]
```

The output is a list of DocumentMatch objects, each with a match attribute that contains the closest matching document and a confidence attribute that represents the confidence of the match (a value between 0 and 1):

If you want to find the closest match for a single string, you can use the get_lookup_match method:

```
match = fuzzy_matcher.get_document_match('SMARTPIGGIE')
print(match)
>>> DocumentMatch(match='SMARTPIG', confidence=1.0)
```

### Configuration

The FuzzyMatch class has a number of configurable parameters that you can set using the `FuzzyMatchConfig` class. 

- **n_gram_range** (Tuple[int, int]): Range of lengths of n-grams to use with the TF-IDF vectorizer. For example,
    n_gram_range = (2, 3) will use bi-grams and tri-grams.
- **min_document_freq** (int, optional): Minimum number of documents a term must appear in to be considered.
    Defaults to 1.
- **tfidf_similarity_threshold** (float, optional): Minimum cosine similarity to a viable candidate for LCS.
    Defaults to 0.1.
- **n_top_candidates** (int, optional): Maximum number of candidates to return that exceed the
    similarity_threshold. Defaults to 40.
- **lcs_min_characters** (int): Minimum length of the string to qualify for matching to the target strings.
- **lcs_min_length_ratio** (float): Minimum ratio of string length to target string length for an string <> target
    string match to qualify.
- **lcs_similarity_threshold** (float, optional): Minimum LCS match ratio to accept classification.
use_threads** (bool): Whether to use threads to parallelise the work to find the n_top_candidates for each
    string.
- **n_threads** (int): Number of threads to use when finding the n_top_candidates. Increasing the number of threads
    reduces the run time, but there becomes a trade off in production where there may be 'thread congestion'.
- **string_preprocessor** (Optional[Callable[[str], str]]): A callable that takes in a string and returns a processed
    string. This can be used to perform any preprocessing steps on the input strings before they are compared.

For example, to change the range of n-grams used by the TF-IDF vectorizer, and to add some string preprocessing prior
to the fuzzy matching you can do the following:

```
from fuzzy_lightning import FuzzyMatch, FuzzyMatchConfig

def preprocessor(string: str) -> str:
    return string.lower().replace(" ", "")

config = FuzzyMatchConfig(n_gram_range=(1, 2), string_preprocessor=preprocessor)
fuzzy_matcher = FuzzyMatch(documents=['abc', 'def'], config=config)
```

## Longest Common Substring

Finds the longest substring that is common to two strings. It is used to calculate the confidence of the fuzzy match.

```
from fuzzy_lightning import lcs
lcs.longest_common_substring_length('beersteinbeer', 'stein')
>>> 5
```

In terms of latency, a single longest common substring length call is < 0.5μs

```
from fuzzy_lightning import lcs
from timeit import timeit

n_iter = 1000000
timeit("lcs.longest_common_substring_length('beersteinbeer', 'stein')", globals=globals(), number=n_iter)/n_iter

>>> 3.182171669322997e-07
```



## Edit Distance Algorithms

### Damerau-Levenshtein

The Damerau-Levenshtein algorithm is a string edit distance algorithm calculates the minimum number of operations (insertions, deletions, substitutions, and transpositions) required to transform one string into another. Basically Levenshtein but also
allow for transpositions.

```
from fuzzy_lightning import edit_distance as ed
dist = ed.damerau_levenshtein('my nam is spetl wrrong', 'my name is spelt wrong')
print(dist)
>>> 3
```

## Appendix

### Why is this super fast?

1. C++
2. Dynamic Programming
3. Cache locality benefits of using a 1D array to mimic the behaviour of a 2D array

            

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/tomukmatthews/fuzzy-lightning",
    "name": "fuzzy-lightning",
    "maintainer": "Tom Matthews",
    "docs_url": null,
    "requires_python": ">=3.8.1,<4.0.0",
    "maintainer_email": "tomukmatthews@gmail.com",
    "keywords": "Fuzzy,String,Lookup,Search,Match,Similarity",
    "author": "Tom Matthews",
    "author_email": "tomukmatthews@gmail.com",
    "download_url": "https://files.pythonhosted.org/packages/d3/9e/f3f78ec51ac2ac963f72434c472dad562556a04cfb77c271ae5f5a0bc2a3/fuzzy-lightning-0.1.6.tar.gz",
    "platform": null,
    "description": "# fuzzy-lightning\n\nFuzzy lightning is a fast and customizable package for finding the closest matches in a list of target strings (documents) using fuzzy string matching. It is particularly effective for short string matching against large document sets, and includes the fastest implementation of the Damerau-Levenshtein and longest common substring algorithms in its class.\n\n## Introduction\nFuzzy lightning works by:\n1. Converts strings to embedding vectors using an sklearn TF-IDF vectorizer on character n-grams.\n2. Generates a shortlist of match candidates from the top N nearest neighbours (using cosine similarity).\n3. This list of candidates is then pruned to select the best match using the longest common substring to length ratio.\n\n### Quick Start\n\n#### Installation\n\n`pip install fuzzy-lightning`\n\nFinding the closest matches in a list of documents for a list of input strings:\n\n```\nfrom fuzzy_lightning import FuzzyMatch\n\ndocuments = [\"SMARTEST ENERGY\", \"SMARTPIG\"]\nfuzzy_matcher = FuzzyMatch(documents=documents)\nstrings = ['SMARTPIGGIE', 'THE SMARTEST ENERGY']\nmatches = fuzzy_matcher.get_document_matches(strings=strings)\nprint(matches)\n>>> [\n    DocumentMatch(match='SMARTPIG', confidence=1.0),\n    DocumentMatch(match='SMARTEST ENERGY', confidence=1.0)\n]\n```\n\nThe output is a list of DocumentMatch objects, each with a match attribute that contains the closest matching document and a confidence attribute that represents the confidence of the match (a value between 0 and 1):\n\nIf you want to find the closest match for a single string, you can use the get_lookup_match method:\n\n```\nmatch = fuzzy_matcher.get_document_match('SMARTPIGGIE')\nprint(match)\n>>> DocumentMatch(match='SMARTPIG', confidence=1.0)\n```\n\n### Configuration\n\nThe FuzzyMatch class has a number of configurable parameters that you can set using the `FuzzyMatchConfig` class. \n\n- **n_gram_range** (Tuple[int, int]): Range of lengths of n-grams to use with the TF-IDF vectorizer. For example,\n    n_gram_range = (2, 3) will use bi-grams and tri-grams.\n- **min_document_freq** (int, optional): Minimum number of documents a term must appear in to be considered.\n    Defaults to 1.\n- **tfidf_similarity_threshold** (float, optional): Minimum cosine similarity to a viable candidate for LCS.\n    Defaults to 0.1.\n- **n_top_candidates** (int, optional): Maximum number of candidates to return that exceed the\n    similarity_threshold. Defaults to 40.\n- **lcs_min_characters** (int): Minimum length of the string to qualify for matching to the target strings.\n- **lcs_min_length_ratio** (float): Minimum ratio of string length to target string length for an string <> target\n    string match to qualify.\n- **lcs_similarity_threshold** (float, optional): Minimum LCS match ratio to accept classification.\nuse_threads** (bool): Whether to use threads to parallelise the work to find the n_top_candidates for each\n    string.\n- **n_threads** (int): Number of threads to use when finding the n_top_candidates. Increasing the number of threads\n    reduces the run time, but there becomes a trade off in production where there may be 'thread congestion'.\n- **string_preprocessor** (Optional[Callable[[str], str]]): A callable that takes in a string and returns a processed\n    string. This can be used to perform any preprocessing steps on the input strings before they are compared.\n\nFor example, to change the range of n-grams used by the TF-IDF vectorizer, and to add some string preprocessing prior\nto the fuzzy matching you can do the following:\n\n```\nfrom fuzzy_lightning import FuzzyMatch, FuzzyMatchConfig\n\ndef preprocessor(string: str) -> str:\n    return string.lower().replace(\" \", \"\")\n\nconfig = FuzzyMatchConfig(n_gram_range=(1, 2), string_preprocessor=preprocessor)\nfuzzy_matcher = FuzzyMatch(documents=['abc', 'def'], config=config)\n```\n\n## Longest Common Substring\n\nFinds the longest substring that is common to two strings. It is used to calculate the confidence of the fuzzy match.\n\n```\nfrom fuzzy_lightning import lcs\nlcs.longest_common_substring_length('beersteinbeer', 'stein')\n>>> 5\n```\n\nIn terms of latency, a single longest common substring length call is < 0.5\u03bcs\n\n```\nfrom fuzzy_lightning import lcs\nfrom timeit import timeit\n\nn_iter = 1000000\ntimeit(\"lcs.longest_common_substring_length('beersteinbeer', 'stein')\", globals=globals(), number=n_iter)/n_iter\n\n>>> 3.182171669322997e-07\n```\n\n\n\n## Edit Distance Algorithms\n\n### Damerau-Levenshtein\n\nThe Damerau-Levenshtein algorithm is a string edit distance algorithm calculates the minimum number of operations (insertions, deletions, substitutions, and transpositions) required to transform one string into another. Basically Levenshtein but also\nallow for transpositions.\n\n```\nfrom fuzzy_lightning import edit_distance as ed\ndist = ed.damerau_levenshtein('my nam is spetl wrrong', 'my name is spelt wrong')\nprint(dist)\n>>> 3\n```\n\n## Appendix\n\n### Why is this super fast?\n\n1. C++\n2. Dynamic Programming\n3. Cache locality benefits of using a 1D array to mimic the behaviour of a 2D array\n",
    "bugtrack_url": null,
    "license": "",
    "summary": "Perform fast fuzzy string lookups.",
    "version": "0.1.6",
    "split_keywords": [
        "fuzzy",
        "string",
        "lookup",
        "search",
        "match",
        "similarity"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "d39ef3f78ec51ac2ac963f72434c472dad562556a04cfb77c271ae5f5a0bc2a3",
                "md5": "241a89e2674ced3a857bd069dd2f66a5",
                "sha256": "7179d0f98c93f6367caba48f80f7daf1df157a63ca1d3827ed2cf91bdab56866"
            },
            "downloads": -1,
            "filename": "fuzzy-lightning-0.1.6.tar.gz",
            "has_sig": false,
            "md5_digest": "241a89e2674ced3a857bd069dd2f66a5",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.8.1,<4.0.0",
            "size": 12850,
            "upload_time": "2023-01-21T00:53:53",
            "upload_time_iso_8601": "2023-01-21T00:53:53.732305Z",
            "url": "https://files.pythonhosted.org/packages/d3/9e/f3f78ec51ac2ac963f72434c472dad562556a04cfb77c271ae5f5a0bc2a3/fuzzy-lightning-0.1.6.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2023-01-21 00:53:53",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "github_user": "tomukmatthews",
    "github_project": "fuzzy-lightning",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "lcname": "fuzzy-lightning"
}
        
Elapsed time: 7.00771s