leven-search

Name: leven-search
Version: 0.1.0
Summary: Fast and flexible search in a dictionary using Levenshtein distance
Author: Piotr Kozłowski
Homepage: https://github.com/pkoz/leven-search
Upload time: 2024-01-15 23:38:27
Requires Python: >=3.8
License: MIT
Keywords: levenshtein, search, fuzzy, dictionary, trie
Requirements: No requirements were recorded.
# Efficient and Flexible Searching Within Levenshtein Distance

![Coverage](https://github.com/pkoz/leven-search/blob/main/test/coverage.svg)

### Introduction

Welcome to Leven-Search, a library for fast and flexible searching
of words within a specified Levenshtein distance.

This library is designed with Kaggle developers and researchers in mind,
as well as anyone else working in natural language processing, text analysis,
and similar domains where the closeness of strings is a pivotal aspect.

### What is Levenshtein Distance?

[Levenshtein distance](https://en.wikipedia.org/wiki/Levenshtein_distance) measures the difference between two
sequences.
In the context of strings, it is the minimum number of single-character edits
(insertions, deletions, or substitutions) required to change one word into another.

For example, the Levenshtein distance between "table" and "marble" is 2:

1. `table` → `mable` (substitution of `t` with `m`)
2. `mable` → `marble` (insertion of `r`)
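
To make the computation concrete, here is a minimal, library-independent dynamic-programming sketch of the distance (the classic Wagner-Fischer algorithm):

```python
def levenshtein(a: str, b: str) -> int:
    """Classic Wagner-Fischer dynamic programming, O(len(a) * len(b))."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = curr
    return prev[-1]


assert levenshtein("table", "marble") == 2
```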

### Design Goals

The library is designed with the following goals in mind:

- Efficient indexing of large datasets of words. Indexing about 40k words from the
  Brown corpus takes about 300ms on a modern laptop, and the index takes about 53MB of RAM.
- Flexibility in searching. The library allows searching for words within a specified
  Levenshtein distance and also allows configuring a specific edit cost for each operation.
  Other distance measures, such as a [keyboard distance](https://en.wiktionary.org/wiki/keyboard_distance),
  can be expressed through these costs (see the sketch after this list).
- Extensibility. The library is designed to be easily extensible to support other edit distances,
  such as [Damerau-Levenshtein distance](https://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance)
  or [Jaro-Winkler distance](https://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance).
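
As a rough illustration of the keyboard-distance idea, adjacent-key substitutions can be made cheaper through the granular cost configuration used in the Example section below. This is only a sketch: the key pairs are a small hand-picked sample, and the import path is an assumption.

```python
# Sketch: cheaper substitutions for a handful of adjacent QWERTY keys.
# Assumes GranularEditCostConfig and EditCost are importable from the package,
# as used in the Example section below.
from leven_search import EditCost, GranularEditCostConfig

adjacent_keys = [('q', 'w'), ('a', 's'), ('o', 'p'), ('n', 'm')]
keyboard_cost = GranularEditCostConfig(
    default_cost=2,
    edit_costs=[EditCost(a, b, 1) for a, b in adjacent_keys],
)
```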

### Performance

Example performance of the library, measured on a modern laptop with the Brown corpus (only words longer than 2 characters):

| Distance | Time per 1000 searches<br/>_(in seconds)_ |
|----------|-------------------------------------------|
| 0        | 0.0146                                    | 
| 1        | 0.3933                                    | 
| 1 (*)    | 0.4154                                    | 
| 2        | 7.9556                                    | 

(*) with per-letter edit cost granularity enabled
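
The timings above can be reproduced roughly with a simple loop like the one below. This is a sketch under stated assumptions: the Brown corpus as the dictionary and 1,000 randomly sampled queries, which is not necessarily the exact benchmark setup used for the table.

```python
import random
import time

import nltk
import leven_search

nltk.download('brown')
words = [w for w in nltk.corpus.brown.words() if len(w) > 2]

# Build the index once.
searcher = leven_search.LevenSearch()
for w in words:
    searcher.insert(w)

# Time 1000 lookups at distance 1.
queries = random.sample(words, 1000)
start = time.perf_counter()
for q in queries:
    searcher.find_dist(q, 1)
print(f"1000 searches at distance 1 took {time.perf_counter() - start:.4f} s")
```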

### Installation

To install the library, simply run:

```bash
pip install leven-search
```

### Usage

First, import the library:

```python
import leven_search
```

Then, create a LevenSearch object:

```python
searcher = leven_search.LevenSearch()
```

Next, add words to the searcher:

```python
searcher.insert("hello")
searcher.insert("world")
```

Finally, search for words within a specified Levenshtein distance:

```python
searcher.find_dist("mello", 1)

# Result:
# 	hello: ResultItem(word='hello', dist=1, updates=[m -> h])
```

### Example

The following example shows how to use the library to search for words from the Brown corpus:

```python
import nltk
import leven_search

# Download the Brown corpus
nltk.download('brown')

# Create a LevenSearch object
searcher = leven_search.LevenSearch()

for w in nltk.corpus.brown.words():
    if len(w) > 2:
        searcher.insert(w)

# Search for words within a Levenshtein distance
searcher.find_dist('komputer', 1)

# Result:
# 	computer: ResultItem(word='computer', dist=1, updates=[k -> c])

# Search for words within a Levenshtein distance using custom edit costs.
# The cost helpers are assumed here to be importable from the installed package
# (the import path may differ between versions).
from leven_search import GranularEditCostConfig, EditCost

cost = GranularEditCostConfig(default_cost=5, edit_costs=[EditCost('k', 'c', 2)])
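# With default_cost=5, every edit costs 5 except substituting 'k' with 'c',
# which costs only 2; so 'komputer' matches 'computer' at distance 2 below,
# while 'yomputer' needs the full default cost of 5.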

searcher.find_dist('komputer', 5, cost)
# Result:
#       computer: ResultItem(word='computer', dist=2, updates=[k -> c])

searcher.find_dist('yomputer', 5, cost)
# Result:
#       computer: ResultItem(word='computer', dist=5, updates=[y -> c])

searcher.find_dist('yomputer', 3, cost)
# Result:
#       None
```

            
