jarowinkler


- Name: jarowinkler
- Version: 2.0.1
- Home page: https://github.com/maxbachmann/JaroWinkler
- Summary: library for fast approximate string matching using Jaro and Jaro-Winkler similarity
- Upload time: 2023-11-03 10:32:11
- Author: Max Bachmann
- Requires Python: >=3.8
- License: MIT
- Keywords: string, comparison, edit-distance
- Requirements: none recorded
            
<h1 align="center">
 JaroWinkler
</h1>
<p align="center">
  <a href="https://github.com/maxbachmann/JaroWinkler/actions">
    <img src="https://github.com/maxbachmann/JaroWinkler/workflows/Build/badge.svg"
         alt="Continuous Integration">
  </a>
  <a href="https://pypi.org/project/jarowinkler/">
    <img src="https://img.shields.io/pypi/v/jarowinkler"
         alt="PyPI package version">
  </a>
  <a href="https://www.python.org">
    <img src="https://img.shields.io/pypi/pyversions/jarowinkler"
         alt="Python versions">
  </a><br/>
  <a href="https://github.com/maxbachmann/JaroWinkler/blob/main/LICENSE">
    <img src="https://img.shields.io/github/license/maxbachmann/JaroWinkler"
         alt="GitHub license">
  </a>
</p>

<h4 align="center">JaroWinkler is a library for calculating the Jaro and Jaro-Winkler similarity. It is easy to use, is significantly faster than the alternatives benchmarked below, and is designed to integrate seamlessly with <a href="https://github.com/maxbachmann/RapidFuzz">RapidFuzz</a>.</h4>



## :zap: Quickstart
```python
>>> from jarowinkler import jaro_similarity, jarowinkler_similarity

>>> jaro_similarity("Johnathan", "Jonathan")
0.8796296296296297

>>> jarowinkler_similarity("Johnathan", "Jonathan")
0.9037037037037037
```

## 🚀 Benchmarks
The implementation is based on a novel approach to calculating the Jaro-Winkler similarity using bit-parallelism. This is significantly faster than the original approach used in other libraries. The following benchmark compares its performance with jellyfish and python-Levenshtein.

<p align="center">
<img src="https://raw.githubusercontent.com/maxbachmann/JaroWinkler/main/bench/results/JaroWinkler.svg?sanitize=true" alt="Benchmark JaroWinkler">
</p>
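
For orientation, the classical approach mentioned above can be sketched in pure Python. This is a textbook reference version, not JaroWinkler's bit-parallel implementation. The function names `jaro` and `jaro_winkler`, the 0.7 boost threshold, the four-element prefix limit, and the handling of empty inputs are assumptions of this sketch (they follow common conventions).

```python
def jaro(s1, s2):
    """Textbook Jaro similarity for two sequences (quadratic in the worst case)."""
    if not s1 and not s2:
        return 1.0  # convention chosen for this sketch
    if not s1 or not s2:
        return 0.0

    # Elements only count as matching within this distance of each other.
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    used1 = [False] * len(s1)
    used2 = [False] * len(s2)

    matches = 0
    for i, item in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len(s2), i + window + 1)
        for j in range(lo, hi):
            if not used2[j] and s2[j] == item:
                used1[i] = used2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0

    # Matched elements that line up out of order; each swap is counted twice.
    matched1 = [x for x, used in zip(s1, used1) if used]
    matched2 = [x for x, used in zip(s2, used2) if used]
    transpositions = sum(a != b for a, b in zip(matched1, matched2)) // 2

    return (matches / len(s1) + matches / len(s2)
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1, s2, prefix_weight=0.1):
    """Jaro similarity boosted for a shared prefix of up to four elements."""
    sim = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    if sim > 0.7:  # boost threshold assumed by this sketch
        sim += prefix * prefix_weight * (1 - sim)
    return sim


print(round(jaro("Johnathan", "Jonathan"), 6))          # 0.87963
print(round(jaro_winkler("Johnathan", "Jonathan"), 6))  # 0.903704
```

The inner loop scans the whole match window for every element, which is the cost a bit-parallel formulation aims to avoid.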

## ⚙️ Installation

You can install this library from [PyPI](https://pypi.org/project/jarowinkler/) with pip:
```
pip install jarowinkler
```
JaroWinkler provides binary wheels for all common platforms.

### Source builds

A source build (for example, from an sdist package) only requires a C++14-compatible compiler. You can also install directly from GitHub:
```
pip install git+https://github.com/maxbachmann/JaroWinkler.git@main
```

## 📖 Usage

All algorithms in JaroWinkler work not only with strings but with arbitrary sequences of hashable objects:
```python
from jarowinkler import jarowinkler_similarity


jarowinkler_similarity("this is an example".split(), ["this", "is", "a", "example"])
# 0.8666666666666667
```

Elements are compared by their hash, so two objects with the same hash are treated as equal. You can provide a `__hash__` method for instances of your own classes.

```python
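class MyObject:
    def __init__(self, hash_value):
        # Avoid shadowing the built-in hash() with the parameter name
        self.hash_value = hash_value

    def __hash__(self):
        return self.hash_value

# The first two elements of each sequence share a hash and therefore match
jarowinkler_similarity([MyObject(1), MyObject(2)], [MyObject(1), MyObject(2), MyObject(3)])
# 0.9111111111111111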
```
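
One way to picture the hash-based comparison described above is to replace each element by its hash before comparing. The sketch below is illustrative only: `Token` and `as_hashes` are hypothetical names, and it is an assumption that this resembles what the library does internally.

```python
class Token:
    """Hypothetical element type that hashes by `key` and ignores `label`."""

    def __init__(self, key, label):
        self.key = key
        self.label = label

    def __hash__(self):
        return hash(self.key)


def as_hashes(seq):
    """Replace every element by its hash, so only hashes get compared."""
    return [hash(item) for item in seq]


a = Token(1, "first")
b = Token(1, "second")

print(a == b)                            # False: default equality is identity
print(as_hashes([a]) == as_hashes([b]))  # True: same hash, so treated alike
```

A practical consequence is that two distinct objects whose hashes collide would be indistinguishable to a comparison of this kind.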

All algorithms accept a `score_cutoff` parameter for filtering out bad matches: a score below the cutoff is returned as `0.0`. Internally, the cutoff also lets JaroWinkler select faster implementations in some places:

```python
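from jarowinkler import jaro_similarity

# The actual similarity (about 0.88) is below the cutoff, so 0.0 is returned
jaro_similarity("Johnathan", "Jonathan", score_cutoff=0.9)
# 0.0

# The similarity meets the cutoff, so it is returned unchanged
jaro_similarity("Johnathan", "Jonathan", score_cutoff=0.85)
# 0.8796296296296297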
```
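
To illustrate why a cutoff can save work, here is a minimal sketch of one possible shortcut: a Jaro score can never exceed a bound determined by the two input lengths alone, so a comparison whose bound is already below the cutoff can be skipped. The names `jaro_upper_bound`, `similarity_with_cutoff`, and `counting_scorer` are hypothetical, and this is not a description of JaroWinkler's actual internals.

```python
def jaro_upper_bound(len1, len2):
    """Highest Jaro score two sequences of these lengths could possibly reach."""
    if len1 == 0 or len2 == 0:
        return 1.0 if len1 == len2 else 0.0
    # Best case: every element of the shorter sequence matches, no transpositions.
    m = min(len1, len2)
    return (m / len1 + m / len2 + 1.0) / 3.0


def similarity_with_cutoff(s1, s2, scorer, score_cutoff=0.0):
    """Return scorer(s1, s2), or 0.0 if the result is below score_cutoff.

    The length-based shortcut is only valid when `scorer` is a Jaro similarity.
    """
    if jaro_upper_bound(len(s1), len(s2)) < score_cutoff:
        return 0.0  # the full comparison is skipped entirely
    score = scorer(s1, s2)
    return score if score >= score_cutoff else 0.0


calls = []

def counting_scorer(s1, s2):
    """Stand-in scorer that records each call and returns a fixed score."""
    calls.append((s1, s2))
    return 0.5


# Lengths 2 and 8 cap the Jaro score at 0.75, so a cutoff of 0.9 is unreachable.
print(similarity_with_cutoff("ab", "abcdefgh", counting_scorer, score_cutoff=0.9))  # 0.0
print(len(calls))  # 0, the scorer was never invoked
```

The same wrapper also reproduces the behaviour shown above: a computed score below the cutoff is reported as `0.0`.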

JaroWinkler can be used with RapidFuzz, which provides several methods for computing string metrics on collections of inputs. JaroWinkler implements the RapidFuzz C-API, so RapidFuzz can call its functions without the usual Python call overhead, which makes this even faster.

```python
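from jarowinkler import jarowinkler_similarity
from rapidfuzz import process

# Score every query (rows) against every choice (columns)
process.cdist(["Johnathan", "Jonathan"], ["Johnathan", "Jonathan"], scorer=jarowinkler_similarity)
# array([[1.       , 0.9037037],
#        [0.9037037, 1.       ]], dtype=float32)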
```
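
Conceptually, a pairwise call like the one above scores every query against every choice. A plain-Python sketch of that idea follows; `pairwise_scores` and `toy_scorer` are hypothetical names used only for illustration, and the real `process.cdist` returns a NumPy array rather than nested lists.

```python
def pairwise_scores(queries, choices, scorer):
    """Score every query against every choice; row i belongs to queries[i]."""
    return [[scorer(q, c) for c in choices] for q in queries]


def toy_scorer(s1, s2):
    """Stand-in scorer: share of aligned positions holding equal elements."""
    longest = max(len(s1), len(s2))
    if longest == 0:
        return 1.0
    return sum(a == b for a, b in zip(s1, s2)) / longest


matrix = pairwise_scores(["abc", "abd"], ["abc", "abd"], toy_scorer)
for row in matrix:
    print(row)
```

Each cell here costs one Python-level function call, which is the overhead the C-API integration described above is meant to avoid.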

## 👍 Contributing

PRs are welcome!
- Found a bug? Report it in the form of an [issue](https://github.com/maxbachmann/JaroWinkler/issues) or, even better, fix it!
- Can you make something faster? Great! Just avoid external dependencies and remember that existing functionality should still work.
- Have something else you think is good? Do it! Just make sure that CI passes and that everything in the README still applies (interface, features, and so on).
- Have no time to code? Tell your friends and subscribers about JaroWinkler. More users, more contributions, more amazing features.

Thank you :heart:

## ⚠️ License
Copyright 2021 - present [maxbachmann](https://github.com/maxbachmann). `JaroWinkler` is free and open-source software licensed under the [MIT License](https://github.com/maxbachmann/JaroWinkler/blob/main/LICENSE).

            
