<h1 align="center">
JaroWinkler
</h1>
<p align="center">
<a href="https://github.com/maxbachmann/JaroWinkler/actions">
<img src="https://github.com/maxbachmann/JaroWinkler/workflows/Build/badge.svg"
alt="Continuous Integration">
</a>
<a href="https://pypi.org/project/jarowinkler/">
<img src="https://img.shields.io/pypi/v/jarowinkler"
alt="PyPI package version">
</a>
<a href="https://www.python.org">
<img src="https://img.shields.io/pypi/pyversions/jarowinkler"
alt="Python versions">
</a><br/>
<a href="https://github.com/maxbachmann/JaroWinkler/blob/main/LICENSE">
<img src="https://img.shields.io/github/license/maxbachmann/JaroWinkler"
alt="GitHub license">
</a>
</p>
<h4 align="center">JaroWinkler is a library to calculate the Jaro and Jaro-Winkler similarity. It is easy to use, far more performant than the alternatives, and designed to integrate seamlessly with <a href="https://github.com/maxbachmann/RapidFuzz">RapidFuzz</a>.</h4>
## :zap: Quickstart
```python
>>> from jarowinkler import jaro_similarity, jarowinkler_similarity
>>> jaro_similarity("Johnathan", "Jonathan")
0.8796296296296297
>>> jarowinkler_similarity("Johnathan", "Jonathan")
0.9037037037037037
```
## 🚀 Benchmarks
The implementation is based on a novel approach to calculating the Jaro-Winkler similarity using bit-parallelism, which is significantly faster than the original approach used in other libraries. The following benchmark compares its performance with jellyfish and python-Levenshtein.
<p align="center">
<img src="https://raw.githubusercontent.com/maxbachmann/JaroWinkler/main/bench/results/JaroWinkler.svg?sanitize=true" alt="Benchmark JaroWinkler">
</p>
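For reference, the classical (non-bit-parallel) computation can be sketched in plain Python. This is only an illustration of the textbook definition, not the library's implementation: the names `jaro` and `jaro_winkler`, the `prefix_weight` default of 0.1, the four-element prefix limit, and the 0.7 boost threshold are assumptions taken from the usual description of the metric.

```python
from typing import Hashable, Sequence


def jaro(s1: Sequence[Hashable], s2: Sequence[Hashable]) -> float:
    """Textbook Jaro similarity (illustrative, not the bit-parallel variant)."""
    if not s1 and not s2:
        return 1.0
    if not s1 or not s2:
        return 0.0

    # Elements only count as a match if they are at most `window` positions apart.
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    matched1 = [False] * len(s1)
    matched2 = [False] * len(s2)
    matches = 0
    for i, item in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len(s2), i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == item:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0

    # Matched elements that appear in a different order: two such
    # out-of-order elements make one transposition.
    seq1 = [x for x, m in zip(s1, matched1) if m]
    seq2 = [x for x, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2

    return (
        matches / len(s1)
        + matches / len(s2)
        + (matches - transpositions) / matches
    ) / 3


def jaro_winkler(
    s1: Sequence[Hashable], s2: Sequence[Hashable], prefix_weight: float = 0.1
) -> float:
    """Jaro similarity with Winkler's bonus for a shared prefix."""
    sim = jaro(s1, s2)
    # Assumed convention: boost only fairly similar pairs, and look at
    # no more than the first four elements.
    if sim > 0.7:
        prefix = 0
        for a, b in zip(s1[:4], s2[:4]):
            if a != b:
                break
            prefix += 1
        sim += prefix * prefix_weight * (1.0 - sim)
    return sim
```

Calling `jaro("Johnathan", "Jonathan")` and `jaro_winkler("Johnathan", "Jonathan")` on this sketch should agree with the Quickstart values up to floating-point rounding. The nested loop is quadratic in the worst case, which is the cost a bit-parallel approach is meant to reduce.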
## ⚙️ Installation
You can install this library from [PyPI](https://pypi.org/project/jarowinkler/) with pip:
```
pip install jarowinkler
```
JaroWinkler provides binary wheels for all common platforms.
### Source builds
For a source build (for example, from an sdist package) you only need a C++14-compatible compiler. You can also install directly from GitHub:
```
pip install git+https://github.com/maxbachmann/JaroWinkler.git@main
```
## 📖 Usage
All algorithms in JaroWinkler work not only with strings but with arbitrary sequences of hashable objects:
```python
from jarowinkler import jarowinkler_similarity
jarowinkler_similarity("this is an example".split(), ["this", "is", "a", "example"])
# 0.8666666666666667
```
As long as two objects have the same hash, they are treated as matching. You can provide a `__hash__` method for instances of your own classes:
```python
class MyObject:
    def __init__(self, hash):
        self.hash = hash

    def __hash__(self):
        return self.hash

jarowinkler_similarity([MyObject(1), MyObject(2)], [MyObject(1), MyObject(2), MyObject(3)])
# 0.9111111111111111
```
All algorithms accept a `score_cutoff` parameter that filters out bad matches: any score below the cutoff is returned as `0.0`. Knowing the cutoff also lets JaroWinkler select faster implementations in some places:
```python
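from jarowinkler import jaro_similarity
# The score (about 0.88) is below the cutoff, so 0.0 is returned.
jaro_similarity("Johnathan", "Jonathan", score_cutoff=0.9)
# 0.0
# The score meets the cutoff and is returned unchanged.
jaro_similarity("Johnathan", "Jonathan", score_cutoff=0.85)
# 0.8796296296296297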
```
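One way to use this convention is to keep only the candidates whose score survives the cutoff. The sketch below is a minimal illustration under stated assumptions: `matches_above` is a hypothetical helper, and `prefix_similarity` is a toy stand-in scorer so that the example runs without the library installed. In practice you would pass `jaro_similarity` or `jarowinkler_similarity` as the `scorer`.

```python
from typing import Callable, Iterable, List, Tuple


def prefix_similarity(s1: str, s2: str, score_cutoff: float = 0.0) -> float:
    """Hypothetical stand-in scorer: shared prefix length over the longer length."""
    longest = max(len(s1), len(s2))
    if longest == 0:
        return 1.0
    common = 0
    for a, b in zip(s1, s2):
        if a != b:
            break
        common += 1
    score = common / longest
    # Same convention as described in the text: below the cutoff, report 0.0.
    return score if score >= score_cutoff else 0.0


def matches_above(
    query: str,
    choices: Iterable[str],
    scorer: Callable[..., float],
    score_cutoff: float,
) -> List[Tuple[str, float]]:
    """Return (choice, score) pairs that passed the cutoff, best first."""
    scored = (
        (choice, scorer(query, choice, score_cutoff=score_cutoff))
        for choice in choices
    )
    # A score of 0.0 is read as "filtered out by the cutoff".
    kept = [(choice, score) for choice, score in scored if score > 0.0]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


result = matches_above(
    "Johnathan", ["Johnny", "Jonathan", "Max"], prefix_similarity, 0.3
)
print(result)
```

Passing the cutoff down to the scorer, rather than filtering afterwards, is what gives the scorer the chance to take a faster path for hopeless candidates.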
JaroWinkler can be used with RapidFuzz, which provides multiple methods to compute string metrics on collections of inputs. JaroWinkler implements the RapidFuzz C-API, which lets RapidFuzz call its functions without the usual Python call overhead and makes this even faster.
```python
from jarowinkler import jarowinkler_similarity
from rapidfuzz import process

process.cdist(["Johnathan", "Jonathan"], ["Johnathan", "Jonathan"], scorer=jarowinkler_similarity)
# array([[1.       , 0.9037037],
#        [0.9037037, 1.       ]], dtype=float32)
```
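Conceptually, the matrix returned by `process.cdist` has one row per query and one column per choice. A plain-Python sketch of that layout follows; `pairwise_scores` and `exact_match` are hypothetical names used only for illustration, and RapidFuzz itself computes the result in native code and returns a NumPy array rather than nested lists.

```python
from typing import Callable, List, Sequence


def pairwise_scores(
    queries: Sequence[str],
    choices: Sequence[str],
    scorer: Callable[[str, str], float],
) -> List[List[float]]:
    """Row i, column j holds scorer(queries[i], choices[j])."""
    return [[scorer(query, choice) for choice in choices] for query in queries]


def exact_match(a: str, b: str) -> float:
    """Toy stand-in scorer: 1.0 for identical inputs, 0.0 otherwise."""
    return 1.0 if a == b else 0.0


names = ["Johnathan", "Jonathan"]
matrix = pairwise_scores(names, names, exact_match)
print(matrix)
```

Every scorer call in this sketch goes through the Python interpreter, which is the per-call overhead the C-API integration avoids.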
## 👍 Contributing
PRs are welcome!
- Found a bug? Report it as an [issue](https://github.com/maxbachmann/JaroWinkler/issues) or, even better, fix it!
- Can you make something faster? Great! Just avoid external dependencies and remember that existing functionality should still work.
- Something else that you think is good? Do it! Just make sure that CI passes and everything from the README still applies (interface, features, and so on).
- Have no time to code? Tell your friends and subscribers about JaroWinkler. More users, more contributions, more amazing features.
Thank you :heart:
## ⚠️ License
Copyright 2021 - present [maxbachmann](https://github.com/maxbachmann). `JaroWinkler` is free and open-source software licensed under the [MIT License](https://github.com/maxbachmann/JaroWinkler/blob/main/LICENSE).
Raw data
{
"_id": null,
"home_page": "https://github.com/maxbachmann/JaroWinkler",
"name": "jarowinkler",
"maintainer": "",
"docs_url": null,
"requires_python": ">=3.8",
"maintainer_email": "",
"keywords": "string,comparison,edit-distance",
"author": "Max Bachmann",
"author_email": "pypi@maxbachmann.de",
"download_url": "https://files.pythonhosted.org/packages/e0/91/a3111ac8c11b52497840fdc0b0256aab9e9c014817adb79921f8f492695a/jarowinkler-2.0.1.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "MIT",
"summary": "library for fast approximate string matching using Jaro and Jaro-Winkler similarity",
"version": "2.0.1",
"project_urls": {
"Homepage": "https://github.com/maxbachmann/JaroWinkler"
},
"split_keywords": [
"string",
"comparison",
"edit-distance"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "e8efe6a3a716e5f5fbb32a55ab19384e62427907a37574dd75c4502b09146223",
"md5": "44d1bd5da4af4299d4ee317ff01b10bb",
"sha256": "2c04d8e761caa643eb9801440ccba12498b958f53146f236aa73a884e66ef23c"
},
"downloads": -1,
"filename": "jarowinkler-2.0.1-py3-none-any.whl",
"has_sig": false,
"md5_digest": "44d1bd5da4af4299d4ee317ff01b10bb",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.8",
"size": 5581,
"upload_time": "2023-11-03T10:32:07",
"upload_time_iso_8601": "2023-11-03T10:32:07.697073Z",
"url": "https://files.pythonhosted.org/packages/e8/ef/e6a3a716e5f5fbb32a55ab19384e62427907a37574dd75c4502b09146223/jarowinkler-2.0.1-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "e091a3111ac8c11b52497840fdc0b0256aab9e9c014817adb79921f8f492695a",
"md5": "e785475492eedbe033156cfa351e5f26",
"sha256": "7640c79f8d2d5e9eed6691cb49e3018a23b2319daad9a2178df253368b5432b7"
},
"downloads": -1,
"filename": "jarowinkler-2.0.1.tar.gz",
"has_sig": false,
"md5_digest": "e785475492eedbe033156cfa351e5f26",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.8",
"size": 6368,
"upload_time": "2023-11-03T10:32:11",
"upload_time_iso_8601": "2023-11-03T10:32:11.123457Z",
"url": "https://files.pythonhosted.org/packages/e0/91/a3111ac8c11b52497840fdc0b0256aab9e9c014817adb79921f8f492695a/jarowinkler-2.0.1.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2023-11-03 10:32:11",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "maxbachmann",
"github_project": "JaroWinkler",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"lcname": "jarowinkler"
}