# python-ngramratio
A method for similarity scoring of two strings.
The method, `nratio`, belongs to the class `SequenceMatcherExtended`, which extends the `SequenceMatcher` class of the [difflib package](https://docs.python.org/3/library/difflib.html). In particular, `nratio` (a method of `SequenceMatcherExtended`) is an augmentation of `ratio` (a method of `SequenceMatcher`).
`ngramratio` is pronounced "n gram ratio". The library uses n-grams to compute a similarity score as a ratio: the number of matched characters divided by the total number of characters. See below for more details.
## Motivation
To compute a similarity score based on matching n-grams (with n>=1 chosen by the user) rather than matching single characters (as in the case of the `ratio` method).
## Installation
To install the Python library, run:

    pip install ngramratio
The library will be installed as `ngramratio` to `bin` on
Linux (e.g. `/usr/bin`); or as `ngramratio.exe` to `Scripts` in your
Python installation on Windows (e.g.
`C:\Python27\Scripts\ngramratio.exe`).
You may consider installing the library only for the current user:

    pip install ngramratio --user
In this case the library will be installed to
`~/.local/bin/ngramratio` on Linux and to
`%APPDATA%\Python\Scripts\ngramratio.exe` on Windows.
## Library usage
The module provides a method, `nratio`, which takes an integer (the minimum n-gram length required by the user, i.e. the number of consecutive characters that must match) and returns a similarity index (a float in [0,1]).
First step: initialize an object of class `SequenceMatcherExtended`, specifying the two strings to be compared:
```
>>> from ngramratio import ngramratio
>>> SequenceMatcherExtended = ngramratio.SequenceMatcherExtended
>>> string_one = "ab cde"
>>> string_two = "bcde"
>>> s = SequenceMatcherExtended(None, string_one, string_two, None)
>>> # The "None" arguments prevent any character from being considered junk;
>>> # see the difflib documentation for more information on this.
```
Second step: apply the `ratio` and `nratio` methods and compare similarity scores:
```
>>> # Matches any character. Matches: "b" (length 1), "cde" (length 3). Score: (3+1)*2/10.
>>> s.ratio()
0.8
>>> # Matches substrings of length 1 or more. It replicates `ratio()`'s functionality.
>>> s.nratio(1)
0.8
>>> # Matches substrings of length 2 or more. Matches: "cde" (length 3). Score: 3*2/10.
>>> s.nratio(2)
0.6
>>> # Matches substrings of length 3 or more. Matches: "cde" (length 3). Score: 3*2/10.
>>> s.nratio(3)
0.6
>>> # Matches substrings of length 4 or more. No matches. Score: 0*2/10.
>>> s.nratio(4)
0.0
```
The similarity score is computed as the number of matched characters (m) multiplied by two and divided by the total number of characters (T) in the two strings, i.e. similarity score = 2m/T. For `nratio(n)`, m counts only the characters in matched substrings of length n or more. Note that in Python 3 the `/` operator always returns a float.
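To make the 2m/T formula concrete, here is a minimal sketch that reproduces the scores above using only the standard `difflib.SequenceMatcher`. It is not the library's implementation: the helper name `nratio_sketch` is hypothetical, and it assumes the score can be obtained by keeping only those matching blocks whose length is at least n. The real `nratio` may search for matches differently and so may disagree on other inputs.

```python
from difflib import SequenceMatcher


def nratio_sketch(a, b, n=1):
    """Hypothetical illustration of 2m/T; not ngramratio's actual code."""
    # isjunk=None and autojunk=False: no character is treated as junk.
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    # m: characters in matching blocks of length >= n (assumed filtering rule).
    m = sum(block.size for block in matcher.get_matching_blocks() if block.size >= n)
    # T: total number of characters in both strings.
    t = len(a) + len(b)
    # Two empty strings are treated as identical, as difflib's ratio() does.
    return 2.0 * m / t if t else 1.0


for n in (1, 2, 3, 4):
    print(n, nratio_sketch("ab cde", "bcde", n))
```

For "ab cde" and "bcde", difflib reports the matching blocks "b" (length 1) and "cde" (length 3), so with T = 10 the sketch gives 8/10 for n = 1, 6/10 for n = 2 and n = 3, and 0/10 for n = 4.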
## Testing in a virtual environment
This project uses the [pytest](https://docs.pytest.org/) testing
framework with [tox](https://tox.readthedocs.io/) and [docker](https://docs.docker.com/language/) to automate testing in
different Python environments. Tests are stored in the `test/`
folder.
To test a specific Python version, for example version 3.6, edit the last few characters of the `startTest.sh` script to **py36** and change the image to Python 3.6 on line 4 of the `docker-compose.yaml` file.
To run the tests, run `bash _scripts/startTest.sh`. This starts a docker container using the specified Python image. After testing, or before testing a different Python version, run `bash _scripts/teardown.sh` to remove the docker container.
The library has been tested successfully for Python >= 3.6.
## Testing on your local machine without a virtual environment
You can use `tox` directly on your local machine. Make sure to install `tox` and `pytest` before testing.
On Linux `tox` expects to find executables like `python3.6`, `python3.10` etc. On Windows it looks for
`C:\Python36\python.exe` and
`C:\Python310\python.exe` respectively.
To test a specific Python environment, use the `-e` option. For example, to
test against Python 3.7, run:

    tox -e py37

in the root of the project source tree.
To fix code formatting (this will install `pre-commit` as a dependency), run:

    tox -e lint
See the `tox.ini` file in the repository to learn more about the testing instructions being used.
## Contributions
Contributions should include tests and an explanation for the changes
they propose. Documentation (examples, docstrings, README.md) should be
updated accordingly.
Raw data
{
"_id": null,
"home_page": "https://github.com/gi-ba-bu/python-n-gram-ratio",
"name": "ngramratio",
"maintainer": "",
"docs_url": null,
"requires_python": ">=3.6",
"maintainer_email": "",
"keywords": "",
"author": "Giacomo Baldo",
"author_email": "baldogiacomophd@gmail.com",
"download_url": "https://files.pythonhosted.org/packages/51/41/3e9c81cbf6fbde6dad1ba3c266da35b54bdf4cccfbb897f873baa54791b8/ngramratio-0.0.5.tar.gz",
"platform": null,
"description": "# python-ngramratio\n\nA method for similarity scoring of two strings.\n\nThe method, namely `nratio`, belongs to the class `SequenceMatcherExtended`, which is an extension of the `SequenceMatcher` class of the [difflib package](https://docs.python.org/3/library/difflib.html). In particular, `nratio` (method of `SequenceMatcherExtended`) is an augmenation of `ratio` (method of `SequenceMatcher`).\n\n`ngramratio` is to be pronounced as \"n gram ratio\". The library uses n-grams to find a similarity score via a division (ratio) of the number of matched characters by the total number of characters. See below for more details.\n\n## Motivation\n\nTo compute a similarity score based on matching n-grams (with n>=1 chosen by the user) rather than matching single characters (as in the case of the `ratio` method).\n\n## Installation\n\nTo install the Python library run:\n\n pip install ngramratio\n\nThe library will be installed as `ngramratio` to `bin` on\nLinux (e.g. `/usr/bin`); or as `ngramratio.exe` to `Scripts` in your\nPython installation on Windows (e.g.\n`C:\\Python27\\Scripts\\ngramratio.exe`).\n\nYou may consider installing the library only for the current user:\n\n pip install ngramratio --user\n\nIn this case the library will be installed to\n`~/.local/bin/ngramratio` on Linux and to\n`%APPDATA%\\Python\\Scripts\\ngramratio.exe` on Windows.\n\n## Library usage\n\nThe module provides a method, `nratio`, which takes an integer number (the user's required minimum n-gram length, i.e. 
number of consecutive characters, to be matched) and outputs a similarity index (float number in [0,1]).\n\nFirst step: initialize an object of class SequenceMatcherExtended specifying the two strings to be compared:\n\n```\n >>> import ngramratio from ngramratio\n\n >>> SequenceMatcherExtended = ngrmaratio.SequenceMatcherExtended\n\n >>> string_one = \"ab cde\"\n >>> string_two = \"bcde\"\n\n >>> s = SequenceMatcherExtended(None, string_one, string_two, None)\n >>> # The \"None\" arguments prevents from any character being considered junk..\n >>> # .. see the difflib documentation for more information on this.\n```\n\nSecond step: apply the `ratio` and `nratio` methods and compare similarity scores:\n\n```\n >>> s.ratio()\n >>> # Matches any character. Matches: \"b\" (length 1), \"cde\"(length 3). Score: (3+1)*2/10.\n 0.8\n >>> s.nratio(1)\n >>> # Matches substring of length 1 or more. It replicates `ratio()`'s functionality.\n 0.8\n >>> s.nratio(2)\n >>> # Matches substring of length 2 or more. Matches: \"cde\"(length 3). Score: 3*2/10.\n 0.6\n >>> s.nratio(3)\n >>> # Matches substring of length 3 or more. Matches: \"cde\"(length 3). Score: 3*2/10.\n 0.6\n >>> s.nratio(4)\n >>> # Matches substring of length 3 or more. Score 0/10.\n 0.0\n```\n\nThe similarity score is computed as `the number of characters matched` (m) mutiplied by `two` (2) and divided by `the total numer of characters` (T) of the two strings, i.e. similarity score = 2m/T. Note that Python always returns a float upon computing a division.\n\n## Testing in a virtual environment\n\nThis project uses [pytest](https://docs.pytest.org/) testing\nframework with [tox](https://tox.readthedocs.io/) and [docker](https://docs.docker.com/language/) to automate testing in\ndifferent python environments. 
Tests are stored in the `test/`\nfolder.\n\nTo test a specific python version, for example version 3.6, edit the last few characters of the `startTest.sh` script to **py36** AND change the image to python 3.6 on line 4 of the `docker-compose.yaml` file.\n\nTo run tests, run `bash _scripts/startTest.sh`. This will start a docker container using the specified python image. After testing, or before testing a different python version, run `bash _scripts/teardown.sh` to remove the docker container.\n\nThe library has been tested successfully for python >= 3.6.\n\n## Testing on your local machine with no v.e.\n\nYou can use `tox` directly in your local machine. Make sure to install `tox`, `pytest` before testing.\n\nOn Linux `tox` expects to find executables like `python3.6`, `python3.10` etc. On Windows it looks for\n`C:\\Python36\\python.exe` and\n`C:\\Python310\\python.exe` respectively.\n\nTo test a specific Python environment, use the `-e` option. For example, to\ntest against Python 3.7 run:\n\n tox -e py37\n\nin the root of the project source tree.\n\nTo fix code formatting (this will install `pre-commit` as a dependency), run:\n\n tox -e lint\n\nSee the `tox.ini` file in the repository to learn more about the testing instructions being used.\n\n## Contributions\n\nContributions should include tests and an explanation for the changes\nthey propose. Documentation (examples, docstrings, README.md) should be\nupdated accordingly.\n",
"bugtrack_url": null,
"license": "",
"summary": "N-grams based similarity score",
"version": "0.0.5",
"split_keywords": [],
"urls": [
{
"comment_text": "",
"digests": {
"md5": "97f71bc42fde05b76a88621dc831ab23",
"sha256": "e446c22abb25d7a1dae13c0e124d5f649efd30c8870b0cd0cd96104e0729ebd1"
},
"downloads": -1,
"filename": "ngramratio-0.0.5-py3-none-any.whl",
"has_sig": false,
"md5_digest": "97f71bc42fde05b76a88621dc831ab23",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.6",
"size": 4682,
"upload_time": "2022-05-14T19:15:57",
"upload_time_iso_8601": "2022-05-14T19:15:57.343370Z",
"url": "https://files.pythonhosted.org/packages/83/b9/a8340e830cb8ab6c441a1a53d381ff2a921eb4e46f37300432cb0e134fa3/ngramratio-0.0.5-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"md5": "2d3aabe194980e5cfcbd788ecea64738",
"sha256": "b955011987bc9d0cf59aec36e3d2cfd26d2dd941f9389ddc9acead356a9b9f8f"
},
"downloads": -1,
"filename": "ngramratio-0.0.5.tar.gz",
"has_sig": false,
"md5_digest": "2d3aabe194980e5cfcbd788ecea64738",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.6",
"size": 4779,
"upload_time": "2022-05-14T19:15:59",
"upload_time_iso_8601": "2022-05-14T19:15:59.246776Z",
"url": "https://files.pythonhosted.org/packages/51/41/3e9c81cbf6fbde6dad1ba3c266da35b54bdf4cccfbb897f873baa54791b8/ngramratio-0.0.5.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2022-05-14 19:15:59",
"github": true,
"gitlab": false,
"bitbucket": false,
"github_user": "gi-ba-bu",
"github_project": "python-n-gram-ratio",
"travis_ci": false,
"coveralls": false,
"github_actions": false,
"tox": true,
"lcname": "ngramratio"
}