ngramratio


| Field | Value |
| --- | --- |
| Name | ngramratio |
| Version | 0.0.5 |
| Home page | https://github.com/gi-ba-bu/python-n-gram-ratio |
| Summary | N-grams based similarity score |
| Upload time | 2022-05-14 19:15:59 |
| Author | Giacomo Baldo |
| Requires Python | >=3.6 |
| Requirements | No requirements were recorded. |
| Travis-CI | No Travis. |
| Coveralls test coverage | No coveralls. |

# python-ngramratio

A method for similarity scoring of two strings.

The method, `nratio`, belongs to the class `SequenceMatcherExtended`, which extends the `SequenceMatcher` class of the [difflib package](https://docs.python.org/3/library/difflib.html). In particular, `nratio` (a method of `SequenceMatcherExtended`) is an augmentation of `ratio` (a method of `SequenceMatcher`).

`ngramratio` is pronounced "n gram ratio". The library uses n-grams to compute a similarity score as the ratio of the number of matched characters to the total number of characters. See below for more details.

## Motivation

To compute a similarity score based on matching n-grams (with n>=1 chosen by the user) rather than matching single characters (as in the case of the `ratio` method).

## Installation

To install the Python library run:

    pip install ngramratio

The library will be installed as `ngramratio` to `bin` on
Linux (e.g. `/usr/bin`); or as `ngramratio.exe` to `Scripts` in your
Python installation on Windows (e.g.
`C:\Python27\Scripts\ngramratio.exe`).

You may consider installing the library only for the current user:

    pip install ngramratio --user

In this case the library will be installed to
`~/.local/bin/ngramratio` on Linux and to
`%APPDATA%\Python\Scripts\ngramratio.exe` on Windows.

## Library usage

The module provides a method, `nratio`, which takes an integer (the minimum n-gram length, i.e. the number of consecutive characters, required for a match) and returns a similarity index (a float in [0, 1]).

First step: initialize an object of class `SequenceMatcherExtended`, specifying the two strings to be compared:

```
    >>> from ngramratio import ngramratio

    >>> SequenceMatcherExtended = ngramratio.SequenceMatcherExtended

    >>> string_one = "ab cde"
    >>> string_two = "bcde"

    >>> # The "None" arguments prevent any character from being considered junk;
    >>> # see the difflib documentation for more information on this.
    >>> s = SequenceMatcherExtended(None, string_one, string_two, None)
```

Second step: apply the `ratio` and `nratio` methods and compare similarity scores:

```
    >>> s.ratio()
    >>> # Matches any character. Matches: "b" (length 1), "cde"(length 3). Score: (3+1)*2/10.
    0.8
    >>> s.nratio(1)
    >>> # Matches substring of length 1 or more. It replicates `ratio()`'s functionality.
    0.8
    >>> s.nratio(2)
    >>> # Matches substring of length 2 or more. Matches: "cde"(length 3). Score: 3*2/10.
    0.6
    >>> s.nratio(3)
    >>> # Matches substring of length 3 or more. Matches: "cde"(length 3). Score: 3*2/10.
    0.6
    >>> s.nratio(4)
    >>> # Matches substring of length 3 or more. Score 0/10.
    0.0
```

The similarity score is computed as the number of matched characters (m) multiplied by two and divided by the total number of characters (T) of the two strings, i.e. similarity score = 2m/T. For the strings above, T = 6 + 4 = 10. Note that in Python 3 the `/` operator always returns a float.
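
As a rough illustration of the formula, the sketch below reproduces the scores from the example above using only the standard-library `difflib.SequenceMatcher`. It is not the library's implementation: the helper name `nratio_sketch` is hypothetical, and the sketch assumes that filtering the blocks returned by `get_matching_blocks()` by their size is an adequate stand-in for what `nratio` does.

```python
from difflib import SequenceMatcher


def nratio_sketch(a: str, b: str, n: int = 1) -> float:
    """Hypothetical helper: return 2*m/T, where m counts only the
    characters in matching blocks of at least n characters and T is
    the combined length of both strings."""
    total = len(a) + len(b)
    if total == 0:
        # Same convention as difflib's ratio() for two empty sequences.
        return 1.0
    # isjunk=None and autojunk=False: no character is treated as junk.
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    matched = sum(
        block.size
        for block in matcher.get_matching_blocks()
        if block.size >= n
    )
    return 2.0 * matched / total


if __name__ == "__main__":
    for n in (1, 2, 3, 4):
        # Prints 0.8, 0.6, 0.6 and 0.0 for n = 1, 2, 3 and 4.
        print(n, nratio_sketch("ab cde", "bcde", n))
```

With `n=1` every matching block counts, so the result coincides with `SequenceMatcher.ratio()`; larger values of `n` discard short, possibly accidental, matches such as the single `"b"`.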

## Testing in a virtual environment

This project uses the [pytest](https://docs.pytest.org/) testing
framework with [tox](https://tox.readthedocs.io/) and [Docker](https://docs.docker.com/language/) to automate testing in
different Python environments. Tests are stored in the `test/`
folder.

To test a specific Python version, for example 3.6, change the last few characters of the `startTest.sh` script to **py36** and change the image to Python 3.6 on line 4 of the `docker-compose.yaml` file.

To run the tests, run `bash _scripts/startTest.sh`. This starts a Docker container using the specified Python image. After testing, or before testing a different Python version, run `bash _scripts/teardown.sh` to remove the Docker container.

The library has been tested successfully for Python >= 3.6.

## Testing on your local machine without a virtual environment

You can use `tox` directly on your local machine. Make sure to install `tox` and `pytest` before testing.

On Linux `tox` expects to find executables like `python3.6`, `python3.10` etc. On Windows it looks for
`C:\Python36\python.exe` and
`C:\Python310\python.exe` respectively.

To test a specific Python environment, use the `-e` option. For example, to
test against Python 3.7 run:

    tox -e py37

in the root of the project source tree.

To fix code formatting (this will install `pre-commit` as a dependency), run:

    tox -e lint

See the `tox.ini` file in the repository to learn more about the testing instructions being used.

## Contributions

Contributions should include tests and an explanation for the changes
they propose. Documentation (examples, docstrings, README.md) should be
updated accordingly.

            

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/gi-ba-bu/python-n-gram-ratio",
    "name": "ngramratio",
    "maintainer": "",
    "docs_url": null,
    "requires_python": ">=3.6",
    "maintainer_email": "",
    "keywords": "",
    "author": "Giacomo Baldo",
    "author_email": "baldogiacomophd@gmail.com",
    "download_url": "https://files.pythonhosted.org/packages/51/41/3e9c81cbf6fbde6dad1ba3c266da35b54bdf4cccfbb897f873baa54791b8/ngramratio-0.0.5.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "",
    "summary": "N-grams based similarity score",
    "version": "0.0.5",
    "split_keywords": [],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "md5": "97f71bc42fde05b76a88621dc831ab23",
                "sha256": "e446c22abb25d7a1dae13c0e124d5f649efd30c8870b0cd0cd96104e0729ebd1"
            },
            "downloads": -1,
            "filename": "ngramratio-0.0.5-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "97f71bc42fde05b76a88621dc831ab23",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.6",
            "size": 4682,
            "upload_time": "2022-05-14T19:15:57",
            "upload_time_iso_8601": "2022-05-14T19:15:57.343370Z",
            "url": "https://files.pythonhosted.org/packages/83/b9/a8340e830cb8ab6c441a1a53d381ff2a921eb4e46f37300432cb0e134fa3/ngramratio-0.0.5-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "md5": "2d3aabe194980e5cfcbd788ecea64738",
                "sha256": "b955011987bc9d0cf59aec36e3d2cfd26d2dd941f9389ddc9acead356a9b9f8f"
            },
            "downloads": -1,
            "filename": "ngramratio-0.0.5.tar.gz",
            "has_sig": false,
            "md5_digest": "2d3aabe194980e5cfcbd788ecea64738",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.6",
            "size": 4779,
            "upload_time": "2022-05-14T19:15:59",
            "upload_time_iso_8601": "2022-05-14T19:15:59.246776Z",
            "url": "https://files.pythonhosted.org/packages/51/41/3e9c81cbf6fbde6dad1ba3c266da35b54bdf4cccfbb897f873baa54791b8/ngramratio-0.0.5.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2022-05-14 19:15:59",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "github_user": "gi-ba-bu",
    "github_project": "python-n-gram-ratio",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "tox": true,
    "lcname": "ngramratio"
}
        