cer


Name: cer
Version: 1.2.0
Home page: https://github.com/BramVanroy/CharacTER
Summary: Translation Edit Rate on the character level
Upload time: 2022-12-06 11:02:00
Docs URL: None
Author: Bram Vanroy
Requires Python: >=3.7
License: GPLv3
Keywords: machine-translation, machine-translation-evaluation, evaluation, mt
Requirements: No requirements were recorded.

# CharacTER

CharacTER: Translation Edit Rate on Character Level

CharacTer (cer) is a character-level metric inspired by the widely used translation edit rate (TER). It is defined
as the minimum number of character edits required to change a hypothesis until it exactly matches the reference,
normalized by the length of the hypothesis sentence. CharacTer calculates the edit distance on the character level
while performing the shift edit on the word level. Unlike TER, which requires an exact match, CharacTer considers a
hypothesis word to match a reference word, and allows it to be shifted, if the edit distance between the two is below
a threshold value. The Levenshtein distance between the reference and the shifted hypothesis sequence is then computed
on the character level. In addition, the edit distance is normalized by the length of the hypothesis sequence rather
than the reference sequence, which counters the tendency of shorter translations to achieve lower TER.
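
The normalization described above can be illustrated with a minimal sketch. It is not the package's implementation: it leaves out the word-level shift step entirely, and it assumes that words are joined with single spaces before counting characters. The names `levenshtein` and `simplified_character_ter` are hypothetical helpers written for this illustration.

```python
from typing import Sequence


def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance (insert, delete, substitute), row by row."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # delete char_a
                    current[j - 1] + 1,      # insert char_b
                    previous[j - 1] + cost,  # substitute or match
                )
            )
        previous = current
    return previous[-1]


def simplified_character_ter(hyp_words: Sequence[str], ref_words: Sequence[str]) -> float:
    """Edit distance divided by hypothesis length. Assumption: no shift step."""
    hyp = " ".join(hyp_words)
    ref = " ".join(ref_words)
    if not hyp:
        raise ValueError("hypothesis must not be empty")
    return levenshtein(hyp, ref) / len(hyp)


# 5 character edits over a 15-character hypothesis
score = simplified_character_ter(["i", "like", "your", "bag"], ["i", "like", "their", "bags"])
```

Because this sketch never shifts words, it can only agree with the full metric on sentence pairs where no shift is needed. Dividing by the hypothesis length rather than the reference length is the part that carries over.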


## Modifications by Bram Vanroy

Bram Vanroy made some changes to this package that do not affect the result of the metric but should improve
usability. The code has been rewritten to avoid the need for custom C++ code (the [C implementation
of Levenshtein](https://github.com/maxbachmann/Levenshtein) is used instead, alongside an LRU cache) and to make
functions more accessible and readable, and typing information has been added. Packaging has also been improved to
make uploading to PyPI easy. As a result, the package can now be installed via pip:

```shell
pip install cer
```

The main functions are `calculate_cer` and `calculate_cer_corpus`, which both expect tokenized input. The first
argument contains the hypotheses and the second the references.

```python
from cer import calculate_cer

cer_score = calculate_cer(["i", "like", "your", "bag"], ["i", "like", "their", "bags"])
print(cer_score)
# 0.3333333333333333
```

`calculate_cer_corpus` is similar, but it expects a sequence of sequences of words, that is, a corpus of tokenized
sentences. It reports summary statistics of the sentence-level CER scores that were calculated.

```python
from cer import calculate_cer_corpus

hyps = ["this week the saudis denied information published in the new york times",
        "this is in fact an estimate"]
refs = ["saudi arabia denied this week information published in the american new york times",
        "this is actually an estimate"]

hyps = [sent.split() for sent in hyps]
refs = [sent.split() for sent in refs]

cer_corpus_score = calculate_cer_corpus(hyps, refs)
print(cer_corpus_score)
# {
#     'count': 2,
#     'mean': 0.3127282211789254,
#     'median': 0.3127282211789254,
#     'std': 0.07561653111280243,
#     'min': 0.25925925925925924,
#     'max': 0.36619718309859156
# }
```
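
As a rough illustration of how such a summary relates to the sentence-level scores, the sketch below aggregates a list of scores with Python's `statistics` module. The helper name `summarize_scores` is hypothetical and this is not the package's code. Using the sample standard deviation (`statistics.stdev`) is an assumption, chosen because it matches the `std` value shown above for the two example scores.

```python
import statistics
from typing import Dict, Sequence


def summarize_scores(scores: Sequence[float]) -> Dict[str, float]:
    """Summarize sentence-level scores (hypothetical helper, not part of cer)."""
    if not scores:
        raise ValueError("at least one score is required")
    return {
        "count": len(scores),
        "mean": statistics.mean(scores),
        "median": statistics.median(scores),
        # Sample standard deviation needs at least two values; fall back to 0.0.
        "std": statistics.stdev(scores) if len(scores) > 1 else 0.0,
        "min": min(scores),
        "max": max(scores),
    }


# The two sentence-level scores implied by 'min' and 'max' in the example above
summary = summarize_scores([0.25925925925925924, 0.36619718309859156])
```

With only two sentences, the mean and median coincide and `min` and `max` are the two individual scores, which is why the example output repeats the same value for `mean` and `median`.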

In addition to the Python interface, a command-line entry point is also installed, which you can use as
`calculate-cer`. It calculates aggregate scores on the corpus level (similar to `calculate_cer_corpus`)
based on two input files: one with hypotheses and one with references, each with one sentence per line. Results are
written to stdout.

```shell
usage: calculate-cer [-h] [-r] fhyp fref

CharacTER: Character Level Translation Edit Rate

positional arguments:
  fhyp                Path to file containing hypothesis sentences. One per line.
  fref                Path to file containing reference sentences. One per line.

optional arguments:
  -h, --help          show this help message and exit
  -r, --per_sentence  Whether to output CER scores per ref/hyp pair in addition to corpus-level statistics
```
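
For readers who prefer to stay in Python, the same one-sentence-per-line file layout can be loaded into the tokenized form that `calculate_cer_corpus` expects. The sketch below is an illustration only: `load_parallel` is a hypothetical helper, and whitespace splitting is an assumption carried over from the `sent.split()` example above, not a statement about how the command-line tool tokenizes its input.

```python
from typing import List, Tuple


def read_tokenized(path: str) -> List[List[str]]:
    """Read one sentence per line and split each sentence on whitespace."""
    with open(path, encoding="utf-8") as fh:
        return [line.split() for line in fh.read().splitlines()]


def load_parallel(fhyp: str, fref: str) -> Tuple[List[List[str]], List[List[str]]]:
    """Load hypothesis and reference files and check that they line up."""
    hyps = read_tokenized(fhyp)
    refs = read_tokenized(fref)
    if len(hyps) != len(refs):
        raise ValueError(
            f"line count mismatch: {len(hyps)} hypotheses vs {len(refs)} references"
        )
    return hyps, refs
```

The returned lists can then be passed as `calculate_cer_corpus(hyps, refs)`, with hypotheses first and references second.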

## License
[GPLv3](LICENSE)


            

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/BramVanroy/CharacTER",
    "name": "cer",
    "maintainer": "",
    "docs_url": null,
    "requires_python": ">=3.7",
    "maintainer_email": "",
    "keywords": "machine-translation machine-translation-evaluation evaluation mt",
    "author": "Bram Vanroy",
    "author_email": "bramvanroy@hotmail.com",
    "download_url": "https://files.pythonhosted.org/packages/bc/8a/9799debef342e3819a198f135ba2de5c278c13fefffbfb4ce7dd131b7920/cer-1.2.0.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "GPLv3",
    "summary": "Translation Edit Rate on the character level",
    "version": "1.2.0",
    "split_keywords": [
        "machine-translation",
        "machine-translation-evaluation",
        "evaluation",
        "mt"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "md5": "8755f18d09a3dcd1d234bf81d0aa697e",
                "sha256": "a978a7a178bc369b0d32084ef5768b9636fd4e400fa7bbad1452f7f013a5107f"
            },
            "downloads": -1,
            "filename": "cer-1.2.0-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "8755f18d09a3dcd1d234bf81d0aa697e",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.7",
            "size": 17645,
            "upload_time": "2022-12-06T11:01:58",
            "upload_time_iso_8601": "2022-12-06T11:01:58.324038Z",
            "url": "https://files.pythonhosted.org/packages/06/72/2993280000b13dd8b3b6be348807f01fe84997d5d42333fe0516685507c6/cer-1.2.0-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "md5": "c4e64b1b384a7c314e002604fa451644",
                "sha256": "485cb7ea2e6cbafcaed2147905bd06c3a5453cba8e598fe5b5976c625ba5904e"
            },
            "downloads": -1,
            "filename": "cer-1.2.0.tar.gz",
            "has_sig": false,
            "md5_digest": "c4e64b1b384a7c314e002604fa451644",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.7",
            "size": 18193,
            "upload_time": "2022-12-06T11:02:00",
            "upload_time_iso_8601": "2022-12-06T11:02:00.498890Z",
            "url": "https://files.pythonhosted.org/packages/bc/8a/9799debef342e3819a198f135ba2de5c278c13fefffbfb4ce7dd131b7920/cer-1.2.0.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2022-12-06 11:02:00",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "github_user": "BramVanroy",
    "github_project": "CharacTER",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "lcname": "cer"
}