sacrebleu

Name: sacrebleu
Version: 2.4.3
Home page: None
Summary: Hassle-free computation of shareable, comparable, and reproducible BLEU, chrF, and TER scores
Upload time: 2024-08-17 14:36:09
Maintainer: None
Docs URL: None
Author: None
Requires Python: >=3.8
License: Apache License 2.0
Keywords: machine translation, evaluation, NLP, natural language processing, computational linguistics
# sacreBLEU

[![PyPI version](https://img.shields.io/pypi/v/sacrebleu)](https://img.shields.io/pypi/v/sacrebleu)
[![Python version](https://img.shields.io/pypi/pyversions/sacrebleu)](https://img.shields.io/pypi/pyversions/sacrebleu)
[![GitHub issues](https://img.shields.io/github/issues/mjpost/sacreBLEU.svg)](https://github.com/mjpost/sacrebleu/issues)

SacreBLEU ([Post, 2018](http://aclweb.org/anthology/W18-6319)) provides hassle-free computation of shareable, comparable, and reproducible **BLEU** scores.
Inspired by Rico Sennrich's `multi-bleu-detok.perl`, it produces the official WMT scores but works with plain text.
It also knows all the standard test sets and handles downloading, processing, and tokenization for you.

The official version is hosted at <https://github.com/mjpost/sacrebleu>.

# Motivation

Comparing BLEU scores is harder than it should be. Every decoder has its own implementation, often borrowed from Moses, but maybe with subtle changes.
Moses itself has a number of implementations as standalone scripts, with little indication of how they differ (note: they mostly don't, but `multi-bleu.pl` expects tokenized input). Different flags passed to each of these scripts can produce wide swings in the final score. All of these may handle tokenization in different ways. On top of this, downloading and managing test sets is a moderate annoyance.

Sacre bleu! What a mess.

**SacreBLEU** aims to solve these problems by wrapping the original reference implementation ([Papineni et al., 2002](https://www.aclweb.org/anthology/P02-1040.pdf)) together with other useful features.
The defaults are set the way that BLEU should be computed, and furthermore, the script outputs a short version string that allows others to know exactly what you did.
As an added bonus, it automatically downloads and manages test sets for you, so that you can simply tell it to score against `wmt14`, without having to hunt down a path on your local file system.
It is all designed to take BLEU a little more seriously.
After all, even with all its problems, BLEU is the default and---admit it---well-loved metric of our entire research community.
Sacre BLEU.

# Features

- It automatically downloads common WMT test sets and processes them to plain text
- It produces a short version string that facilitates cross-paper comparisons
- It properly computes scores on detokenized outputs, using WMT ([Conference on Machine Translation](http://statmt.org/wmt17)) standard tokenization
- It produces the same values as the official script (`mteval-v13a.pl`) used by WMT
- It outputs the BLEU score without the comma, so you don't have to remove it with `sed` (Looking at you, `multi-bleu.perl`)
- It supports different tokenizers for BLEU including support for Japanese and Chinese
- It supports **chrF, chrF++** and **Translation error rate (TER)** metrics
- It performs paired bootstrap resampling and paired approximate randomization tests for statistical significance reporting

# Breaking Changes

## v2.0.0

As of v2.0.0, the default output format has been changed to `json` for a less painful parsing experience. This means that software that parses the output of sacreBLEU should be modified to either (i) parse the JSON, for example with the `jq` utility, or (ii) pass `-f text` to sacreBLEU to preserve the old textual output. The latter can also be made **persistent** by exporting `SACREBLEU_FORMAT=text` in the relevant shell configuration files.

Here's an example of parsing the `score` key of the JSON output using `jq`:

```
$ sacrebleu -i output.detok.txt -t wmt17 -l en-de | jq -r .score
20.8
```
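
If you consume the scores from Python rather than the shell, the default JSON output can also be parsed with the standard library. A minimal sketch, assuming `sacrebleu` is installed on the PATH and the example files above exist:

```python
import json
import subprocess

# Run sacreBLEU and capture its JSON output (the default format as of v2.0.0).
proc = subprocess.run(
    ["sacrebleu", "-i", "output.detok.txt", "-t", "wmt17", "-l", "en-de"],
    capture_output=True, text=True, check=True,
)

result = json.loads(proc.stdout)  # a single JSON object
print(result["score"])            # e.g. 20.8
print(result["signature"])        # the full metric signature
```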

# Installation

Install the official Python module from PyPI (**Python>=3.8 only**):

    pip install sacrebleu

In order to install Japanese tokenizer support through `mecab-python3`, you need to run the
following command instead, to perform a full installation with dependencies:

    pip install "sacrebleu[ja]"

In order to install Korean tokenizer support through `pymecab-ko`, you need to run the
following command instead, to perform a full installation with dependencies:

    pip install "sacrebleu[ko]"

# Command-line Usage

You can get a list of available test sets with `sacrebleu --list`. Please see [DATASETS.md](DATASETS.md)
for an up-to-date list of supported datasets. You can also list available test sets for a given language pair
with `sacrebleu --list -l en-fr`.
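
These lists can also be queried from Python. A hedged sketch, assuming the `get_available_testsets` and `get_langpairs_for_testset` helpers are exported by your installed sacrebleu version:

```python
import sacrebleu

# Assumption: these helpers are available at the package level in your version.
print(sacrebleu.get_available_testsets())            # e.g. ['wmt22', 'wmt21', ...]
print(sacrebleu.get_langpairs_for_testset("wmt17"))  # language pairs for wmt17
```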

## Basics

### Downloading test sets

Downloading is triggered when you request a test set. If the dataset is not available, it is downloaded
and unpacked.

E.g., you can use the following commands to download the source, pass it through your translation system
in `translate.sh`, and then score it:

```
$ sacrebleu -t wmt17 -l en-de --echo src > wmt17.en-de.en
$ cat wmt17.en-de.en | translate.sh | sacrebleu -t wmt17 -l en-de
```

Some test sets also have the outputs of systems that were submitted to the task.
For example, the `wmt21/systems` test set:

```bash
$ sacrebleu -t wmt21/systems -l zh-en --echo NiuTrans
```

This provides a convenient way to score these system outputs:

```bash
$ sacrebleu -t wmt21/systems -l zh-en --echo NiuTrans | sacrebleu -t wmt21/systems -l zh-en
```

You can see a list of the available outputs by passing an invalid value to `--echo`.

### JSON output

As of version `>=2.0.0`, sacreBLEU prints the computed scores in JSON format to make parsing less painful:

```
$ sacrebleu -i output.detok.txt -t wmt17 -l en-de
```

```json
{
 "name": "BLEU",
 "score": 20.8,
 "signature": "nrefs:1|case:mixed|eff:no|tok:13a|smooth:exp|version:2.0.0",
 "verbose_score": "54.4/26.6/14.9/8.7 (BP = 1.000 ratio = 1.026 hyp_len = 62880 ref_len = 61287)",
 "nrefs": "1",
 "case": "mixed",
 "eff": "no",
 "tok": "13a",
 "smooth": "exp",
 "version": "2.0.0"
}
```

If you want to keep the old behavior, you can pass `-f text` or export `SACREBLEU_FORMAT=text`:

```
$ sacrebleu -i output.detok.txt -t wmt17 -l en-de -f text
BLEU|nrefs:1|case:mixed|eff:no|tok:13a|smooth:exp|version:2.0.0 = 20.8 54.4/26.6/14.9/8.7 (BP = 1.000 ratio = 1.026 hyp_len = 62880 ref_len = 61287)
```

### Scoring

(All examples below use the old-style text output for a compact representation that saves space.)

Let's say that you just translated the `en-de` test set of WMT17 with your fancy MT system and the **detokenized** translations are in a file called `output.detok.txt`:

```
# Option 1: Redirect system output to STDIN
$ cat output.detok.txt | sacrebleu -t wmt17 -l en-de
BLEU|nrefs:1|case:mixed|eff:no|tok:13a|smooth:exp|version:2.0.0 = 20.8 54.4/26.6/14.9/8.7 (BP = 1.000 ratio = 1.026 hyp_len = 62880 ref_len = 61287)

# Option 2: Use the --input/-i argument
$ sacrebleu -t wmt17 -l en-de -i output.detok.txt
BLEU|nrefs:1|case:mixed|eff:no|tok:13a|smooth:exp|version:2.0.0 = 20.8 54.4/26.6/14.9/8.7 (BP = 1.000 ratio = 1.026 hyp_len = 62880 ref_len = 61287)
```

You can obtain a short version of the signature with `--short/-sh`:

```
$ sacrebleu -t wmt17 -l en-de -i output.detok.txt -sh
BLEU|#:1|c:mixed|e:no|tok:13a|s:exp|v:2.0.0 = 20.8 54.4/26.6/14.9/8.7 (BP = 1.000 ratio = 1.026 hyp_len = 62880 ref_len = 61287)
```

If you only want the score to be printed, you can use the `--score-only/-b` flag:

```
$ sacrebleu -t wmt17 -l en-de -i output.detok.txt -b
20.8
```

The precision of the scores can be configured via the `--width/-w` flag:

```
$ sacrebleu -t wmt17 -l en-de -i output.detok.txt -b -w 4
20.7965
```

### Using your own reference file

SacreBLEU knows about common test sets (as detailed in the `--list` example above), but you can also use it to score system outputs against arbitrary references. In this case, do not forget to provide **detokenized** reference and hypothesis files:

```
# Let's save the reference to a text file
$ sacrebleu -t wmt17 -l en-de --echo ref > ref.detok.txt

# Option 1: Pass the reference file as a positional argument to sacreBLEU
$ sacrebleu ref.detok.txt -i output.detok.txt -m bleu -b -w 4
20.7965

# Option 2: Redirect the system into STDIN (Compatible with multi-bleu.perl way of doing things)
$ cat output.detok.txt | sacrebleu ref.detok.txt -m bleu -b -w 4
20.7965
```
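
The same file-based workflow can also be reproduced with the Python API described further below. A minimal sketch, assuming the file names used in the example above:

```python
from sacrebleu.metrics import BLEU

# Read the detokenized hypotheses and the single reference stream.
with open("output.detok.txt", encoding="utf-8") as f:
    hyps = [line.rstrip("\n") for line in f]
with open("ref.detok.txt", encoding="utf-8") as f:
    refs = [[line.rstrip("\n") for line in f]]  # a list with one reference stream

bleu = BLEU()
print(bleu.corpus_score(hyps, refs))
```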

### Using multiple metrics

Let's first compute BLEU, chrF and TER with the default settings:

```
$ sacrebleu -t wmt17 -l en-de -i output.detok.txt -m bleu chrf ter
        BLEU|nrefs:1|case:mixed|eff:no|tok:13a|smooth:exp|version:2.0.0 = 20.8 <stripped>
      chrF2|nrefs:1|case:mixed|eff:yes|nc:6|nw:0|space:no|version:2.0.0 = 52.0
TER|nrefs:1|case:lc|tok:tercom|norm:no|punct:yes|asian:no|version:2.0.0 = 69.0
```

Let's now enable `chrF++`, a revised version of chrF that also takes word n-grams into account.
Observe how `nw:0` changes to `nw:2` in the signature:

```
$ sacrebleu -t wmt17 -l en-de -i output.detok.txt -m bleu chrf ter --chrf-word-order 2
        BLEU|nrefs:1|case:mixed|eff:no|tok:13a|smooth:exp|version:2.0.0 = 20.8 <stripped>
    chrF2++|nrefs:1|case:mixed|eff:yes|nc:6|nw:2|space:no|version:2.0.0 = 49.0
TER|nrefs:1|case:lc|tok:tercom|norm:no|punct:yes|asian:no|version:2.0.0 = 69.0
```

Metric-specific arguments are detailed in the output of `--help`:

```
BLEU related arguments:
  --smooth-method {none,floor,add-k,exp}, -s {none,floor,add-k,exp}
                        Smoothing method: exponential decay, floor (increment zero counts), add-k (increment num/denom by k for n>1), or none. (Default: exp)
  --smooth-value BLEU_SMOOTH_VALUE, -sv BLEU_SMOOTH_VALUE
                        The smoothing value. Only valid for floor and add-k. (Defaults: floor: 0.1, add-k: 1)
  --tokenize {none,zh,13a,char,intl,ja-mecab,ko-mecab}, -tok {none,zh,13a,char,intl,ja-mecab,ko-mecab}
                        Tokenization method to use for BLEU. If not provided, defaults to `zh` for Chinese, `ja-mecab` for Japanese, `ko-mecab` for Korean and `13a` (mteval) otherwise.
  --lowercase, -lc      If True, enables case-insensitivity. (Default: False)
  --force               Insist that your tokenized input is actually detokenized.

chrF related arguments:
  --chrf-char-order CHRF_CHAR_ORDER, -cc CHRF_CHAR_ORDER
                        Character n-gram order. (Default: 6)
  --chrf-word-order CHRF_WORD_ORDER, -cw CHRF_WORD_ORDER
                        Word n-gram order (Default: 0). If equals to 2, the metric is referred to as chrF++.
  --chrf-beta CHRF_BETA
                        Determine the importance of recall w.r.t precision. (Default: 2)
  --chrf-whitespace     Include whitespaces when extracting character n-grams. (Default: False)
  --chrf-lowercase      Enable case-insensitivity. (Default: False)
  --chrf-eps-smoothing  Enables epsilon smoothing similar to chrF++.py, NLTK and Moses; instead of effective order smoothing. (Default: False)

TER related arguments (The defaults replicate TERCOM's behavior):
  --ter-case-sensitive  Enables case sensitivity (Default: False)
  --ter-asian-support   Enables special treatment of Asian characters (Default: False)
  --ter-no-punct        Removes punctuation. (Default: False)
  --ter-normalized      Applies basic normalization and tokenization. (Default: False)
```

### Version Signatures
As you may have noticed, sacreBLEU generates version strings such as `BLEU|nrefs:1|case:mixed|eff:no|tok:13a|smooth:exp|version:2.0.0` for reproducibility reasons. It's strongly recommended to share these signatures in your papers!

### Outputting other metadata

Sacrebleu knows about metadata for some test sets, and you can output it like this:

```
$ sacrebleu -t wmt21 -l en-de --echo src docid ref | head -n 2
Couple MACED at California dog park for not wearing face masks while having lunch (VIDEO) - RT USA News	rt.com.131279	Paar in Hundepark in Kalifornien mit Pfefferspray besprüht, weil es beim Mittagessen keine Masken trug (VIDEO) - RT USA News
There's mask-shaming and then there's full on assault.	rt.com.131279	Masken-Shaming ist eine Sache, Körperverletzung eine andere.
```

If multiple fields are requested, they are output as tab-separated columns (a TSV).
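
If you save that output to a file, the TSV is easy to consume with the Python standard library. A small sketch (the file name `wmt21.meta.tsv` is hypothetical, e.g. produced with `--echo src docid ref > wmt21.meta.tsv`):

```python
import csv
from collections import defaultdict

# Group (source, reference) pairs by document id from the echoed TSV.
docs = defaultdict(list)
with open("wmt21.meta.tsv", encoding="utf-8", newline="") as f:
    for src, docid, ref in csv.reader(f, delimiter="\t"):
        docs[docid].append((src, ref))

print(f"{len(docs)} documents")
```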

To see the available fields, add `--echo asdf` (or some other garbage data):

```
$ sacrebleu -t wmt21 -l en-de --echo asdf
sacreBLEU: No such field asdf in test set wmt21 for language pair en-de.
sacreBLEU: available fields for wmt21/en-de: src, ref:A, ref, docid, origlang
```

## Translationese Support

If you are interested in the translationese effect, you can evaluate BLEU on a subset of sentences
with a given original language (identified based on the `origlang` tag in the raw SGM files).
E.g., to evaluate only against originally German sentences translated to English use:

    $ sacrebleu -t wmt13 -l de-en --origlang=de -i my-wmt13-output.txt

and to evaluate against the complement (i.e. sentences whose `origlang` is anything other than `de`) use:

    $ sacrebleu -t wmt13 -l de-en --origlang=non-de -i my-wmt13-output.txt

**Please note** that the evaluator will return a BLEU score only on the requested subset,
but it expects that you pass it the entire translated test set.

## Languages & Preprocessing

### BLEU

- You can compute case-insensitive BLEU by passing `--lowercase` to sacreBLEU
- The default tokenizer for BLEU is `13a`, which mimics the `mteval-v13a` script from Moses.
- Other tokenizers are:
   - `none`, which will not apply any kind of tokenization at all
   - `char` for language-agnostic character-level tokenization
   - `intl`, which applies international tokenization and mimics the `mteval-v14` script from Moses
   - `zh`, which separates out **Chinese** characters and tokenizes the non-Chinese parts using the `13a` tokenizer
   - `ja-mecab`, which tokenizes **Japanese** inputs using the [MeCab](https://pypi.org/project/mecab-python3) morphological analyzer
   - `ko-mecab`, which tokenizes **Korean** inputs using the [MeCab-ko](https://pypi.org/project/mecab-ko) morphological analyzer
   - `flores101` and `flores200`, which use the SentencePiece models built from the Flores-101 and [Flores-200](https://github.com/facebookresearch/flores/blob/main/flores200/README.md#languages-in-flores-200) datasets, respectively. Note: the canonical .spm file will be automatically fetched if not found locally.
- You can switch tokenizers using the `--tokenize` flag of sacreBLEU. Alternatively, if you provide a language-pair string
  using `--language-pair/-l`, the `zh`, `ja-mecab` and `ko-mecab` tokenizers will be used automatically if the target language is `zh`, `ja` or `ko`, respectively.
- **Note that** there is no automatic language detection from the hypotheses, so you need to make sure that you correctly
  select the tokenizer for **Japanese**, **Korean** and **Chinese**.


The default `13a` tokenizer will produce poor results for Japanese:

```
$ sacrebleu kyoto-test.ref.ja -i kyoto-test.hyp.ja -b
2.1
```

Let's use the `ja-mecab` tokenizer:
```
$ sacrebleu kyoto-test.ref.ja -i kyoto-test.hyp.ja --tokenize ja-mecab -b
14.5
```

If you provide the language pair, sacreBLEU will use `ja-mecab` automatically:

```
$ sacrebleu kyoto-test.ref.ja -i kyoto-test.hyp.ja -l en-ja -b
14.5
```
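
The same tokenizer choice is available through the Python API described later in this README. A hedged sketch, assuming the `tokenize` keyword argument of the `BLEU` class and that the `sacrebleu[ja]` extra is installed:

```python
from sacrebleu.metrics import BLEU

refs = [["猫はマットの上に座っていた。"]]
sys = ["猫がマットに座った。"]

# Explicitly request the Japanese MeCab tokenizer (requires `pip install "sacrebleu[ja]"`).
bleu_ja = BLEU(tokenize="ja-mecab")
print(bleu_ja.corpus_score(sys, refs))
print(bleu_ja.get_signature())  # the signature should include tok:ja-mecab
```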

### chrF / chrF++

chrF applies minimal to no pre-processing as it deals with character n-grams (see also the Python sketch after this list):

- If you pass `--chrf-whitespace`, whitespace characters will be preserved when computing character n-grams.
- If you pass `--chrf-lowercase`, sacreBLEU will compute case-insensitive chrF.
- If you enable a non-zero `--chrf-word-order` (pass `2` for `chrF++`), a very simple punctuation tokenization will be applied internally.
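
These options are mirrored by the `CHRF` class of the Python API. A minimal sketch, assuming constructor arguments named after the flags above (e.g. `word_order`):

```python
from sacrebleu.metrics import CHRF

refs = [["The cat sat on the mat."]]
sys = ["The cat was sitting on the mat."]

# word_order=2 enables word n-grams, i.e. chrF++.
chrf_pp = CHRF(word_order=2)
print(chrf_pp.corpus_score(sys, refs))
print(chrf_pp.get_signature())  # expect nw:2 in the signature
```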


### TER

Translation Error Rate (TER) has its own special tokenizer that you can configure through the command line.
The defaults are **compatible with the upstream TER implementation (TERCOM)**, but you can modify the
behavior through the command-line flags below (a Python sketch follows the list):

- TER is by default case-insensitive. Pass `--ter-case-sensitive` to enable case-sensitivity.
- Pass `--ter-normalize` to apply a general Western tokenization
- Pass `--ter-asian-support` to enable the tokenization of Asian characters. If provided with `--ter-normalize`,
  both will be applied.
- Pass `--ter-no-punct` to strip punctuation.
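
A similar sketch for TER through the Python API, assuming the `TER` class exposes these options as constructor arguments (e.g. `normalized`, `case_sensitive`):

```python
from sacrebleu.metrics import TER

refs = [["The cat sat on the mat."]]
sys = ["The cat was sitting on the mat."]

# Enable TERCOM-style normalization and case sensitivity.
ter = TER(normalized=True, case_sensitive=True)
print(ter.corpus_score(sys, refs))
print(ter.get_signature())
```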

## Multi-reference Evaluation

All three metrics support the use of multiple references during evaluation. Let's first pass all references as positional arguments:

```
$ sacrebleu ref1 ref2 -i system -m bleu chrf ter
        BLEU|nrefs:2|case:mixed|eff:no|tok:13a|smooth:exp|version:2.0.0 = 61.8 <stripped>
      chrF2|nrefs:2|case:mixed|eff:yes|nc:6|nw:0|space:no|version:2.0.0 = 75.0
TER|nrefs:2|case:lc|tok:tercom|norm:no|punct:yes|asian:no|version:2.0.0 = 31.2
```

Alternatively (less recommended), you can concatenate the references into a single file using tabs as delimiters. Don't forget to pass `--num-refs/-nr` in this case!

```
$ paste ref1 ref2 > refs.tsv

$ sacrebleu refs.tsv --num-refs 2 -i system -m bleu
BLEU|nrefs:2|case:mixed|eff:no|tok:13a|smooth:exp|version:2.0.0 = 61.8 <stripped>
```

## Multi-system Evaluation
As of version `>=2.0.0`, SacreBLEU supports evaluation of an arbitrary number of systems for a particular
test set and language-pair. This has the advantage of seeing all results in a
nicely formatted table.

Let's pass all system output files that match the shell glob `newstest2017.online-*` to sacreBLEU for evaluation:

```
$ sacrebleu -t wmt17 -l en-de -i newstest2017.online-* -m bleu chrf
╒═══════════════════════════════╤════════╤═════════╕
│                        System │  BLEU  │  chrF2  │
╞═══════════════════════════════╪════════╪═════════╡
│ newstest2017.online-A.0.en-de │  20.8  │  52.0   │
├───────────────────────────────┼────────┼─────────┤
│ newstest2017.online-B.0.en-de │  26.7  │  56.3   │
├───────────────────────────────┼────────┼─────────┤
│ newstest2017.online-F.0.en-de │  15.5  │  49.3   │
├───────────────────────────────┼────────┼─────────┤
│ newstest2017.online-G.0.en-de │  18.2  │  51.6   │
╘═══════════════════════════════╧════════╧═════════╛

-----------------
Metric signatures
-----------------
 - BLEU       nrefs:1|case:mixed|eff:no|tok:13a|smooth:exp|version:2.0.0
 - chrF2      nrefs:1|case:mixed|eff:yes|nc:6|nw:0|space:no|version:2.0.0
```

You can also change the output format to `latex`:

```
$ sacrebleu -t wmt17 -l en-de -i newstest2017.online-* -m bleu chrf -f latex
\begin{tabular}{rcc}
\toprule
                        System &  BLEU  &  chrF2  \\
\midrule
 newstest2017.online-A.0.en-de &  20.8  &  52.0   \\
 newstest2017.online-B.0.en-de &  26.7  &  56.3   \\
 newstest2017.online-F.0.en-de &  15.5  &  49.3   \\
 newstest2017.online-G.0.en-de &  18.2  &  51.6   \\
\bottomrule
\end{tabular}

...
```

## Confidence Intervals for Single System Evaluation

When enabled with the `--confidence` flag, SacreBLEU will print
(1) the actual system score, (2) the true mean estimated from bootstrap resampling, and (3)
the 95% [confidence interval](https://en.wikipedia.org/wiki/Confidence_interval) around the mean.
By default, the number of bootstrap resamples is 1000 (`bs:1000` in the signature);
it can be changed with `--confidence-n`:

```
$ sacrebleu -t wmt17 -l en-de -i output.detok.txt -m bleu chrf --confidence -f text --short
   BLEU|#:1|bs:1000|rs:12345|c:mixed|e:no|tok:13a|s:exp|v:2.0.0 = 22.675 (μ = 22.669 ± 0.598) ...
chrF2|#:1|bs:1000|rs:12345|c:mixed|e:yes|nc:6|nw:0|s:no|v:2.0.0 = 51.953 (μ = 51.953 ± 0.462)
```

**NOTE:** Although provided as a feature, having access to confidence intervals for just one system
may not reveal much information about the underlying model. It often makes more sense to perform
**paired statistical tests** across multiple systems.

**NOTE:** When resampling, the seed of `numpy`'s random number generator (RNG)
is fixed to `12345`. If you want to relax this and set your own seed, you can
export the environment variable `SACREBLEU_SEED` as an integer. Alternatively, you can export
`SACREBLEU_SEED=None` to skip seeding the RNG and allow for non-deterministic
behavior.

## Paired Significance Tests for Multi System Evaluation
Ideally, one would have access to many systems in cases such as (1) investigating
whether a newly added feature yields significantly different scores than the baseline or
(2) evaluating submissions for a particular shared task. SacreBLEU offers two different paired significance tests that are widely used in MT research.

### Paired bootstrap resampling (--paired-bs)

This is an efficient implementation of the paper [Statistical Significance Tests for Machine Translation Evaluation](https://www.aclweb.org/anthology/W04-3250.pdf) and is result-compliant with the [reference Moses implementation](https://github.com/moses-smt/mosesdecoder/blob/master/scripts/analysis/bootstrap-hypothesis-difference-significance.pl). The number of bootstrap resamples can be changed with the `--paired-bs-n` flag and its default is 1000.

When launched, paired bootstrap resampling will perform:
 - Bootstrap resampling to estimate 95% CI for all systems and the baseline
 - A significance test between the **baseline** and each **system** to compute a [p-value](https://en.wikipedia.org/wiki/P-value).

### Paired approximate randomization (--paired-ar)

Paired approximate randomization (AR) is another type of paired significance test that is claimed to be more accurate than paired bootstrap resampling when it comes to Type-I errors ([Riezler and Maxwell III, 2005](https://www.aclweb.org/anthology/W05-0908.pdf)). A Type-I error occurs when the null hypothesis is rejected even though it is actually true (a false positive). In other words, AR should in theory be more robust to subtle changes across systems.

Our implementation is verified to be result-compliant with the [Multeval toolkit](https://github.com/jhclark/multeval) that also uses paired AR test for pairwise comparison. The number of approximate randomization trials is set to 10,000 by default. This can be changed with the `--paired-ar-n` flag.

### Running the tests

- The **first system** provided to `--input/-i` will be automatically taken as the **baseline system** against which you want to compare the **other systems.**
- When `--input/-i` is used, the system output files will be automatically named according to their file paths. For the sake of simplicity, SacreBLEU will automatically discard the **baseline system** if it also appears amongst the **other systems**. This is useful if you would like to run the tool as `-i systems/baseline.txt systems/*.txt`: here, the `baseline.txt` file will not also be considered as a candidate system.
- Alternatively, you can also redirect a tab-separated input file to SacreBLEU. In this case, the hypotheses in the first column will be taken as the **baseline system**. However, this method is **not recommended** as it won't allow naming your systems in a human-readable way; the systems will instead be enumerated from 1 to N following the column order of the tab-separated input.
- On Linux and Mac OS X, you can run the tests on multiple CPUs by passing the flag `--paired-jobs N`. If `N == 0`, SacreBLEU will launch one worker for each pairwise comparison. If `N > 0`, `N` worker processes will be spawned. This substantially speeds up the runtime, especially if you want the **TER** metric to be computed.

#### Example: Paired bootstrap resampling
In the example below, we select `newstest2017.LIUM-NMT.4900.en-de` as the baseline and compare it to 4 other WMT17 submissions using paired bootstrap resampling. According to the results, the null hypothesis (i.e. the two systems being essentially the same) could not be rejected (at the significance level of 0.05) for the following comparisons:

- 0.1 BLEU difference between the baseline and the online-B system (p = 0.3077)

```
$ sacrebleu -t wmt17 -l en-de -i newstest2017.LIUM-NMT.4900.en-de newstest2017.online-* -m bleu chrf --paired-bs
╒════════════════════════════════════════════╤═════════════════════╤══════════════════════╕
│                                     System │  BLEU (μ ± 95% CI)  │  chrF2 (μ ± 95% CI)  │
╞════════════════════════════════════════════╪═════════════════════╪══════════════════════╡
│ Baseline: newstest2017.LIUM-NMT.4900.en-de │  26.6 (26.6 ± 0.6)  │  55.9 (55.9 ± 0.5)   │
├────────────────────────────────────────────┼─────────────────────┼──────────────────────┤
│              newstest2017.online-A.0.en-de │  20.8 (20.8 ± 0.6)  │  52.0 (52.0 ± 0.4)   │
│                                            │    (p = 0.0010)*    │    (p = 0.0010)*     │
├────────────────────────────────────────────┼─────────────────────┼──────────────────────┤
│              newstest2017.online-B.0.en-de │  26.7 (26.6 ± 0.7)  │  56.3 (56.3 ± 0.5)   │
│                                            │    (p = 0.3077)     │    (p = 0.0240)*     │
├────────────────────────────────────────────┼─────────────────────┼──────────────────────┤
│              newstest2017.online-F.0.en-de │  15.5 (15.4 ± 0.5)  │  49.3 (49.3 ± 0.4)   │
│                                            │    (p = 0.0010)*    │    (p = 0.0010)*     │
├────────────────────────────────────────────┼─────────────────────┼──────────────────────┤
│              newstest2017.online-G.0.en-de │  18.2 (18.2 ± 0.5)  │  51.6 (51.6 ± 0.4)   │
│                                            │    (p = 0.0010)*    │    (p = 0.0010)*     │
╘════════════════════════════════════════════╧═════════════════════╧══════════════════════╛

------------------------------------------------------------
Paired bootstrap resampling test with 1000 resampling trials
------------------------------------------------------------
 - Each system is pairwise compared to Baseline: newstest2017.LIUM-NMT.4900.en-de.
   Actual system score / bootstrap estimated true mean / 95% CI are provided for each metric.

 - Null hypothesis: the system and the baseline translations are essentially
   generated by the same underlying process. For a given system and the baseline,
   the p-value is roughly the probability of the absolute score difference (delta)
   or higher occurring due to chance, under the assumption that the null hypothesis is correct.

 - Assuming a significance threshold of 0.05, the null hypothesis can be rejected
   for p-values < 0.05 (marked with "*"). This means that the delta is unlikely to be attributed
   to chance, hence the system is significantly "different" than the baseline.
   Otherwise, the p-values are highlighted in red.

 - NOTE: Significance does not tell whether a system is "better" than the baseline but rather
   emphasizes the "difference" of the systems in terms of the replicability of the delta.

-----------------
Metric signatures
-----------------
 - BLEU       nrefs:1|bs:1000|seed:12345|case:mixed|eff:no|tok:13a|smooth:exp|version:2.0.0
 - chrF2      nrefs:1|bs:1000|seed:12345|case:mixed|eff:yes|nc:6|nw:0|space:no|version:2.0.0
```

#### Example: Paired approximate randomization

Let's now run the paired approximate randomization test for the same comparison. According to the results, the findings are compatible with the paired bootstrap resampling test. However, the p-value for the `baseline vs. online-B` comparison is much higher (`0.8066`) than the paired bootstrap resampling test.

(**Note that** the AR test does not provide confidence intervals around the true mean as it does not perform bootstrap resampling.)

```
$ sacrebleu -t wmt17 -l en-de -i newstest2017.LIUM-NMT.4900.en-de newstest2017.online-* -m bleu chrf --paired-ar
╒════════════════════════════════════════════╤═══════════════╤═══════════════╕
│                                     System │     BLEU      │     chrF2     │
╞════════════════════════════════════════════╪═══════════════╪═══════════════╡
│ Baseline: newstest2017.LIUM-NMT.4900.en-de │     26.6      │     55.9      │
├────────────────────────────────────────────┼───────────────┼───────────────┤
│              newstest2017.online-A.0.en-de │     20.8      │     52.0      │
│                                            │ (p = 0.0001)* │ (p = 0.0001)* │
├────────────────────────────────────────────┼───────────────┼───────────────┤
│              newstest2017.online-B.0.en-de │     26.7      │     56.3      │
│                                            │ (p = 0.8066)  │ (p = 0.0385)* │
├────────────────────────────────────────────┼───────────────┼───────────────┤
│              newstest2017.online-F.0.en-de │     15.5      │     49.3      │
│                                            │ (p = 0.0001)* │ (p = 0.0001)* │
├────────────────────────────────────────────┼───────────────┼───────────────┤
│              newstest2017.online-G.0.en-de │     18.2      │     51.6      │
│                                            │ (p = 0.0001)* │ (p = 0.0001)* │
╘════════════════════════════════════════════╧═══════════════╧═══════════════╛

-------------------------------------------------------
Paired approximate randomization test with 10000 trials
-------------------------------------------------------
 - Each system is pairwise compared to Baseline: newstest2017.LIUM-NMT.4900.en-de.
   Actual system score is provided for each metric.

 - Null hypothesis: the system and the baseline translations are essentially
   generated by the same underlying process. For a given system and the baseline,
   the p-value is roughly the probability of the absolute score difference (delta)
   or higher occurring due to chance, under the assumption that the null hypothesis is correct.

 - Assuming a significance threshold of 0.05, the null hypothesis can be rejected
   for p-values < 0.05 (marked with "*"). This means that the delta is unlikely to be attributed
   to chance, hence the system is significantly "different" than the baseline.
   Otherwise, the p-values are highlighted in red.

 - NOTE: Significance does not tell whether a system is "better" than the baseline but rather
   emphasizes the "difference" of the systems in terms of the replicability of the delta.

-----------------
Metric signatures
-----------------
 - BLEU       nrefs:1|ar:10000|seed:12345|case:mixed|eff:no|tok:13a|smooth:exp|version:2.0.0
 - chrF2      nrefs:1|ar:10000|seed:12345|case:mixed|eff:yes|nc:6|nw:0|space:no|version:2.0.0
```

# Using SacreBLEU from Python

For evaluation, it may be useful to compute BLEU, chrF, or TER from a Python script. The recommended
way of doing this is to use the object-oriented API, for example by creating an instance of the
`metrics.BLEU` class:

```python
In [1]: from sacrebleu.metrics import BLEU, CHRF, TER
   ...:
   ...: refs = [ # First set of references
   ...:          ['The dog bit the man.', 'It was not unexpected.', 'The man bit him first.'],
   ...:          # Second set of references
   ...:          ['The dog had bit the man.', 'No one was surprised.', 'The man had bitten the dog.'],
   ...:        ]
   ...: sys = ['The dog bit the man.', "It wasn't surprising.", 'The man had just bitten him.']

In [2]: bleu = BLEU()

In [3]: bleu.corpus_score(sys, refs)
Out[3]: BLEU = 48.53 82.4/50.0/45.5/37.5 (BP = 0.943 ratio = 0.944 hyp_len = 17 ref_len = 18)

In [4]: bleu.get_signature()
Out[4]: nrefs:2|case:mixed|eff:no|tok:13a|smooth:exp|version:2.0.0

In [5]: chrf = CHRF()

In [6]: chrf.corpus_score(sys, refs)
Out[6]: chrF2 = 59.73
```
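
The metric objects also support sentence-level scoring via `sentence_score`, which takes a single hypothesis string and a list of reference strings. A short, self-contained sketch (for sentence-level BLEU, enabling `effective_order` is commonly recommended):

```python
from sacrebleu.metrics import BLEU, CHRF

hyp = 'The dog bit the man.'
hyp_refs = ['The dog bit the man.', 'The dog had bit the man.']  # references for this sentence

# effective_order=True avoids zero scores when higher-order n-gram matches are absent.
print(BLEU(effective_order=True).sentence_score(hyp, hyp_refs))
print(CHRF().sentence_score(hyp, hyp_refs))
```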

### Variable Number of References

Let's now remove the first reference sentence for the first system sentence `The dog bit the man.` by replacing it with either `None` or the empty string `''`.
This allows using a variable number of reference segments per hypothesis. Observe how the signature changes from `nrefs:2` to `nrefs:var`:

```python
In [1]: from sacrebleu.metrics import BLEU, CHRF, TER
   ...:
   ...: refs = [ # First set of references
   ...:          # 1st sentence does not have a ref here
   ...:          ['', 'It was not unexpected.', 'The man bit him first.'],
   ...:          # Second set of references
   ...:          ['The dog had bit the man.', 'No one was surprised.', 'The man had bitten the dog.'],
   ...:        ]
   ...: sys = ['The dog bit the man.', "It wasn't surprising.", 'The man had just bitten him.']

In [2]: bleu = BLEU()

In [3]: bleu.corpus_score(sys, refs)
Out[3]: BLEU = 29.44 82.4/42.9/27.3/12.5 (BP = 0.889 ratio = 0.895 hyp_len = 17 ref_len = 19)

In [4]: bleu.get_signature()
Out[4]: nrefs:var|case:mixed|eff:no|tok:13a|smooth:exp|version:2.0.0
```

## Compatibility API

You can also use the compatibility API, which provides wrapper functions around the object-oriented API to
compute sentence-level and corpus-level BLEU, chrF and TER. (Note that this API may be removed
in future releases.)

```python
In [1]: import sacrebleu
   ...: 
   ...: refs = [ # First set of references
   ...:          ['The dog bit the man.', 'It was not unexpected.', 'The man bit him first.'],
   ...:          # Second set of references
   ...:          ['The dog had bit the man.', 'No one was surprised.', 'The man had bitten the dog.'],
   ...:        ]
   ...: sys = ['The dog bit the man.', "It wasn't surprising.", 'The man had just bitten him.']

In [2]: sacrebleu.corpus_bleu(sys, refs)
Out[2]: BLEU = 48.53 82.4/50.0/45.5/37.5 (BP = 0.943 ratio = 0.944 hyp_len = 17 ref_len = 18)
```

# License

SacreBLEU is licensed under the [Apache 2.0 License](LICENSE.txt).

# Credits

This was all [Rico Sennrich's idea](https://twitter.com/RicoSennrich/status/883246242763026433).
Originally written by Matt Post.
New features and ongoing support provided by Martin Popel (@martinpopel) and Ozan Caglayan (@ozancaglayan).

If you use SacreBLEU, please cite the following:

```
@inproceedings{post-2018-call,
  title = "A Call for Clarity in Reporting {BLEU} Scores",
  author = "Post, Matt",
  booktitle = "Proceedings of the Third Conference on Machine Translation: Research Papers",
  month = oct,
  year = "2018",
  address = "Belgium, Brussels",
  publisher = "Association for Computational Linguistics",
  url = "https://www.aclweb.org/anthology/W18-6319",
  pages = "186--191",
}
```

# Release Notes

Please see [CHANGELOG.md](CHANGELOG.md) for release notes.

            

Raw data

            {
    "_id": null,
    "home_page": null,
    "name": "sacrebleu",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.8",
    "maintainer_email": "Matt Post <post@cs.jhu.edu>",
    "keywords": "machine translation, evaluation, NLP, natural language processing, computational linguistics",
    "author": null,
    "author_email": "Matt Post <post@cs.jhu.edu>",
    "download_url": "https://files.pythonhosted.org/packages/17/71/9bec1dfed1ee74dc477666d236ac38976e36f847150f03b55b338874d26e/sacrebleu-2.4.3.tar.gz",
    "platform": null,
    "description": "# sacreBLEU\n\n[![PyPI version](https://img.shields.io/pypi/v/sacrebleu)](https://img.shields.io/pypi/v/sacrebleu)\n[![Python version](https://img.shields.io/pypi/pyversions/sacrebleu)](https://img.shields.io/pypi/pyversions/sacrebleu)\n[![GitHub issues](https://img.shields.io/github/issues/mjpost/sacreBLEU.svg)](https://github.com/mjpost/sacrebleu/issues)\n\nSacreBLEU ([Post, 2018](http://aclweb.org/anthology/W18-6319)) provides hassle-free computation of shareable, comparable, and reproducible **BLEU** scores.\nInspired by Rico Sennrich's `multi-bleu-detok.perl`, it produces the official WMT scores but works with plain text.\nIt also knows all the standard test sets and handles downloading, processing, and tokenization for you.\n\nThe official version is hosted at <https://github.com/mjpost/sacrebleu>.\n\n# Motivation\n\nComparing BLEU scores is harder than it should be. Every decoder has its own implementation, often borrowed from Moses, but maybe with subtle changes.\nMoses itself has a number of implementations as standalone scripts, with little indication of how they differ (note: they mostly don't, but `multi-bleu.pl` expects tokenized input). Different flags passed to each of these scripts can produce wide swings in the final score. All of these may handle tokenization in different ways. On top of this, downloading and managing test sets is a moderate annoyance.\n\nSacre bleu! What a mess.\n\n**SacreBLEU** aims to solve these problems by wrapping the original reference implementation ([Papineni et al., 2002](https://www.aclweb.org/anthology/P02-1040.pdf)) together with other useful features.\nThe defaults are set the way that BLEU should be computed, and furthermore, the script outputs a short version string that allows others to know exactly what you did.\nAs an added bonus, it automatically downloads and manages test sets for you, so that you can simply tell it to score against `wmt14`, without having to hunt down a path on your local file system.\nIt is all designed to take BLEU a little more seriously.\nAfter all, even with all its problems, BLEU is the default and---admit it---well-loved metric of our entire research community.\nSacre BLEU.\n\n# Features\n\n- It automatically downloads common WMT test sets and processes them to plain text\n- It produces a short version string that facilitates cross-paper comparisons\n- It properly computes scores on detokenized outputs, using WMT ([Conference on Machine Translation](http://statmt.org/wmt17)) standard tokenization\n- It produces the same values as the official script (`mteval-v13a.pl`) used by WMT\n- It outputs the BLEU score without the comma, so you don't have to remove it with `sed` (Looking at you, `multi-bleu.perl`)\n- It supports different tokenizers for BLEU including support for Japanese and Chinese\n- It supports **chrF, chrF++** and **Translation error rate (TER)** metrics\n- It performs paired bootstrap resampling and paired approximate randomization tests for statistical significance reporting\n\n# Breaking Changes\n\n## v2.0.0\n\nAs of v2.0.0, the default output format is changed to `json` for less painful parsing experience. This means that software that parse the output of sacreBLEU should be modified to either (i) parse the JSON using for example the `jq` utility or (ii) pass `-f text` to sacreBLEU to preserve the old textual output. 
The latter change can also be made **persistently** by exporting `SACREBLEU_FORMAT=text` in relevant shell configuration files.\n\nHere's an example of parsing the `score` key of the JSON output using `jq`:\n\n```\n$ sacrebleu -i output.detok.txt -t wmt17 -l en-de | jq -r .score\n20.8\n```\n\n# Installation\n\nInstall the official Python module from PyPI (**Python>=3.8 only**):\n\n    pip install sacrebleu\n\nIn order to install Japanese tokenizer support through `mecab-python3`, you need to run the\nfollowing command instead, to perform a full installation with dependencies:\n\n    pip install \"sacrebleu[ja]\"\n\nIn order to install Korean tokenizer support through `pymecab-ko`, you need to run the\nfollowing command instead, to perform a full installation with dependencies:\n\n    pip install \"sacrebleu[ko]\"\n\n# Command-line Usage\n\nYou can get a list of available test sets with `sacrebleu --list`. Please see [DATASETS.md](DATASETS.md)\nfor an up-to-date list of supported datasets. You can also list available test sets for a given language pair\nwith `sacrebleu --list -l en-fr`.\n\n## Basics\n\n### Downloading test sets\n\nDownloading is triggered when you request a test set. If the dataset is not available, it is downloaded\nand unpacked.\n\nE.g., you can use the following commands to download the source, pass it through your translation system\nin `translate.sh`, and then score it:\n\n```\n$ sacrebleu -t wmt17 -l en-de --echo src > wmt17.en-de.en\n$ cat wmt17.en-de.en | translate.sh | sacrebleu -t wmt17 -l en-de\n```\n\nSome test sets also have the outputs of systems that were submitted to the task.\nFor example, the `wmt/systems` test set.\n\n```bash\n$ sacrebleu -t wmt21/systems -l zh-en --echo NiuTrans\n```\n\nThis provides a convenient way to score:\n\n```bash\n$ sacrebleu -t wmt21/system -l zh-en --echo NiuTrans | sacrebleu -t wmt21/systems -l zh-en\n``\n\nYou can see a list of the available outputs by passing an invalid value to `--echo`.\n\n### JSON output\n\nAs of version `>=2.0.0`, sacreBLEU prints the computed scores in JSON format to make parsing less painful:\n\n```\n$ sacrebleu -i output.detok.txt -t wmt17 -l en-de\n```\n\n```json\n{\n \"name\": \"BLEU\",\n \"score\": 20.8,\n \"signature\": \"nrefs:1|case:mixed|eff:no|tok:13a|smooth:exp|version:2.0.0\",\n \"verbose_score\": \"54.4/26.6/14.9/8.7 (BP = 1.000 ratio = 1.026 hyp_len = 62880 ref_len = 61287)\",\n \"nrefs\": \"1\",\n \"case\": \"mixed\",\n \"eff\": \"no\",\n \"tok\": \"13a\",\n \"smooth\": \"exp\",\n \"version\": \"2.0.0\"\n}\n```\n\nIf you want to keep the old behavior, you can pass `-f text` or export `SACREBLEU_FORMAT=text`:\n\n```\n$ sacrebleu -i output.detok.txt -t wmt17 -l en-de -f text\nBLEU|nrefs:1|case:mixed|eff:no|tok:13a|smooth:exp|version:2.0.0 = 20.8 54.4/26.6/14.9/8.7 (BP = 1.000 ratio = 1.026 hyp_len = 62880 ref_len = 61287)\n```\n\n### Scoring\n\n(All examples below assume old-style text output for a compact representation that save space)\n\nLet's say that you just translated the `en-de` test set of WMT17 with your fancy MT system and the **detokenized** translations are in a file called `output.detok.txt`:\n\n```\n# Option 1: Redirect system output to STDIN\n$ cat output.detok.txt | sacrebleu -t wmt17 -l en-de\nBLEU|nrefs:1|case:mixed|eff:no|tok:13a|smooth:exp|version:2.0.0 = 20.8 54.4/26.6/14.9/8.7 (BP = 1.000 ratio = 1.026 hyp_len = 62880 ref_len = 61287)\n\n# Option 2: Use the --input/-i argument\n$ sacrebleu -t wmt17 -l en-de -i 
output.detok.txt\nBLEU|nrefs:1|case:mixed|eff:no|tok:13a|smooth:exp|version:2.0.0 = 20.8 54.4/26.6/14.9/8.7 (BP = 1.000 ratio = 1.026 hyp_len = 62880 ref_len = 61287)\n```\n\nYou can obtain a short version of the signature with `--short/-sh`:\n\n```\n$ sacrebleu -t wmt17 -l en-de -i output.detok.txt -sh\nBLEU|#:1|c:mixed|e:no|tok:13a|s:exp|v:2.0.0 = 20.8 54.4/26.6/14.9/8.7 (BP = 1.000 ratio = 1.026 hyp_len = 62880 ref_len = 61287)\n```\n\nIf you only want the score to be printed, you can use the `--score-only/-b` flag:\n\n```\n$ sacrebleu -t wmt17 -l en-de -i output.detok.txt -b\n20.8\n```\n\nThe precision of the scores can be configured via the `--width/-w` flag:\n\n```\n$ sacrebleu -t wmt17 -l en-de -i output.detok.txt -b -w 4\n20.7965\n```\n\n### Using your own reference file\n\nSacreBLEU knows about common test sets (as detailed in the `--list` example above), but you can also use it to score system outputs with arbitrary references. In this case, do not forget to provide **detokenized** reference and hypotheses files:\n\n```\n# Let's save the reference to a text file\n$ sacrebleu -t wmt17 -l en-de --echo ref > ref.detok.txt\n\n#\u00a0Option 1: Pass the reference file as a positional argument to sacreBLEU\n$ sacrebleu ref.detok.txt -i output.detok.txt -m bleu -b -w 4\n20.7965\n\n# Option 2: Redirect the system into STDIN (Compatible with multi-bleu.perl way of doing things)\n$ cat output.detok.txt | sacrebleu ref.detok.txt -m bleu -b -w 4\n20.7965\n```\n\n### Using multiple metrics\n\nLet's first compute BLEU, chrF and TER with the default settings:\n\n```\n$ sacrebleu -t wmt17 -l en-de -i output.detok.txt -m bleu chrf ter\n        BLEU|nrefs:1|case:mixed|eff:no|tok:13a|smooth:exp|version:2.0.0 = 20.8 <stripped>\n      chrF2|nrefs:1|case:mixed|eff:yes|nc:6|nw:0|space:no|version:2.0.0 = 52.0\nTER|nrefs:1|case:lc|tok:tercom|norm:no|punct:yes|asian:no|version:2.0.0 = 69.0\n```\n\nLet's now enable `chrF++` which is a revised version of chrF that takes into account word n-grams.\nObserve how the `nw:0` gets changed into `nw:2` in the signature:\n\n```\n$ sacrebleu -t wmt17 -l en-de -i output.detok.txt -m bleu chrf ter --chrf-word-order 2\n        BLEU|nrefs:1|case:mixed|eff:no|tok:13a|smooth:exp|version:2.0.0 = 20.8 <stripped>\n    chrF2++|nrefs:1|case:mixed|eff:yes|nc:6|nw:2|space:no|version:2.0.0 = 49.0\nTER|nrefs:1|case:lc|tok:tercom|norm:no|punct:yes|asian:no|version:2.0.0 = 69.0\n```\n\nMetric-specific arguments are detailed in the output of `--help`:\n\n```\nBLEU related arguments:\n  --smooth-method {none,floor,add-k,exp}, -s {none,floor,add-k,exp}\n                        Smoothing method: exponential decay, floor (increment zero counts), add-k (increment num/denom by k for n>1), or none. (Default: exp)\n  --smooth-value BLEU_SMOOTH_VALUE, -sv BLEU_SMOOTH_VALUE\n                        The smoothing value. Only valid for floor and add-k. (Defaults: floor: 0.1, add-k: 1)\n  --tokenize {none,zh,13a,char,intl,ja-mecab,ko-mecab}, -tok {none,zh,13a,char,intl,ja-mecab,ko-mecab}\n                        Tokenization method to use for BLEU. If not provided, defaults to `zh` for Chinese, `ja-mecab` for Japanese, `ko-mecab` for Korean and `13a` (mteval) otherwise.\n  --lowercase, -lc      If True, enables case-insensitivity. (Default: False)\n  --force               Insist that your tokenized input is actually detokenized.\n\nchrF related arguments:\n  --chrf-char-order CHRF_CHAR_ORDER, -cc CHRF_CHAR_ORDER\n                        Character n-gram order. 
(Default: 6)\n  --chrf-word-order CHRF_WORD_ORDER, -cw CHRF_WORD_ORDER\n                        Word n-gram order (Default: 0). If equals to 2, the metric is referred to as chrF++.\n  --chrf-beta CHRF_BETA\n                        Determine the importance of recall w.r.t precision. (Default: 2)\n  --chrf-whitespace     Include whitespaces when extracting character n-grams. (Default: False)\n  --chrf-lowercase      Enable case-insensitivity. (Default: False)\n  --chrf-eps-smoothing  Enables epsilon smoothing similar to chrF++.py, NLTK and Moses; instead of effective order smoothing. (Default: False)\n\nTER related arguments (The defaults replicate TERCOM's behavior):\n  --ter-case-sensitive  Enables case sensitivity (Default: False)\n  --ter-asian-support   Enables special treatment of Asian characters (Default: False)\n  --ter-no-punct        Removes punctuation. (Default: False)\n  --ter-normalized      Applies basic normalization and tokenization. (Default: False)\n```\n\n### Version Signatures\nAs you may have noticed, sacreBLEU generates version strings such as `BLEU|nrefs:1|case:mixed|eff:no|tok:13a|smooth:exp|version:2.0.0` for reproducibility reasons. It's strongly recommended to share these signatures in your papers!\n\n### Outputting other metadata\n\nSacrebleu knows about metadata for some test sets, and you can output it like this:\n\n```\n$ sacrebleu -t wmt21 -l en-de --echo src docid ref | head 2\nCouple MACED at California dog park for not wearing face masks while having lunch (VIDEO) - RT USA News\trt.com.131279\tPaar in Hundepark in Kalifornien mit Pfefferspray bespr\u00fcht, weil es beim Mittagessen keine Masken trug (VIDEO) - RT USA News\nThere's mask-shaming and then there's full on assault.\trt.com.131279\tMasken-Shaming ist eine Sache, K\u00f6rperverletzung eine andere.\n```\n\nIf multiple fields are requested, they are output as tab-separated columns (a TSV).\n\nTo see the available fields, add `--echo asdf` (or some other garbage data):\n\n```\n$ sacrebleu -t wmt21 -l en-de --echo asdf\nsacreBLEU: No such field asdf in test set wmt21 for language pair en-de.\nsacreBLEU: available fields for wmt21/en-de: src, ref:A, ref, docid, origlang\n```\n\n## Translationese Support\n\nIf you are interested in the translationese effect, you can evaluate BLEU on a subset of sentences\nwith a given original language (identified based on the `origlang` tag in the raw SGM files).\nE.g., to evaluate only against originally German sentences translated to English use:\n\n    $ sacrebleu -t wmt13 -l de-en --origlang=de -i my-wmt13-output.txt\n\nand to evaluate against the complement (in this case `origlang` en, fr, cs, ru, de) use:\n\n    $ sacrebleu -t wmt13 -l de-en --origlang=non-de -i my-wmt13-output.txt\n\n**Please note** that the evaluator will return a BLEU score only on the requested subset,\nbut it expects that you pass through the entire translated test set.\n\n## Languages & Preprocessing\n\n### BLEU\n\n- You can compute case-insensitive BLEU by passing `--lowercase` to sacreBLEU\n- The default tokenizer for BLEU is `13a` which mimics the `mteval-v13a` script from Moses.\n- Other tokenizers are:\n   - `none` which will not apply any kind of tokenization at all\n   - `char` for language-agnostic character-level tokenization\n   - `intl` applies international tokenization and mimics the `mteval-v14` script from Moses\n   - `zh` separates out **Chinese** characters and tokenizes the non-Chinese parts using `13a` tokenizer\n   - `ja-mecab` tokenizes **Japanese** inputs using the 
[MeCab](https://pypi.org/project/mecab-python3) morphological analyzer\n   - `ko-mecab` tokenizes **Korean** inputs using the [MeCab-ko](https://pypi.org/project/mecab-ko) morphological analyzer\n   - `flores101` and `flores200` use the SentencePiece models built from the Flores-101 and [Flores-200](https://github.com/facebookresearch/flores/blob/main/flores200/README.md#languages-in-flores-200) datasets, respectively. Note: the canonical .spm file will be automatically fetched if not found locally.\n- You can switch tokenizers using the `--tokenize` flag of sacreBLEU. Alternatively, if you provide language-pair strings\n  using `--language-pair/-l`, the `zh`, `ja-mecab` and `ko-mecab` tokenizers will be used if the target language is `zh`, `ja` or `ko`, respectively.\n- **Note that** there's no automatic language detection from the hypotheses, so you need to make sure that you are correctly\n  selecting the tokenizer for **Japanese**, **Korean** and **Chinese**.\n\n\nThe default `13a` tokenizer will produce poor results for Japanese:\n\n```\n$ sacrebleu kyoto-test.ref.ja -i kyoto-test.hyp.ja -b\n2.1\n```\n\nLet's use the `ja-mecab` tokenizer:\n```\n$ sacrebleu kyoto-test.ref.ja -i kyoto-test.hyp.ja --tokenize ja-mecab -b\n14.5\n```\n\nIf you provide the language-pair, sacreBLEU will use ja-mecab automatically:\n\n```\n$ sacrebleu kyoto-test.ref.ja -i kyoto-test.hyp.ja -l en-ja -b\n14.5\n```\n\n### chrF / chrF++\n\nchrF applies minimal to no pre-processing, as it operates on character n-grams:\n\n- If you pass `--chrf-whitespace`, whitespace characters will be preserved when computing character n-grams.\n- If you pass `--chrf-lowercase`, sacreBLEU will compute case-insensitive chrF.\n- If you enable non-zero `--chrf-word-order` (pass `2` for `chrF++`), a very simple punctuation tokenization will be applied internally.\n\n\n### TER\n\nTranslation Error Rate (TER) has its own special tokenizer that you can configure through the command line.\nThe defaults provided are **compatible with the upstream TER implementation (TERCOM)**, but you can nevertheless modify the\nbehavior:\n\n- TER is by default case-insensitive. Pass `--ter-case-sensitive` to enable case-sensitivity.\n- Pass `--ter-normalized` to apply a general Western tokenization.\n- Pass `--ter-asian-support` to enable the tokenization of Asian characters. If provided with `--ter-normalized`,\n  both will be applied.\n- Pass `--ter-no-punct` to strip punctuation.\n\n## Multi-reference Evaluation\n\nAll three metrics support the use of multiple references during evaluation. Let's first pass all references as positional arguments:\n\n```\n$ sacrebleu ref1 ref2 -i system -m bleu chrf ter\n        BLEU|nrefs:2|case:mixed|eff:no|tok:13a|smooth:exp|version:2.0.0 = 61.8 <stripped>\n      chrF2|nrefs:2|case:mixed|eff:yes|nc:6|nw:0|space:no|version:2.0.0 = 75.0\nTER|nrefs:2|case:lc|tok:tercom|norm:no|punct:yes|asian:no|version:2.0.0 = 31.2\n```\n\nAlternatively (though less recommended), references can also be concatenated using tabs as delimiters. Don't forget to pass `--num-refs/-nr` in this case!\n\n```\n$ paste ref1 ref2 > refs.tsv\n\n$ sacrebleu refs.tsv --num-refs 2 -i system -m bleu\nBLEU|nrefs:2|case:mixed|eff:no|tok:13a|smooth:exp|version:2.0.0 = 61.8 <stripped>\n```\n\n## Multi-system Evaluation\nAs of version `>=2.0.0`, SacreBLEU supports evaluation of an arbitrary number of systems for a particular\ntest set and language-pair. 
This has the advantage of seeing all results in a\nnicely formatted table.\n\nLet's pass all system output files that match the shell glob `newstest2017.online-*` to sacreBLEU for evaluation:\n\n```\n$ sacrebleu -t wmt17 -l en-de -i newstest2017.online-* -m bleu chrf\n\u2552\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2564\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2564\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2555\n\u2502                        System \u2502  BLEU  \u2502  chrF2  \u2502\n\u255e\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2561\n\u2502 newstest2017.online-A.0.en-de \u2502  20.8  \u2502  52.0   \u2502\n\u251c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u253c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u253c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2524\n\u2502 newstest2017.online-B.0.en-de \u2502  26.7  \u2502  56.3   \u2502\n\u251c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u253c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u253c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2524\n\u2502 newstest2017.online-F.0.en-de \u2502  15.5  \u2502  49.3   \u2502\n\u251c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u253c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u253c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2524\n\u2502 newstest2017.online-G.0.en-de \u2502  18.2  \u2502  51.6   \u2502\n\u2558\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2567\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2567\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u255b\n\n-----------------\nMetric signatures\n-----------------\n - BLEU       nrefs:1|case:mixed|eff:no|tok:13a|smooth:exp|version:2.0.0\n - chrF2      nrefs:1|case:mixed|eff:yes|nc:6|nw:0|space:no|version:2.0.0\n```\n\nYou can also change the output format to `latex`:\n\n```\n$ sacrebleu -t wmt17 -l en-de -i newstest2017.online-* -m bleu chrf -f latex\n\\begin{tabular}{rcc}\n\\toprule\n                        System &  BLEU  &  chrF2  \\\\\n\\midrule\n newstest2017.online-A.0.en-de &  20.8  &  52.0   \\\\\n newstest2017.online-B.0.en-de &  26.7  &  56.3   \\\\\n newstest2017.online-F.0.en-de &  15.5  &  49.3   \\\\\n newstest2017.online-G.0.en-de &  18.2  &  51.6   \\\\\n\\bottomrule\n\\end{tabular}\n\n...\n```\n\n## Confidence Intervals for Single System Evaluation\n\nWhen enabled with the `--confidence` flag, SacreBLEU will print\n(1) the actual system score, (2) the true mean estimated from bootstrap resampling and (3),\nthe 95% [confidence interval](https://en.wikipedia.org/wiki/Confidence_interval) around the 
mean.\nBy default, the number of bootstrap resamples is 1000 (`bs:1000` in the signature)\nand can be changed with `--confidence-n`:\n\n```\n$ sacrebleu -t wmt17 -l en-de -i output.detok.txt -m bleu chrf --confidence -f text --short\n   BLEU|#:1|bs:1000|rs:12345|c:mixed|e:no|tok:13a|s:exp|v:2.0.0 = 22.675 (\u03bc = 22.669 \u00b1 0.598) ...\nchrF2|#:1|bs:1000|rs:12345|c:mixed|e:yes|nc:6|nw:0|s:no|v:2.0.0 = 51.953 (\u03bc = 51.953 \u00b1 0.462)\n```\n\n**NOTE:** Although provided as a functionality, having access to confidence intervals for just one system\nmay not reveal much information about the underlying model. It often makes more sense to perform\n**paired statistical tests** across multiple systems.\n\n**NOTE:** When resampling, the seed of the `numpy`'s random number generator (RNG)\nis fixed to `12345`. If you want to relax this and set your own seed, you can\nexport the environment variable `SACREBLEU_SEED` to an integer. Alternatively, you can export\n`SACREBLEU_SEED=None` to skip initializing the RNG's seed and allow for non-deterministic\nbehavior.\n\n## Paired Significance Tests for Multi System Evaluation\nIdeally, one would have access to many systems in cases such as (1) investigating\nwhether a newly added feature yields significantly different scores than the baseline or\n(2) evaluating submissions for a particular shared task. SacreBLEU offers two different paired significance tests that are widely used in MT research.\n\n### Paired bootstrap resampling (--paired-bs)\n\nThis is an efficient implementation of the paper [Statistical Significance Tests for Machine Translation Evaluation](https://www.aclweb.org/anthology/W04-3250.pdf) and is result-compliant with the [reference Moses implementation](https://github.com/moses-smt/mosesdecoder/blob/master/scripts/analysis/bootstrap-hypothesis-difference-significance.pl). The number of bootstrap resamples can be changed with the `--paired-bs-n` flag and its default is 1000.\n\nWhen launched, paired bootstrap resampling will perform:\n - Bootstrap resampling to estimate 95% CI for all systems and the baseline\n - A significance test between the **baseline** and each **system** to compute a [p-value](https://en.wikipedia.org/wiki/P-value).\n\n### Paired approximate randomization (--paired-ar)\n\nPaired approximate randomization (AR) is another type of paired significance test that is claimed to be more accurate than paired bootstrap resampling when it comes to Type-I errors ([Riezler and Maxwell III, 2005](https://www.aclweb.org/anthology/W05-0908.pdf)). Type-I errors indicate failures to reject the null hypothesis when it is true. In other words, AR should in theory be more robust to subtle changes across systems.\n\nOur implementation is verified to be result-compliant with the [Multeval toolkit](https://github.com/jhclark/multeval) that also uses paired AR test for pairwise comparison. The number of approximate randomization trials is set to 10,000 by default. This can be changed with the `--paired-ar-n` flag.\n\n### Running the tests\n\n- The **first system** provided to `--input/-i` will be automatically taken as the **baseline system** against which you want to compare **other systems.**\n- When `--input/-i` is used, the system output files will be automatically named according to the file paths. For the sake of simplicity, SacreBLEU will automatically discard the **baseline system** if it also appears amongst **other systems**. 
This is useful if you would like to run the tool by passing `-i systems/baseline.txt systems/*.txt`. Here, the `baseline.txt` file will not be also considered as a candidate system.\n- Alternatively, you can also use a tab-separated input file redirected to SacreBLEU. In this case, the first column hypotheses will be taken as the **baseline system**. However, this method is **not recommended** as it won't allow naming your systems in a human-readable way. It will instead enumerate the systems from 1 to N following the column order in the tab-separated input.\n- On Linux and Mac OS X, you can launch the tests on multiple CPU's by passing the flag `--paired-jobs N`. If `N == 0`, SacreBLEU will launch one worker for each pairwise comparison. If `N > 0`, `N` worker processes will be spawned. This feature will substantially speed up the runtime especially if you want the **TER** metric to be computed.\n\n#### Example: Paired bootstrap resampling\nIn the example below, we select `newstest2017.LIUM-NMT.4900.en-de` as the baseline and compare it to 4 other WMT17 submissions using paired bootstrap resampling. According to the results, the null hypothesis (i.e. the two systems being essentially the same) could not be rejected (at the significance level of 0.05) for the following comparisons:\n\n- 0.1 BLEU difference between the baseline and the online-B system (p = 0.3077)\n\n```\n$ sacrebleu -t wmt17 -l en-de -i newstest2017.LIUM-NMT.4900.en-de newstest2017.online-* -m bleu chrf --paired-bs\n\u2552\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2564\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2564\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2555\n\u2502                                     System \u2502  BLEU (\u03bc \u00b1 95% CI)  \u2502  chrF2 (\u03bc \u00b1 95% CI)  \u2502\n\u255e\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2561\n\u2502 Baseline: newstest2017.LIUM-NMT.4900.en-de \u2502  26.6 (26.6 \u00b1 0.6)  \u2502  55.9 (55.9 \u00b1 0.5)   \u2502\n\u251c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u253c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u253c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2524\n\u2502              newstest2017.online-A.0.en-de \u2502  20.8 (20.8 \u00b1 0.6)  \u2502  52.0 (52.0 \u00b1 0.4)   \u2502\n\u2502                  
                          \u2502    (p = 0.0010)*    \u2502    (p = 0.0010)*     \u2502\n\u251c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u253c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u253c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2524\n\u2502              newstest2017.online-B.0.en-de \u2502  26.7 (26.6 \u00b1 0.7)  \u2502  56.3 (56.3 \u00b1 0.5)   \u2502\n\u2502                                            \u2502    (p = 0.3077)     \u2502    (p = 0.0240)*     \u2502\n\u251c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u253c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u253c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2524\n\u2502              newstest2017.online-F.0.en-de \u2502  15.5 (15.4 \u00b1 0.5)  \u2502  49.3 (49.3 \u00b1 0.4)   \u2502\n\u2502                                            \u2502    (p = 0.0010)*    \u2502    (p = 0.0010)*     \u2502\n\u251c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u253c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u253c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2524\n\u2502              newstest2017.online-G.0.en-de \u2502  18.2 (18.2 \u00b1 0.5)  \u2502  51.6 (51.6 \u00b1 0.4)   \u2502\n\u2502                                            \u2502    (p = 0.0010)*    \u2502    (p = 0.0010)*     \u2502\n\u2558\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2567\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2567\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u255b\n\n------------------------------------------------------------\nPaired bootstrap resampling test with 1000 resampling trials\n------------------------------------------------------------\n - Each system is pairwise compared to Baseline: newstest2017.LIUM-NMT.4900.en-de.\n   Actual system score / bootstrap estimated true mean / 95% CI are provided for each metric.\n\n - Null hypothesis: the system and the baseline translations are essentially\n   generated by the same underlying process. 
For a given system and the baseline,\n   the p-value is roughly the probability of the absolute score difference (delta)\n   or higher occurring due to chance, under the assumption that the null hypothesis is correct.\n\n - Assuming a significance threshold of 0.05, the null hypothesis can be rejected\n   for p-values < 0.05 (marked with \"*\"). This means that the delta is unlikely to be attributed\n   to chance, hence the system is significantly \"different\" than the baseline.\n   Otherwise, the p-values are highlighted in red.\n\n - NOTE: Significance does not tell whether a system is \"better\" than the baseline but rather\n   emphasizes the \"difference\" of the systems in terms of the replicability of the delta.\n\n-----------------\nMetric signatures\n-----------------\n - BLEU       nrefs:1|bs:1000|seed:12345|case:mixed|eff:no|tok:13a|smooth:exp|version:2.0.0\n - chrF2      nrefs:1|bs:1000|seed:12345|case:mixed|eff:yes|nc:6|nw:0|space:no|version:2.0.0\n```\n\n#### Example: Paired approximate randomization\n\nLet's now run the paired approximate randomization test for the same comparison. According to the results, the findings are compatible with the paired bootstrap resampling test. However, the p-value for the `baseline vs. online-B` comparison is much higher (`0.8066`) than the paired bootstrap resampling test.\n\n(**Note that** the AR test does not provide confidence intervals around the true mean as it does not perform bootstrap resampling.)\n\n```\n$ sacrebleu -t wmt17 -l en-de -i newstest2017.LIUM-NMT.4900.en-de newstest2017.online-* -m bleu chrf --paired-ar\n\u2552\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2564\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2564\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2555\n\u2502                                     System \u2502     BLEU      \u2502     chrF2     \u2502\n\u255e\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2561\n\u2502 Baseline: newstest2017.LIUM-NMT.4900.en-de \u2502     26.6      \u2502     55.9      \u2502\n\u251c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u253c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u253c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2524\n\u2502              newstest2017.online-A.0.en-de \u2502     20.8      \u2502     52.0      \u2502\n\u2502                                            \u2502 (p = 0.0001)* \u2502 (p = 0.0001)* 
\u2502\n\u251c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u253c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u253c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2524\n\u2502              newstest2017.online-B.0.en-de \u2502     26.7      \u2502     56.3      \u2502\n\u2502                                            \u2502 (p = 0.8066)  \u2502 (p = 0.0385)* \u2502\n\u251c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u253c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u253c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2524\n\u2502              newstest2017.online-F.0.en-de \u2502     15.5      \u2502     49.3      \u2502\n\u2502                                            \u2502 (p = 0.0001)* \u2502 (p = 0.0001)* \u2502\n\u251c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u253c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u253c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2524\n\u2502              newstest2017.online-G.0.en-de \u2502     18.2      \u2502     51.6      \u2502\n\u2502                                            \u2502 (p = 0.0001)* \u2502 (p = 0.0001)* \u2502\n\u2558\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2567\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2567\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u255b\n\n-------------------------------------------------------\nPaired approximate randomization test with 10000 trials\n-------------------------------------------------------\n - Each system is pairwise compared to Baseline: newstest2017.LIUM-NMT.4900.en-de.\n   Actual system score is provided for each metric.\n\n - Null hypothesis: the system and the baseline translations are essentially\n   generated by the same underlying process. For a given system and the baseline,\n   the p-value is roughly the probability of the absolute score difference (delta)\n   or higher occurring due to chance, under the assumption that the null hypothesis is correct.\n\n - Assuming a significance threshold of 0.05, the null hypothesis can be rejected\n   for p-values < 0.05 (marked with \"*\"). 
This means that the delta is unlikely to be attributed\n   to chance, hence the system is significantly \"different\" than the baseline.\n   Otherwise, the p-values are highlighted in red.\n\n - NOTE: Significance does not tell whether a system is \"better\" than the baseline but rather\n   emphasizes the \"difference\" of the systems in terms of the replicability of the delta.\n\n-----------------\nMetric signatures\n-----------------\n - BLEU       nrefs:1|ar:10000|seed:12345|case:mixed|eff:no|tok:13a|smooth:exp|version:2.0.0\n - chrF2      nrefs:1|ar:10000|seed:12345|case:mixed|eff:yes|nc:6|nw:0|space:no|version:2.0.0\n```\n\n# Using SacreBLEU from Python\n\nFor evaluation, it may be useful to compute BLEU, chrF or TER from a Python script. The recommended\nway of doing this is to use the object-oriented API, by creating an instance of the `metrics.BLEU` class\nfor example:\n\n```python\nIn [1]: from sacrebleu.metrics import BLEU, CHRF, TER\n   ...:\n   ...: refs = [ # First set of references\n   ...:          ['The dog bit the man.', 'It was not unexpected.', 'The man bit him first.'],\n   ...:          # Second set of references\n   ...:          ['The dog had bit the man.', 'No one was surprised.', 'The man had bitten the dog.'],\n   ...:        ]\n   ...: sys = ['The dog bit the man.', \"It wasn't surprising.\", 'The man had just bitten him.']\n\nIn [2]: bleu = BLEU()\n\nIn [3]: bleu.corpus_score(sys, refs)\nOut[3]: BLEU = 48.53 82.4/50.0/45.5/37.5 (BP = 0.943 ratio = 0.944 hyp_len = 17 ref_len = 18)\n\nIn [4]: bleu.get_signature()\nOut[4]: nrefs:2|case:mixed|eff:no|tok:13a|smooth:exp|version:2.0.0\n\nIn [5]: chrf = CHRF()\n\nIn [6]: chrf.corpus_score(sys, refs)\nOut[6]: chrF2 = 59.73\n```\n\n### Variable Number of References\n\nLet's now remove the first reference sentence for the first system sentence `The dog bit the man.` by replacing it with either `None` or the empty string `''`.\nThis allows using a variable number of reference segments per hypothesis. 
Observe how the signature changes from `nrefs:2` to `nrefs:var`:\n\n```python\nIn [1]: from sacrebleu.metrics import BLEU, CHRF, TER\n   ...:\n   ...: refs = [ # First set of references\n                 # 1st sentence does not have a ref here\n   ...:          ['', 'It was not unexpected.', 'The man bit him first.'],\n   ...:          # Second set of references\n   ...:          ['The dog had bit the man.', 'No one was surprised.', 'The man had bitten the dog.'],\n   ...:        ]\n   ...: sys = ['The dog bit the man.', \"It wasn't surprising.\", 'The man had just bitten him.']\n\nIn [2]: bleu = BLEU()\n\nIn [3]: bleu.corpus_score(sys, refs)\nOut[3]: BLEU = 29.44 82.4/42.9/27.3/12.5 (BP = 0.889 ratio = 0.895 hyp_len = 17 ref_len = 19)\n\nIn [4]: bleu.get_signature()\nOut[4]: nrefs:var|case:mixed|eff:no|tok:13a|smooth:exp|version:2.0.0\n```\n\n## Compatibility API\n\nYou can also use the compatibility API that provides wrapper functions around the object-oriented API to\ncompute sentence-level and corpus-level BLEU, chrF and TER: (It should be noted that this API can be\nremoved in future releases)\n\n```python\nIn [1]: import sacrebleu\n   ...: \n   ...: refs = [ # First set of references\n   ...:          ['The dog bit the man.', 'It was not unexpected.', 'The man bit him first.'],\n   ...:          # Second set of references\n   ...:          ['The dog had bit the man.', 'No one was surprised.', 'The man had bitten the dog.'],\n   ...:        ]\n   ...: sys = ['The dog bit the man.', \"It wasn't surprising.\", 'The man had just bitten him.']\n\nIn [2]: sacrebleu.corpus_bleu(sys, refs)\nOut[2]: BLEU = 48.53 82.4/50.0/45.5/37.5 (BP = 0.943 ratio = 0.944 hyp_len = 17 ref_len = 18)\n```\n\n# License\n\nSacreBLEU is licensed under the [Apache 2.0 License](LICENSE.txt).\n\n# Credits\n\nThis was all [Rico Sennrich's idea](https://twitter.com/RicoSennrich/status/883246242763026433)\nOriginally written by Matt Post.\nNew features and ongoing support provided by Martin Popel (@martinpopel) and Ozan Caglayan (@ozancaglayan).\n\nIf you use SacreBLEU, please cite the following:\n\n```\n@inproceedings{post-2018-call,\n  title = \"A Call for Clarity in Reporting {BLEU} Scores\",\n  author = \"Post, Matt\",\n  booktitle = \"Proceedings of the Third Conference on Machine Translation: Research Papers\",\n  month = oct,\n  year = \"2018\",\n  address = \"Belgium, Brussels\",\n  publisher = \"Association for Computational Linguistics\",\n  url = \"https://www.aclweb.org/anthology/W18-6319\",\n  pages = \"186--191\",\n}\n```\n\n# Release Notes\n\nPlease see [CHANGELOG.md](CHANGELOG.md) for release notes.\n",
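As a brief addendum to the Python API examples in the description above: sentence-level scores can be obtained in much the same way as corpus-level ones. The sketch below assumes that the `BLEU` and `CHRF` classes also expose a `sentence_score(hypothesis, references)` method and that `BLEU` accepts an `effective_order` flag; verify both against your installed sacrebleu version before relying on them.

```python
# Minimal sketch: sentence-level scoring with sacrebleu's object-oriented API.
# Assumes BLEU/CHRF provide sentence_score(hypothesis, references) and that
# BLEU accepts effective_order; check your installed sacrebleu version.
from sacrebleu.metrics import BLEU, CHRF

# References for a single hypothesis sentence (plain, detokenized strings).
refs = ['The dog bit the man.', 'The dog had bit the man.']
hyp = 'The dog bit the man.'

# effective_order=True is the usual choice for sentence-level BLEU, so that
# missing higher-order n-gram matches do not zero out the score.
bleu = BLEU(effective_order=True)
chrf = CHRF()

print(bleu.sentence_score(hyp, refs))  # e.g. BLEU = 100.00 ...
print(chrf.sentence_score(hyp, refs))
print(bleu.get_signature())            # share the signature for reproducibility
```

As with the corpus-level examples in the description, the hypothesis is a single detokenized string and the references are a list of detokenized strings for that same sentence.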
    "bugtrack_url": null,
    "license": "Apache License Version 2.0, January 2004 http://www.apache.org/licenses/  TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION  1. Definitions.  \"License\" shall mean the terms and conditions for use, reproduction, and distribution as defined by Sections 1 through 9 of this document.  \"Licensor\" shall mean the copyright owner or entity authorized by the copyright owner that is granting the License.  \"Legal Entity\" shall mean the union of the acting entity and all other entities that control, are controlled by, or are under common control with that entity. For the purposes of this definition, \"control\" means (i) the power, direct or indirect, to cause the direction or management of such entity, whether by contract or otherwise, or (ii) ownership of fifty percent (50%) or more of the outstanding shares, or (iii) beneficial ownership of such entity.  \"You\" (or \"Your\") shall mean an individual or Legal Entity exercising permissions granted by this License.  \"Source\" form shall mean the preferred form for making modifications, including but not limited to software source code, documentation source, and configuration files.  \"Object\" form shall mean any form resulting from mechanical transformation or translation of a Source form, including but not limited to compiled object code, generated documentation, and conversions to other media types.  \"Work\" shall mean the work of authorship, whether in Source or Object form, made available under the License, as indicated by a copyright notice that is included in or attached to the work (an example is provided in the Appendix below).  \"Derivative Works\" shall mean any work, whether in Source or Object form, that is based on (or derived from) the Work and for which the editorial revisions, annotations, elaborations, or other modifications represent, as a whole, an original work of authorship. For the purposes of this License, Derivative Works shall not include works that remain separable from, or merely link (or bind by name) to the interfaces of, the Work and Derivative Works thereof.  \"Contribution\" shall mean any work of authorship, including the original version of the Work and any modifications or additions to that Work or Derivative Works thereof, that is intentionally submitted to Licensor for inclusion in the Work by the copyright owner or by an individual or Legal Entity authorized to submit on behalf of the copyright owner. For the purposes of this definition, \"submitted\" means any form of electronic, verbal, or written communication sent to the Licensor or its representatives, including but not limited to communication on electronic mailing lists, source code control systems, and issue tracking systems that are managed by, or on behalf of, the Licensor for the purpose of discussing and improving the Work, but excluding communication that is conspicuously marked or otherwise designated in writing by the copyright owner as \"Not a Contribution.\"  \"Contributor\" shall mean Licensor and any individual or Legal Entity on behalf of whom a Contribution has been received by Licensor and subsequently incorporated within the Work.  2. Grant of Copyright License. 
Subject to the terms and conditions of this License, each Contributor hereby grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable copyright license to reproduce, prepare Derivative Works of, publicly display, publicly perform, sublicense, and distribute the Work and such Derivative Works in Source or Object form.  3. Grant of Patent License. Subject to the terms and conditions of this License, each Contributor hereby grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable (except as stated in this section) patent license to make, have made, use, offer to sell, sell, import, and otherwise transfer the Work, where such license applies only to those patent claims licensable by such Contributor that are necessarily infringed by their Contribution(s) alone or by combination of their Contribution(s) with the Work to which such Contribution(s) was submitted. If You institute patent litigation against any entity (including a cross-claim or counterclaim in a lawsuit) alleging that the Work or a Contribution incorporated within the Work constitutes direct or contributory patent infringement, then any patent licenses granted to You under this License for that Work shall terminate as of the date such litigation is filed.  4. Redistribution. You may reproduce and distribute copies of the Work or Derivative Works thereof in any medium, with or without modifications, and in Source or Object form, provided that You meet the following conditions:  (a) You must give any other recipients of the Work or Derivative Works a copy of this License; and  (b) You must cause any modified files to carry prominent notices stating that You changed the files; and  (c) You must retain, in the Source form of any Derivative Works that You distribute, all copyright, patent, trademark, and attribution notices from the Source form of the Work, excluding those notices that do not pertain to any part of the Derivative Works; and  (d) If the Work includes a \"NOTICE\" text file as part of its distribution, then any Derivative Works that You distribute must include a readable copy of the attribution notices contained within such NOTICE file, excluding those notices that do not pertain to any part of the Derivative Works, in at least one of the following places: within a NOTICE text file distributed as part of the Derivative Works; within the Source form or documentation, if provided along with the Derivative Works; or, within a display generated by the Derivative Works, if and wherever such third-party notices normally appear. The contents of the NOTICE file are for informational purposes only and do not modify the License. You may add Your own attribution notices within Derivative Works that You distribute, alongside or as an addendum to the NOTICE text from the Work, provided that such additional attribution notices cannot be construed as modifying the License.  You may add Your own copyright statement to Your modifications and may provide additional or different license terms and conditions for use, reproduction, or distribution of Your modifications, or for any such Derivative Works as a whole, provided Your use, reproduction, and distribution of the Work otherwise complies with the conditions stated in this License.  5. Submission of Contributions. 
Unless You explicitly state otherwise, any Contribution intentionally submitted for inclusion in the Work by You to the Licensor shall be under the terms and conditions of this License, without any additional terms or conditions. Notwithstanding the above, nothing herein shall supersede or modify the terms of any separate license agreement you may have executed with Licensor regarding such Contributions.  6. Trademarks. This License does not grant permission to use the trade names, trademarks, service marks, or product names of the Licensor, except as required for reasonable and customary use in describing the origin of the Work and reproducing the content of the NOTICE file.  7. Disclaimer of Warranty. Unless required by applicable law or agreed to in writing, Licensor provides the Work (and each Contributor provides its Contributions) on an \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied, including, without limitation, any warranties or conditions of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A PARTICULAR PURPOSE. You are solely responsible for determining the appropriateness of using or redistributing the Work and assume any risks associated with Your exercise of permissions under this License.  8. Limitation of Liability. In no event and under no legal theory, whether in tort (including negligence), contract, or otherwise, unless required by applicable law (such as deliberate and grossly negligent acts) or agreed to in writing, shall any Contributor be liable to You for damages, including any direct, indirect, special, incidental, or consequential damages of any character arising as a result of this License or out of the use or inability to use the Work (including but not limited to damages for loss of goodwill, work stoppage, computer failure or malfunction, or any and all other commercial damages or losses), even if such Contributor has been advised of the possibility of such damages.  9. Accepting Warranty or Additional Liability. While redistributing the Work or Derivative Works thereof, You may choose to offer, and charge a fee for, acceptance of support, warranty, indemnity, or other liability obligations and/or rights consistent with this License. However, in accepting such obligations, You may act only on Your own behalf and on Your sole responsibility, not on behalf of any other Contributor, and only if You agree to indemnify, defend, and hold each Contributor harmless for any liability incurred by, or claims asserted against, such Contributor by reason of your accepting any such warranty or additional liability.  END OF TERMS AND CONDITIONS  APPENDIX: How to apply the Apache License to your work.  To apply the Apache License to your work, attach the following boilerplate notice, with the fields enclosed by brackets \"[]\" replaced with your own identifying information. (Don't include the brackets!)  The text should be enclosed in the appropriate comment syntax for the file format. We also recommend that a file or class name and description of purpose be included on the same \"printed page\" as the copyright notice for easier identification within third-party archives.  Copyright [yyyy] [name of copyright owner]  Licensed under the Apache License, Version 2.0 (the \"License\"); you may not use this file except in compliance with the License. 
You may obtain a copy of the License at  http://www.apache.org/licenses/LICENSE-2.0  Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. ",
    "summary": "Hassle-free computation of shareable, comparable, and reproducible BLEU, chrF, and TER scores",
    "version": "2.4.3",
    "project_urls": {
        "Repository": "https://github.com/mjpost/sacrebleu"
    },
    "split_keywords": [
        "machine translation",
        " evaluation",
        " nlp",
        " natural language processing",
        " computational linguistics"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "15d8e51d35bc863caa19ddeae48dfb890581a19326973ad1c9fa5dcfc63310f7",
                "md5": "5f7f1f635d551a919a3606dbc4eff347",
                "sha256": "a976fd6998d8ced267a722120ec7fc47083c8e9745d8808ccee6424464a0aa31"
            },
            "downloads": -1,
            "filename": "sacrebleu-2.4.3-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "5f7f1f635d551a919a3606dbc4eff347",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.8",
            "size": 103964,
            "upload_time": "2024-08-17T14:36:06",
            "upload_time_iso_8601": "2024-08-17T14:36:06.727414Z",
            "url": "https://files.pythonhosted.org/packages/15/d8/e51d35bc863caa19ddeae48dfb890581a19326973ad1c9fa5dcfc63310f7/sacrebleu-2.4.3-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "17719bec1dfed1ee74dc477666d236ac38976e36f847150f03b55b338874d26e",
                "md5": "4a9fd34600ccaa9b06e71b02ea1b4cae",
                "sha256": "e734b1e0baeaea6ade0fefc9d23bac3df50bf15775d8b78edc108db63654192a"
            },
            "downloads": -1,
            "filename": "sacrebleu-2.4.3.tar.gz",
            "has_sig": false,
            "md5_digest": "4a9fd34600ccaa9b06e71b02ea1b4cae",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.8",
            "size": 1896720,
            "upload_time": "2024-08-17T14:36:09",
            "upload_time_iso_8601": "2024-08-17T14:36:09.247814Z",
            "url": "https://files.pythonhosted.org/packages/17/71/9bec1dfed1ee74dc477666d236ac38976e36f847150f03b55b338874d26e/sacrebleu-2.4.3.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-08-17 14:36:09",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "mjpost",
    "github_project": "sacrebleu",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "tox": true,
    "lcname": "sacrebleu"
}
        