<h1 align="center">MeetEval</h1>
<h3 align="center">A meeting transcription evaluation toolkit</h3>
<div align="center"><a href="#features">Features</a> | <a href="#installation">Installation</a> | <a href="#python-interface">Python Interface</a> | <a href="#command-line-interface">Command Line Interface</a> | <a href="#visualization">Visualization</a> | <a href="#cite">Cite</a></div>
<br>
<a href="https://github.com/fgnt/meeteval/actions"><img src="https://github.com/fgnt/meeteval/actions/workflows/pytest.yml/badge.svg"/></a>
<a href="https://pypi.org/project/meeteval/"><img src="https://img.shields.io/pypi/v/meeteval"/></a>
## Features
MeetEval supports the following metrics for meeting transcription evaluation:
- **Standard WER** for single utterances (called SISO WER in MeetEval)<br>
`meeteval-wer wer -r ref -h hyp`
- **Concatenated minimum-Permutation Word Error Rate (cpWER)**<br>
`meeteval-wer cpwer -r ref.stm -h hyp.stm`
- **Optimal Reference Combination Word Error Rate (ORC WER)**<br>
`meeteval-wer orcwer -r ref.stm -h hyp.stm`
- **Fast Greedy Approximation of Optimal Reference Combination Word Error Rate (greedy ORC WER)**<br>
`meeteval-wer greedy_orcwer -r ref.stm -h hyp.stm`
- **Multi-speaker-input multi-stream-output Word Error Rate (MIMO WER)**<br>
`meeteval-wer mimower -r ref.stm -h hyp.stm`
- **Time-Constrained minimum-Permutation Word Error Rate (tcpWER)**<br>
`meeteval-wer tcpwer -r ref.stm -h hyp.stm --collar 5`
- **Time-Constrained Optimal Reference Combination Word Error Rate (tcORC WER)**<br>
`meeteval-wer tcorcwer -r ref.stm -h hyp.stm --collar 5`
- **Fast Greedy Approximation of Time-Constrained Optimal Reference Combination Word Error Rate (greedy tcORC WER)**<br>
`meeteval-wer greedy_tcorcwer -r ref.stm -h hyp.stm --collar 5`
- **Diarization-Invariant cpWER (DI-cpWER)**<br>
`meeteval-wer greedy_dicpwer -r ref.stm -h hyp.stm`
- **Diarization Error Rate (DER)** by wrapping [mdeval](https://github.com/nryant/dscore/raw/master/scorelib/md-eval-22.pl)<br>
`meeteval-der md_eval_22 -r ref.stm -h hyp.stm --collar .25`
Additionally, MeetEval contains a [visualization](#visualization) tool for cpWER and tcpWER alignments that helps to spot errors in system outputs.
## Installation
### From PyPI
```shell
pip install meeteval
```
### From source
```shell
git clone https://github.com/fgnt/meeteval
pip install -e ./meeteval
```
## Command-line interface
`MeetEval` supports the following file formats as input:
- [Segmental Time Mark](https://github.com/usnistgov/SCTK/blob/master/doc/infmts.htm#L75) (`STM`)
- [Time Marked Conversation](https://github.com/usnistgov/SCTK/blob/master/doc/infmts.htm#L286) (`CTM`)
- [SEGment-wise Long-form Speech Transcription annotation](#segment-wise-long-form-speech-transcription-annotation-seglst) (`SegLST`), the file format used in the [CHiME challenges](https://www.chimechallenge.org)
- [Rich Transcription Time Marked](https://github.com/nryant/dscore?tab=readme-ov-file#rttm) (`RTTM`) files (only for Diarization Error Rate)
> [!NOTE]
> `MeetEval` does not support alternate transcripts (e.g., `"i've { um / uh / @ } as far as i'm concerned"`).
The command-line interface is available as `meeteval-wer` or `python -m meeteval.wer` with the following signature:
```shell
python -m meeteval.wer [orcwer|mimower|cpwer|tcpwer|tcorcwer] -h example_files/hyp.stm -r example_files/ref.stm
# or
meeteval-wer [orcwer|mimower|cpwer|tcpwer|tcorcwer] -h example_files/hyp.stm -r example_files/ref.stm
```
You can add `--help` to any command to get more information about the available options.
The command name (`orcwer`, `mimower`, `cpwer`, `tcpwer`, ...) selects the metric to use.
By default, the hypothesis file name is used as a template for the output files: the averaged results (e.g., `hypothesis.json`) and the per-session ("per reco") results (`hypothesis_per_reco.json`).
Both paths can be changed with `--average-out` and `--per-reco-out`; `.json` and `.yaml` are the supported suffixes.
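For example, a minimal sketch that writes both result files as YAML (the output file names are placeholders):
```shell
# Hypothetical output names; the flags and the .yaml suffix are described above
meeteval-wer cpwer -r ref.stm -h hyp.stm \
    --average-out averaged.yaml --per-reco-out per_reco.yaml
```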
More examples can be found in [tests/test_cli.py](tests/test_cli.py).
### File Formats
#### SEGment-wise Long-form Speech Transcription annotation (SegLST)
The SegLST format was used in the [CHiME-7 challenge](https://www.chimechallenge.org/challenges/chime7/task1/index) and is the default format for `MeetEval`.
The SegLST format is stored in JSON format and contains a list of segments.
Each segment should have a minimum set of keys `"session_id"` and `"words"`.
Depending on the metric, additional keys may be required (`"speaker"`, `"start_time"`, `"end_time"`).
An example is shown below:
```python
[
{
"session_id": "recordingA", # Required
"words": "The quick brown fox jumps over the lazy dog", # Required for WER metrics
"speaker": "Alice", # Required for metrics that use speaker information (cpWER, ORC WER, MIMO WER)
"start_time": 0, # Required for time-constrained metrics (tcpWER, tcORC-WER, DER, ...)
"end_time": 1, # Required for time-constrained metrics (tcpWER, tcORC-WER, DER, ...)
"audio_path": "path/to/recordingA.wav" # Any additional keys can be included
},
...
]
```
Another example can be found [here](example_files/hyp.seglst.json).
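SegLST files can be loaded with `meeteval.io.load` (the same function used in the high-level interface below). A small sketch; that `load` dispatches on the `.seglst.json` suffix and that `groupby` accepts the `"session_id"` key are assumptions based on the examples in this README:

```python
import meeteval

# Load the SegLST example file linked above
segments = meeteval.io.load('example_files/hyp.seglst.json')

# Assumption: groupby accepts the "session_id" key shown above
# (the visualization example below uses groupby('filename') for STM)
per_session = segments.groupby('session_id')
```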
#### [Segmental Time Mark (STM)](https://github.com/usnistgov/SCTK/blob/master/doc/infmts.htm#L75)
Each line in an `STM` file represents one "utterance" and is defined as
```
STM :== <filename> <channel> <speaker_id> <begin_time> <end_time> <transcript>
```
where
- `filename`: name of the recording
- `channel`: ignored by MeetEval
- `speaker_id`: ID of the speaker or system output stream/channel (not microphone channel)
- `begin_time`: in seconds, used to find the order of the utterances
- `end_time`: in seconds
- `transcript`: space-separated list of words
for example:
```STM
recording1 1 Alice 0 0 Hello Bob.
recording1 1 Bob 1 0 Hello Alice.
recording1 1 Alice 2 0 How are you?
recording2 1 Alice 0 0 Hello Carol.
;; ...
```
An example `STM` file can be found [here](example_files/ref.stm).
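`STM` data can also be created in Python with `meeteval.io.STM.parse`, which appears again in the interface examples below; a small sketch with the lines from above:

```python
import meeteval

# Parse the example lines shown above
stm = meeteval.io.STM.parse('''
recording1 1 Alice 0 0 Hello Bob.
recording1 1 Bob 1 0 Hello Alice.
recording2 1 Alice 0 0 Hello Carol.
''')

# Group by recording name, as in the visualization example below
per_recording = stm.groupby('filename')
```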
#### [Time Marked Conversation (CTM)](https://github.com/usnistgov/SCTK/blob/master/doc/infmts.htm#L286)
The CTM format is defined as
```
CTM :== <filename> <channel> <begin_time> <duration> <word> [<confidence>]
```
for the hypothesis (one file per speaker).
Since `CTM` files don't encode speaker or system output stream information (the `channel` field has a different meaning: left or right microphone), you have to supply one `CTM` file per system output channel using multiple `-h` arguments.
For example:
```shell
meeteval-wer orcwer -h hyp1.ctm -h hyp2.ctm -r reference.stm
```
> [!NOTE]
> Note that the `LibriCSS` baseline recipe produces a single `CTM` file that merges all speakers, so it cannot be used directly. We recommend using `STM` or `SegLST` files.
## Python interface
For all metrics, a [low-level](#low-level-interface) and a [high-level](#high-level-interface) interface are available.
> [!TIP]
> Use the [high-level](#high-level-interface) interface for computing metrics over a full dataset. <br>
> Use the [low-level](#low-level-interface) interface for computing metrics for single examples or when your data is represented as Python structures, e.g., nested lists of strings.
### Low-level interface
All WERs have a low-level interface in the `meeteval.wer` module that allows computing the WER for single examples.
The functions take the reference and hypothesis as input and return an `ErrorRate` object.
The `ErrorRate` bundles statistics (errors, total number of words) and potential auxiliary information (e.g., assignment for ORC WER) together with the WER.
```python
import meeteval
# SISO WER
wer = meeteval.wer.wer.siso.siso_word_error_rate(
reference='The quick brown fox jumps over the lazy dog',
hypothesis='The kwick brown fox jump over lazy '
)
print(wer)
# ErrorRate(error_rate=0.4444444444444444, errors=4, length=9, insertions=0, deletions=2, substitutions=2)
# cpWER
wer = meeteval.wer.wer.cp.cp_word_error_rate(
reference=['The quick brown fox', 'jumps over the lazy dog'],
hypothesis=['The kwick brown fox', 'jump over lazy ']
)
print(wer)
# CPErrorRate(error_rate=0.4444444444444444, errors=4, length=9, insertions=0, deletions=2, substitutions=2, missed_speaker=0, falarm_speaker=0, scored_speaker=2, assignment=((0, 0), (1, 1)))
# ORC-WER
wer = meeteval.wer.wer.orc.orc_word_error_rate(
reference=['The quick brown fox', 'jumps over the lazy dog'],
hypothesis=['The kwick brown fox', 'jump over lazy ']
)
print(wer)
# OrcErrorRate(error_rate=0.4444444444444444, errors=4, length=9, insertions=0, deletions=2, substitutions=2, assignment=(0, 1))
```
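The fields shown in the `repr` outputs above are attributes of the returned objects, so results can be post-processed directly; a minimal sketch using the ORC WER result from above:

```python
# Attribute names are taken from the repr outputs above
print(wer.error_rate)           # 0.4444444444444444
print(wer.errors, wer.length)   # 4 9
print(wer.assignment)           # (0, 1): stream assigned to each reference utterance
```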
The input format can be a (list of) strings or an object representing a file format from `meeteval.io`:
```python
import meeteval
wer = meeteval.wer.wer.cp.cp_word_error_rate(
reference = meeteval.io.STM.parse('recordingA 1 Alice 0 1 The quick brown fox jumps over the lazy dog'),
hypothesis = meeteval.io.STM.parse('recordingA 1 spk-1 0 1 The kwick brown fox jump over lazy ')
)
print(wer)
# CPErrorRate(error_rate=0.4444444444444444, errors=4, length=9, insertions=0, deletions=2, substitutions=2, reference_self_overlap=SelfOverlap(overlap_rate=Decimal('0'), overlap_time=0, total_time=Decimal('1')), hypothesis_self_overlap=SelfOverlap(overlap_rate=Decimal('0'), overlap_time=0, total_time=Decimal('1')), missed_speaker=0, falarm_speaker=0, scored_speaker=1, assignment=(('Alice', 'spk-1'), ))
```
All low-level interfaces come with a single-example function (as shown above) and a batch function that computes the WER for multiple examples at once.
The batch function is suffixed with `_multifile` and behaves like the high-level interface, but without the flexible input format handling.
To compute the average over multiple `ErrorRate`s, use `meeteval.wer.combine_error_rates`.
Note that the combined WER is _not_ the mean of the individual error rates, but the rate obtained by summing the errors and lengths of all examples; in the example below, rates of 2/6 and 2/2 combine to (2+2)/(6+2) = 0.5 rather than the mean of roughly 0.67.
`combine_error_rates` also discards any information that cannot be aggregated over multiple examples (such as the ORC WER assignment).
For example with the cpWER:
```python
import meeteval
wers = meeteval.wer.wer.cp.cp_word_error_rate_multifile(
reference={
'recordingA': {'speakerA': 'First example', 'speakerB': 'First example second speaker'},
'recordingB': {'speakerA': 'Second example'},
},
hypothesis={
'recordingA': ['First example with errors', 'First example second speaker'],
'recordingB': ['Second example', 'Overestimated speaker'],
}
)
print(wers)
# {
# 'recordingA': CPErrorRate(error_rate=0.3333333333333333, errors=2, length=6, insertions=2, deletions=0, substitutions=0, missed_speaker=0, falarm_speaker=0, scored_speaker=2, assignment=(('speakerA', 0), ('speakerB', 1))),
# 'recordingB': CPErrorRate(error_rate=1.0, errors=2, length=2, insertions=2, deletions=0, substitutions=0, missed_speaker=0, falarm_speaker=1, scored_speaker=1, assignment=(('speakerA', 0), (None, 1)))
# }
# Use combine_error_rates to compute an "overall" WER over multiple examples
avg = meeteval.wer.combine_error_rates(wers)
print(avg)
# CPErrorRate(error_rate=0.5, errors=4, length=8, insertions=4, deletions=0, substitutions=0, missed_speaker=0, falarm_speaker=1, scored_speaker=3)
```
### High-level interface
All WERs have a high-level Python interface available directly in the `meeteval.wer` module that mirrors the [Command-line interface](#command-line-interface) and accepts the formats from `meeteval.io` as input.
All of these functions require the input to contain session IDs and return a dict mapping each session ID to the result for that session:
```python
import meeteval
# File Paths
wers = meeteval.wer.tcpwer('example_files/ref.stm', 'example_files/hyp.stm', collar=5)
# Loaded files
wers = meeteval.wer.tcpwer(meeteval.io.load('example_files/ref.stm'), meeteval.io.load('example_files/hyp.stm'), collar=5)
# Objects
wers = meeteval.wer.tcpwer(
reference=meeteval.io.STM.parse('''
recordingA 1 Alice 0 1 The quick brown fox jumps over the lazy dog
recordingB 1 Bob 0 1 The quick brown fox jumps over the lazy dog
'''),
hypothesis=meeteval.io.STM.parse('''
recordingA 1 spk-1 0 1 The kwick brown fox jump over lazy
recordingB 1 spk-1 0 1 The kwick brown fox jump over lazy
'''),
collar=5,
)
print(wers)
# {
# 'recordingA': CPErrorRate(error_rate=0.4444444444444444, errors=4, length=9, insertions=0, deletions=2, substitutions=2, reference_self_overlap=SelfOverlap(overlap_rate=Decimal('0'), overlap_time=0, total_time=Decimal('1')), hypothesis_self_overlap=SelfOverlap(overlap_rate=Decimal('0'), overlap_time=0, total_time=Decimal('1')), missed_speaker=0, falarm_speaker=0, scored_speaker=1, assignment=(('Alice', 'spk-1'),)),
# 'recordingB': CPErrorRate(error_rate=0.4444444444444444, errors=4, length=9, insertions=0, deletions=2, substitutions=2, reference_self_overlap=SelfOverlap(overlap_rate=Decimal('0'), overlap_time=0, total_time=Decimal('1')), hypothesis_self_overlap=SelfOverlap(overlap_rate=Decimal('0'), overlap_time=0, total_time=Decimal('1')), missed_speaker=0, falarm_speaker=0, scored_speaker=1, assignment=(('Bob', 'spk-1'),))
# }
avg = meeteval.wer.combine_error_rates(wers)
print(avg)
# CPErrorRate(error_rate=0.4444444444444444, errors=8, length=18, insertions=0, deletions=4, substitutions=4, reference_self_overlap=SelfOverlap(overlap_rate=Decimal('0'), overlap_time=0, total_time=Decimal('2')), hypothesis_self_overlap=SelfOverlap(overlap_rate=Decimal('0'), overlap_time=0, total_time=Decimal('2')), missed_speaker=0, falarm_speaker=0, scored_speaker=2)
```
### Aligning sequences
Sequences can be aligned, similar to `kaldialign.align`, using the tcpWER matching:
```python
import meeteval
reference = [{'words': 'a b', 'start_time': 0, 'end_time': 1}]
hypothesis = [
    {'words': 'a c', 'start_time': 0, 'end_time': 1},
    {'words': 'd', 'start_time': 2, 'end_time': 3},
]
meeteval.wer.wer.time_constrained.align(reference, hypothesis)
# [('a', 'a'), ('b', 'c'), ('*', 'd')]  ('*' marks a word missing on one side)
```
## Visualization
> [!TIP]
> Try it in the browser! https://fgnt.github.io/meeteval_viz
```python
import meeteval
from meeteval.viz.visualize import AlignmentVisualization
folder = r'https://raw.githubusercontent.com/fgnt/meeteval/main/'
av = AlignmentVisualization(
meeteval.io.load(folder + 'example_files/ref.stm').groupby('filename')['recordingA'],
meeteval.io.load(folder + 'example_files/hyp.stm').groupby('filename')['recordingA']
)
# display(av) # Jupyter
# av.dump('viz.html') # Create standalone HTML file
```
## Cite
The toolkit and the tcpWER were presented at the CHiME-2023 workshop (Computational Hearing in Multisource Environments) with the paper
["MeetEval: A Toolkit for Computation of Word Error Rates for Meeting Transcription Systems"](https://www.isca-archive.org/chime_2023/neumann23_chime.pdf).
[![ISCA DOI](https://img.shields.io/badge/ISCA/DOI-10.21437/CHiME.2023--6-blue.svg)](https://doi.org/10.21437/CHiME.2023-6)
[![arXiv](https://img.shields.io/badge/arXiv-2307.11394-b31b1b.svg)](https://arxiv.org/abs/2307.11394)
```bibtex
@InProceedings{MeetEval23,
author = {von Neumann, Thilo and Boeddeker, Christoph and Delcroix, Marc and Haeb-Umbach, Reinhold},
title = {{MeetEval}: A Toolkit for Computation of Word Error Rates for Meeting Transcription Systems},
year = {2023},
booktitle = {Proc. 7th International Workshop on Speech Processing in Everyday Environments (CHiME 2023)},
pages = {27--32},
doi = {10.21437/CHiME.2023-6}
}
```
The MIMO WER and efficient implementation of ORC WER are presented in the paper
["On Word Error Rate Definitions and their Efficient Computation for Multi-Speaker Speech Recognition Systems"](https://ieeexplore.ieee.org/iel7/10094559/10094560/10094784.pdf).
[![IEEE DOI](https://img.shields.io/badge/IEEE/DOI-10.1109/ICASSP49357.2023.10094784-blue.svg)](https://doi.org/10.1109/ICASSP49357.2023.10094784)
[![arXiv](https://img.shields.io/badge/arXiv-2211.16112-b31b1b.svg)](https://arxiv.org/abs/2211.16112)
```bibtex
@InProceedings{MIMO23,
author = {von Neumann, Thilo and Boeddeker, Christoph and Kinoshita, Keisuke and Delcroix, Marc and Haeb-Umbach, Reinhold},
title = {On Word Error Rate Definitions and their Efficient Computation for Multi-Speaker Speech Recognition Systems},
booktitle = {ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
year = {2023},
doi = {10.1109/ICASSP49357.2023.10094784}
}
```