sed-scores-eval

Name: sed-scores-eval
Version: 0.0.3
Home page: https://github.com/fgnt/sed_scores_eval
Summary: (Threshold-Independent) Evaluation of Sound Event Detection Scores
Upload time: 2024-04-21 02:33:50
Author: Department of Communications Engineering, Paderborn University
License: MIT
Keywords: sound recognition evaluation from classification scores

# sed_scores_eval

![GitHub Actions](https://github.com/fgnt/sed_scores_eval/actions/workflows/pytest.yml/badge.svg)

sed_scores_eval is a package for the efficient (threshold-independent)
evaluation of Sound Event Detection (SED) systems based on the SED system's
soft classification scores as described in
> **Threshold-Independent Evaluation of Sound Event Detection Scores**  
J. Ebbers, R. Serizel and R. Haeb-Umbach  
in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP) 2022
https://arxiv.org/abs/2201.13148

With SED systems providing soft classification scores (usually frame-wise),
performance can be evaluated at different operating points (OPs) by varying the
decision/discrimination threshold used for binarization of the soft scores.
Other evaluation frameworks evaluate a list of detected sounds
(list of event labels with corresponding event onset and offset times) for each
decision threshold separately.
Therefore, they cannot be used to accurately evaluate performance curves over
all thresholds (such as Precision-Recall curves and ROC curves), given that
there are many thousands (or even millions) of thresholds (as many as there
are frames in the dataset), each resulting in a different list of detections.
With such frameworks, performance curves can at most be approximated from a
limited subset of thresholds, which may result in inaccurate curves (see the
figure below).
sed_scores_eval, in contrast, efficiently evaluates performance for all
decision thresholds jointly (also for sophisticated collar-based and
intersection-based evaluation criteria, see paper for details). It therefore
enables the efficient and accurate computation of performance curves such as
Precision-Recall Curves and ROC Curves.
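To get an intuition for the scale of the problem: in a frame-wise binary
detection setting, every distinct score value is its own operating point, so
the number of thresholds at which a curve changes is on the order of the
number of frames. A minimal self-contained sketch (using numpy and
scikit-learn for illustration only; neither reflects this package's API):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Simulated frame-wise soft scores and binary targets for a single event class.
rng = np.random.default_rng(0)
scores = rng.random(100_000)             # one soft score per frame
targets = rng.integers(0, 2, 100_000)    # one binary target per frame

# scikit-learn enumerates (nearly) every distinct score value as a threshold,
# i.e., the exact curve has about as many operating points as there are frames.
precision, recall, thresholds = precision_recall_curve(targets, scores)
print(len(thresholds))
```

Evaluating only a small subset of these thresholds is what produces the
approximation error illustrated in the figure below.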

![Fig: PSD ROC from example code](https://raw.githubusercontent.com/fgnt/sed_scores_eval/master/notebooks/psd_roc.png)

If you use this package, please cite our paper.

## Supported Evaluation Criteria
### Intermediate Statistics:
* Segment-based [[1]](#1): Classifications and targets are defined and
  evaluated in fixed length segments.
* Collar-based (a.k.a. event-based) [[1]](#1): Checks whether a detected event
  (onset, offset, event_label) matches a ground truth event up to a certain
  collar on the onset and offset.
* Intersection-based [[2]](#2): Evaluates the intersections of detected and
  ground truth events (please also cite [[2]](#2) if you use intersection-based
  evaluation).
* Clip-based: Audio Tagging evaluation
  
### Evaluation Metrics / Curves:
* Precision-Recall (PR) Curve: Precisions for arbitrary decision thresholds
  plotted over Recalls
* F-Score Curve: F-Scores plotted over decision thresholds
* F-Score @ OP: F-Score for a specified decision threshold
* F-Score @ Best: F-Score for the optimal decision threshold (w.r.t. the
  considered dataset)
* Average Precision: weighted mean of precisions for arbitrary decision thresholds.
  Weights are the increase in recall compared to the prior recall.
* Error-Rate Curve: Error-Rates plotted over decision thresholds
* Error-Rate @ OP: Error-Rate for a specified decision threshold
* Error-Rate @ Best: Error-Rate for the optimal decision threshold (w.r.t. the
  considered dataset)
* ROC Curve: True-Positive rates (recalls) for arbitrary decision thresholds
  plotted over False-Positive rates
* Area under ROC curve
* PSD-ROC Curve: effective True Positive Rates (eTPRs) plotted over effective
  False Positive Rates (eFPRs) as described in [[2]](#2)*.
* PSD Score (PSDS): normalized Area under PSD-ROC Curve (until a certain
  maximum eFPR).
* Post-processing independent PSD-ROC Curve (pi-PSD-ROC): effective True Positive Rates (eTPRs) plotted over effective
  False Positive Rates (eFPRs) from different post-processings as described in [[3]](#3).
* Post-processing independent PSDS (piPSDS): normalized Area under pi-PSD-ROC Curve (until a certain
  maximum eFPR).
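
As a usage illustration, the following sketch computes PSDS with this package.
It assumes the package's `intersection_based.psds` interface with DCASE-style
parameter values; treat the exact function name, argument names, and file
paths as assumptions and consult the example notebooks for authoritative
usage:

```python
from sed_scores_eval import intersection_based

# Placeholder paths; the score dir and tsv formats are described in the
# "Input Format" section below.
psds_value, psd_roc, single_class_psd_rocs = intersection_based.psds(
    scores='/path/to/score_dir',
    ground_truth='/path/to/ground_truth.tsv',
    audio_durations='/path/to/durations.tsv',
    dtc_threshold=0.7, gtc_threshold=0.7,   # intersection criteria from [2]
    alpha_ct=0.0, alpha_st=1.0,             # cross-trigger / class-variance weights
    unit_of_time='hour', max_efpr=100.,     # eFPR axis unit and integration limit
)
print(psds_value)
```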


<a id="1">[1]</a> A. Mesaros, T. Heittola, and T. Virtanen,
"Metrics for polyphonic sound event detection", Applied Sciences,
2016,

<a id="2">[2]</a> C. Bilen, G. Ferroni, F. Tuveri, J. Azcarreta and S. Krstulovic,
"A Framework for the Robust Evaluation of Sound Event Detection",
in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP),
2020,
arXiv: https://arxiv.org/abs/1910.08440

<a id="3">[3]</a> J. Ebbers, R. Haeb-Umbach, and R. Serizel,
"Post-Processing Independent Evaluation of Sound Event Detection Systems",
submitted to Detection and Classification of Acoustic Scenes and Events (DCASE) Workshop,
2023,
arXiv: https://arxiv.org/abs/2306.15440

*Please also cite [[2]](#2) if you use PSD-ROC and/or PSDS.

## IPython Notebooks
Have a look at the provided example [notebooks](./notebooks) for usage examples
and for some comparisons/validations against reference packages.

## Input Format
### System's Classification Scores
The system's classification scores need to be saved in a dedicated folder with
a tsv score file for each audio file from the evaluation set.
The score files have to be named according to the audio file names.
If, e.g., the audio file is "test1.wav" the score file's name needs to be
"test1.tsv".
Arbitrary and also varying window lengths are allowed, but windows need to be
non-overlapping and gapless, i.e., the onset time of the next window must be
the offset time of the current window. For each score window, the onset and
offset times of the window (in seconds) must be stated in the first and second
column, respectively, followed by the classification scores for each event
class in a separate column, as illustrated in the following example:

|onset|offset|class1  |class2  |class3  |...     |
|----:|-----:|-------:|-------:|-------:|-------:|
|0.0  |0.02  |0.010535|0.057549|0.063102|...     |
|0.02 |0.04  |0.001196|0.167730|0.098838|...     |
|...  |...   |...     |...     |...     |...     |
|4.76 |4.78  |0.015128|0.769687|0.087403|...     |
|4.78 |4.8   |0.002032|0.587578|0.120165|...     |
|...  |...   |...     |...     |...     |...     |
|9.98 |10.0  |0.031421|0.089716|0.929873|...     |
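
For illustration, such a score file is plain tab-separated text, so it can be
inspected with pandas independently of this package (a generic sketch; the
path is a placeholder):

```python
import pandas as pd

# Columns: onset, offset, then one score column per event class.
scores_df = pd.read_csv('/path/to/score_dir/test1.tsv', sep='\t')
print(scores_df.columns.tolist())  # e.g. ['onset', 'offset', 'class1', 'class2', 'class3']
```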

At inference time, when your system outputs a classification score array
`scores_arr` of shape TxK, with T and K being the number of windows and event
classes, respectively, you can conveniently write a score file of the above
format as follows:
```python
sed_scores_eval.io.write_sed_scores(
    scores_arr, '/path/to/score_dir/test1.tsv',
    timestamps=timestamps, event_classes=event_classes
)
```
where `timestamps` must be a 1d list or array of length T+1 providing the
window boundary times and `event_classes` must be a list of length K providing
the event class names corresponding to the columns in `scores_arr`.
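
For instance, with a fixed window shift the boundary times can be constructed
as follows (a minimal sketch with dummy scores; the 20 ms window shift, class
names, and path are assumptions for illustration):

```python
import numpy as np
import sed_scores_eval

# Dummy model output for one file: T windows x K event classes.
T, K = 500, 3
scores_arr = np.random.default_rng(0).random((T, K))

window_shift = 0.02                              # assumed fixed 20 ms windows
timestamps = np.arange(T + 1) * window_shift     # T+1 window boundary times
event_classes = ['class1', 'class2', 'class3']   # column order of scores_arr

sed_scores_eval.io.write_sed_scores(
    scores_arr, '/path/to/score_dir/test1.tsv',
    timestamps=timestamps, event_classes=event_classes,
)
```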

If the output scores of the whole dataset fit into memory, you can also
provide a dict of pandas.DataFrames of the above format, where the dict keys
must be the file ids (e.g. "test1").
Score dataframes can be obtained from score arrays analogously to the above by
```python
scores["test1"] = sed_scores_eval.utils.create_score_dataframe(
    scores_arr, timestamps=timestamps, event_classes=event_classes
)
```

### Ground Truth
The ground truth events for the whole dataset must be provided either as a
file of the following format:

|filename   |onset|offset|event\_label|
|----------:|----:|-----:|-----:|
|test1.wav |3.98 |4.86  |class2|
|test1.wav |9.05 |10.0  |class3|
|test2.wav |0.0  |4.07  |class1|
|test2.wav |0.0  |8.54  |class2|
|test2.wav |5.43 |7.21  |class1|
|...        |...  |...   |...   |

or as a dict
```python
{
  "test1": [(3.98, 4.86, "class2"), (9.05, 10.0, "class3")],
  "test2": [(0.0, 4.07, "class1"), (0.0, 8.54, "class2"), (5.43, 7.21, "class1")],
  ...
}
```
which can be obtained from the file by
```python
ground_truth_dict = sed_scores_eval.io.read_ground_truth_events(ground_truth_file)
```
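
Conversely, if your ground truth only exists in memory, a file of the above
format can be written with plain pandas (a generic sketch, not a package API):

```python
import pandas as pd

events = [
    ('test1.wav', 3.98, 4.86, 'class2'),
    ('test1.wav', 9.05, 10.0, 'class3'),
    ('test2.wav', 0.0, 4.07, 'class1'),
]
pd.DataFrame(
    events, columns=['filename', 'onset', 'offset', 'event_label']
).to_csv('/path/to/ground_truth.tsv', sep='\t', index=False)
```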

### Audio durations
If required, you have to provide the audio durations (in seconds) either as a
file of the following format:

|filename |duration|
|--------:|---:|
|test1.wav|10.0|
|test2.wav|9.7 |
|...      |... |

or as a dict
```python
{
  "test1": 10.0,
  "test2": 9.7,
  ...
}
```
which can be obtained from the file by
```python
durations_dict = sed_scores_eval.io.read_audio_durations(durations_file)
```
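
If the durations are not known beforehand, they can, e.g., be read from the
audio files themselves (a sketch using the soundfile package, which is not
required by sed_scores_eval; the audio directory path is a placeholder):

```python
from pathlib import Path
import soundfile as sf

# Map file ids (filenames without extension) to durations in seconds.
durations = {
    wav.stem: sf.info(str(wav)).duration
    for wav in sorted(Path('/path/to/audio_dir').glob('*.wav'))
}
```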

## Installation
Install the package directly
```bash
$ pip install git+https://github.com/fgnt/sed_scores_eval.git
```
or clone and install (editable)
```bash
$ git clone https://github.com/fgnt/sed_scores_eval.git
$ cd sed_scores_eval
$ pip install --editable .
```
