# logWMSE
This audio quality metric, logWMSE, tries to address a limitation of existing metrics:
the lack of support for digital silence. Existing audio quality metrics, like ViSQOL,
CDPAM, SDR, SIR, SAR, ISR, STOI and SI-SDR, are not well-behaved when the target is
digital silence. Some also struggle when the reference audio gets perfectly reconstructed.
# Installation
`pip install git+https://github.com/nomonosound/log-wmse-audio-quality`
# Usage example
```python
import numpy as np
from log_wmse_audio_quality import calculate_log_wmse
sample_rate = 44100
input_sound = np.random.uniform(low=-1.0, high=1.0, size=(sample_rate,)).astype(np.float32)
est_sound = input_sound * 0.1
target_sound = np.zeros((sample_rate,), dtype=np.float32)
log_wmse = calculate_log_wmse(input_sound, est_sound, target_sound, sample_rate)
print(log_wmse) # Expected output: ~18.42
```
# Motivation and more info
Here are some examples of use cases where the target (reference) is pure digital silence:
* **Music source separation:** Imagine separating music into stems like "vocal", "drums",
"bass", and "other". A song without bass would naturally have a silent target for the "bass" stem.
* **Speech denoising:** If you have a recording with only noise, and no speech, the target
should be digital silence.
* **Multichannel speaker separation:** When evaluating speaker separation in a windowed
approach, periods when a speaker isn't speaking should be evaluated against silence.
Mean squared error (MSE) is well-defined for digital silence targets, but has several drawbacks:
* The values are commonly ridiculously small, like between 1e-8 and 1e-3, which makes number formatting and sight-reading hard
* It's not tailored specifically for audio
* It lacks scale invariance
* It doesn't consider the frequency sensitivity of human hearing
* It's not invariant to tiny errors that don't matter (because humans can't hear those errors anyway)
* It's not logarithmic, like human hearing is
**logWMSE** attempts to solve all the problems mentioned above. It's essentially the **log**
of a frequency-**weighted MSE**, with a few bells and whistles.
The frequency weighting looks like this:
![frequency_weighting.png](dev/frequency_weighting.png)
The idea of weighting it by frequency is to make it pay less attention to frequencies
that human hearing is less sensitive to. For example, an error at 3000 Hz sounds worse
than an error (with the same amplitude) at 50 Hz.
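To make the idea concrete, here is a minimal sketch of a frequency-weighted MSE. The weighting curve `toy_weighting` is a hypothetical bell shape centered at 3000 Hz, invented for illustration; it is not the filter shown in the plot above, and the FFT-domain multiplication is a simplification rather than the package's actual filtering.

```python
import numpy as np


def toy_weighting(freqs_hz):
    """Hypothetical weighting curve: 1.0 at 3000 Hz, falling off on a log-frequency axis."""
    f = np.maximum(freqs_hz, 1.0)  # avoid log10(0) at the DC bin
    return np.exp(-np.log10(f / 3000.0) ** 2)


def weighted_mse(error, sample_rate):
    """Weight the error signal in the frequency domain, then take its mean square."""
    spectrum = np.fft.rfft(error)
    freqs = np.fft.rfftfreq(len(error), d=1.0 / sample_rate)
    filtered = np.fft.irfft(spectrum * toy_weighting(freqs), n=len(error))
    return float(np.mean(filtered**2))


sample_rate = 44100
t = np.arange(sample_rate) / sample_rate

# Two error signals with the same amplitude but different frequencies
error_50hz = 0.1 * np.sin(2 * np.pi * 50 * t)
error_3khz = 0.1 * np.sin(2 * np.pi * 3000 * t)

mse_50 = weighted_mse(error_50hz, sample_rate)
mse_3k = weighted_mse(error_3khz, sample_rate)
print(mse_50 < mse_3k)  # the 50 Hz error counts for much less
```

Both error signals have the same plain MSE (0.005), but after weighting, the 50 Hz error contributes only a small fraction of what the 3000 Hz error does.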
This audio quality metric was made with **high sample rates** in mind, like 36000, 44100
and 48000 Hz. However, in theory it should also work for low sample rates, like 16000 Hz.
The metric function performs an internal resampling to 44100 Hz to make the frequency
weighting filter consistent across multiple input sample rates.
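As an illustration of that resampling step, the sketch below brings audio at an arbitrary sample rate to 44100 Hz with `scipy.signal.resample_poly`. The choice of SciPy and the helper name `resample_to_internal_rate` are assumptions made for this example; the package may use a different resampler internally.

```python
from math import gcd

import numpy as np
from scipy.signal import resample_poly

INTERNAL_SAMPLE_RATE = 44100  # the rate named in the text above


def resample_to_internal_rate(audio, sample_rate):
    """Resample along the last axis so that 1D and (channels, samples) arrays both work."""
    if sample_rate == INTERNAL_SAMPLE_RATE:
        return audio
    divisor = gcd(INTERNAL_SAMPLE_RATE, sample_rate)
    up = INTERNAL_SAMPLE_RATE // divisor
    down = sample_rate // divisor
    return resample_poly(audio, up, down, axis=-1).astype(np.float32)


rng = np.random.default_rng(0)
audio_16k = rng.uniform(-1.0, 1.0, size=(2, 16000)).astype(np.float32)  # 1 second, stereo
audio_44k = resample_to_internal_rate(audio_16k, 16000)
print(audio_44k.shape)  # (2, 44100)
```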
Unlike many audio quality metrics, logWMSE accepts a *triple* of audio inputs:
* unprocessed audio (e.g. a raw, noisy recording)
* processed audio (e.g. a denoised recording)
* target audio (e.g. a clean reference without noise)
Relative audio quality metrics usually take only the latter two as input. logWMSE
additionally needs the unprocessed audio, because it must measure how well the input
audio was attenuated toward the given target when the target is digital silence
(all zeros), and it must do so in a "scale-invariant" way. The scale invariance in
logWMSE differs from that of SI-SDR, where the processed audio can be scaled arbitrarily
relative to the target and still get the same score. logWMSE requires the gain of the
target to be consistent with the gain of the unprocessed sound, and the processed sound
must be scaled similarly to the target to get a good metric score. Put differently:
if all three sounds are scaled by the same arbitrary gain, the metric score stays the
same. Internally, this property is implemented by applying to both the processed audio
and the target audio the gain factor that would bring the filtered unprocessed audio
to 0 dB RMS.
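The normalization described above can be sketched with a simplified stand-in metric. `toy_log_wmse` below is not the package's implementation: it skips the frequency weighting and resampling, and the natural log, the scaler of 4 and the tolerance of 1e-7 are assumptions, chosen because they reproduce the ~18.42 from the usage example.

```python
import numpy as np


def toy_log_wmse(unprocessed, processed, target, scaler=4.0, tolerance=1e-7):
    """Simplified stand-in: normalize by the RMS of the unprocessed audio, then log the MSE."""
    rms = np.sqrt(np.mean(unprocessed**2))
    gain = 1.0 / max(rms, tolerance)  # gain that brings the unprocessed audio to 0 dB RMS
    mse = np.mean((gain * (processed - target)) ** 2)
    return float(-scaler * np.log(mse + tolerance))


rng = np.random.default_rng(0)
unprocessed = rng.uniform(-1.0, 1.0, 44100)
target = np.zeros_like(unprocessed)
processed = 0.1 * unprocessed

base = toy_log_wmse(unprocessed, processed, target)
# Same gain applied to all three signals: the score does not change
louder = toy_log_wmse(4 * unprocessed, 4 * processed, 4 * target)
# Gain applied to the processed audio only: the score drops
only_estimate_scaled = toy_log_wmse(unprocessed, 4 * processed, target)

print(round(base, 2), round(louder, 2), round(only_estimate_scaled, 2))
```

Scaling all three signals together leaves the score untouched, while scaling only the processed audio changes it. This matches the behavior described above and is where the metric differs from SI-SDR.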
logWMSE is scaled to the same order of magnitude as common SDR values. For example,
logWMSE=3 means poor quality, while logWMSE=30 means very good quality. In other words,
higher is better.
`calculate_log_wmse` accepts 1D or 2D numpy arrays as input. In the latter case,
the shape is expected to be `(channels, samples)`. The dtype of the numpy arrays is
expected to be float32.
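Audio loaded from files often arrives as integer PCM with shape `(samples, channels)`, so it may need converting first. The helper below is a hypothetical convenience function, not part of the package, and it assumes signed integer PCM in that layout.

```python
import numpy as np


def to_channels_first_float32(audio):
    """Convert (samples, channels) audio to float32 with shape (channels, samples).

    Assumes integer input is signed PCM (e.g. int16); 1D input stays 1D.
    """
    audio = np.asarray(audio)
    if np.issubdtype(audio.dtype, np.signedinteger):
        full_scale = float(np.iinfo(audio.dtype).max) + 1.0  # 32768 for int16
        audio = audio.astype(np.float32) / np.float32(full_scale)
    else:
        audio = audio.astype(np.float32)
    if audio.ndim == 2:
        audio = audio.T  # (samples, channels) -> (channels, samples)
    return np.ascontiguousarray(audio)


# Three stereo frames of int16 PCM, shape (samples, channels)
pcm = np.array([[0, 16384], [-32768, 32767], [8192, -8192]], dtype=np.int16)
prepared = to_channels_first_float32(pcm)
print(prepared.shape, prepared.dtype)  # (2, 3) float32
```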
Please note the following limitations:
* The metric isn't invariant to arbitrary scaling, polarity inversion, or offsets in the estimated audio *relative to the target*.
* Although it incorporates frequency filtering inspired by human auditory sensitivity, it doesn't fully model human auditory perception. For instance, it doesn't consider auditory masking.
Raw data
{
"_id": null,
"home_page": "https://github.com/nomonosound/log-wmse-audio-quality",
"name": "log-wmse-audio-quality",
"maintainer": "",
"docs_url": null,
"requires_python": ">=3.8",
"maintainer_email": "",
"keywords": "",
"author": "Iver Jordal",
"author_email": "",
"download_url": "https://files.pythonhosted.org/packages/26/c3/9dbbcaaadf6dc7bf1be894ff57666b5f26ed413aaea75619ecedb2d0145e/log-wmse-audio-quality-0.1.0.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "Apache 2.0",
"summary": "logWMSE is an audio quality metric with support for digital silence target. Useful for evaluating audio source separation systems, even when there are many audio tracks or stems.",
"version": "0.1.0",
"project_urls": {
"Homepage": "https://github.com/nomonosound/log-wmse-audio-quality",
"Issue Tracker": "https://github.com/nomonosound/log-wmse-audio-quality/issues"
},
"split_keywords": [],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "d782d4cd7b679854acd3841ab06ff23efdbdff2b52c23a4d0441142b1678fc9e",
"md5": "1672d3d33c8ab4c4a09b2c100de311d6",
"sha256": "829a47e2b8be4327dded4402afd9595f6f0f87b7d04eb809885363fc1dd2a092"
},
"downloads": -1,
"filename": "log_wmse_audio_quality-0.1.0-py3-none-any.whl",
"has_sig": false,
"md5_digest": "1672d3d33c8ab4c4a09b2c100de311d6",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.8",
"size": 35473,
"upload_time": "2023-09-15T13:40:54",
"upload_time_iso_8601": "2023-09-15T13:40:54.199233Z",
"url": "https://files.pythonhosted.org/packages/d7/82/d4cd7b679854acd3841ab06ff23efdbdff2b52c23a4d0441142b1678fc9e/log_wmse_audio_quality-0.1.0-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "26c39dbbcaaadf6dc7bf1be894ff57666b5f26ed413aaea75619ecedb2d0145e",
"md5": "e1b74c940c7b884e15e6f49792293efc",
"sha256": "c2773a8e27dd601a7c8a43114087c835a2faa08340a289275be9198006da882f"
},
"downloads": -1,
"filename": "log-wmse-audio-quality-0.1.0.tar.gz",
"has_sig": false,
"md5_digest": "e1b74c940c7b884e15e6f49792293efc",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.8",
"size": 40666,
"upload_time": "2023-09-15T13:40:56",
"upload_time_iso_8601": "2023-09-15T13:40:56.181921Z",
"url": "https://files.pythonhosted.org/packages/26/c3/9dbbcaaadf6dc7bf1be894ff57666b5f26ed413aaea75619ecedb2d0145e/log-wmse-audio-quality-0.1.0.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2023-09-15 13:40:56",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "nomonosound",
"github_project": "log-wmse-audio-quality",
"travis_ci": false,
"coveralls": false,
"github_actions": false,
"lcname": "log-wmse-audio-quality"
}