stable-codec

- Name: stable-codec
- Version: 0.1.2
- Home page: https://github.com/Stability-AI/stable-codec/
- Summary: Stable Codec: A series of codec models for speech and audio
- Upload time: 2025-01-14 13:38:46
- Author: Stability AI
- Requires Python: >=3.9

# Stable Codec

This repository contains training and inference scripts for models in the Stable Codec series, starting with `stable-codec-speech-16k`, which was introduced in the paper *Scaling Transformers for Low-bitrate High-Quality Speech Coding*.

Paper: https://arxiv.org/abs/2411.19842

Sound demos: https://stability-ai.github.io/stable-codec-demo/

Model weights: https://huggingface.co/stabilityai/stable-codec-speech-16k

## License

Note that whilst this code is MIT licensed, the model weights are covered by the [Stability AI Community License](https://huggingface.co/stabilityai/stable-codec-speech-16k/blob/main/LICENSE.md).

## Variants
The model is currently available in two variants:
- `stable-codec-speech-16k-base` contains the weights corresponding to the results in our [publication](https://arxiv.org/abs/2411.19842), provided for reproducibility.
- `stable-codec-speech-16k` is an improved finetune with boosted latent semantics. It is the recommended choice for almost all use cases.

### Additional Training

In addition to the training described in the paper, the weights for `stable-codec-speech-16k` have undergone 500k steps of finetuning with force-aligned data from LibriLight and the English portion of Multilingual LibriSpeech. This was done by using a CTC head to regress the force-aligned phoneme tags from the pre-bottleneck latents. We found that this additional training significantly boosted the applicability of the codec tokens to downstream tasks like TTS, at a small cost to objective reconstruction metrics.

## Install

The model itself is defined in the [stable-audio-tools](https://github.com/Stability-AI/stable-audio-tools) package.

To install `stable-codec`:

```bash
pip install stable-codec
pip install -U flash-attn --no-build-isolation
```

**IMPORTANT NOTE:** This model currently has a hard requirement for FlashAttention due to its use of sliding window attention. Inference without FlashAttention will likely be greatly degraded. This also means that the model currently does not support CPU inference. We will relax the dependency on FlashAttention in the future.
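
Because of this requirement, it can help to verify the environment before loading the model. The following is a minimal sketch and not part of `stable-codec`: the helper name `check_environment` is hypothetical, and it only checks that the `flash_attn` module is importable and that PyTorch reports a CUDA device.

```python
import importlib.util


def check_environment() -> dict:
    """Report whether the pieces needed for GPU inference appear to be present."""
    has_torch = importlib.util.find_spec("torch") is not None
    has_flash_attn = importlib.util.find_spec("flash_attn") is not None

    has_cuda = False
    if has_torch:
        import torch  # only imported when PyTorch is installed

        has_cuda = torch.cuda.is_available()

    return {"torch": has_torch, "flash_attn": has_flash_attn, "cuda": has_cuda}


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"{name}: {'ok' if ok else 'missing'}")
```

If any entry is reported as missing, inference with this model is unlikely to work as intended.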

## Encoding and decoding

To encode audio or decode tokens, the `StableCodec` class provides a convenient wrapper for the model. It can be used with a local checkpoint and config as follows:

```python
import torch
import torchaudio
from stable_codec import StableCodec

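# Build the codec from a local model config and (optionally) a checkpoint.
model = StableCodec(
    model_config_path="<path-to-model-config>",
    ckpt_path="<path-to-checkpoint>",  # optional, can be `None`
    device=torch.device("cuda"),
)

audiopath = "audio.wav"

# Encode the file to latents and discrete tokens, then reconstruct audio from the tokens.
latents, tokens = model.encode(audiopath)
decoded_audio = model.decode(tokens)

torchaudio.save("decoded.wav", decoded_audio, model.sample_rate)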
```

To download the model weights automatically from Hugging Face, pass the model name instead:

```python
from stable_codec import StableCodec

# Fetches the model config and weights from Hugging Face.
model = StableCodec(
    pretrained_model="stabilityai/stable-codec-speech-16k",
)
```

### Posthoc bottleneck configuration

Most use cases will benefit from replacing the training-time FSQ bottleneck with a post-hoc FSQ bottleneck, as described in the paper. This reduces the token dictionary size to a level that is practical for modern language models. To do so, call `set_posthoc_bottleneck` and pass `posthoc_bottleneck = True` to the encode and decode calls:

```python
model.set_posthoc_bottleneck("2x15625_700bps")
latents, tokens = model.encode(audiopath, posthoc_bottleneck = True)
decoded_audio = model.decode(tokens, posthoc_bottleneck = True)
```

`set_posthoc_bottleneck` takes a string argument that selects one of several recommended bottleneck presets:

| Bottleneck Preset | Number of Tokens per step | Dictionary Size | Bits Per Second (bps) |
|-------------------|------------------|-----------------|-----------------------|
| `1x46656_400bps`   | 1             | 46656             | 400                   |
| `2x15625_700bps`   | 2             | 15625             | 700                   |
| `4x729_1000bps`    | 4             | 729               | 1000                  |

Alternatively, the bottleneck stages can be specified directly. The format for specifying this can be seen in the definition of the `StableCodec` class in `model.py`.
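
The bitrates in the preset table follow from the token count and dictionary size, since each token carries log2(dictionary size) bits. The sketch below reproduces that arithmetic. The 25 Hz token rate is an assumption chosen to match the preset names, not a value stated in this README, and the function name `bits_per_second` is illustrative.

```python
import math

FRAME_RATE_HZ = 25.0  # assumed token steps per second (not stated in this README)

# (tokens per step, dictionary size) for each preset in the table
PRESETS = {
    "1x46656_400bps": (1, 46656),
    "2x15625_700bps": (2, 15625),
    "4x729_1000bps": (4, 729),
}


def bits_per_second(tokens_per_step: int, dictionary_size: int,
                    frame_rate_hz: float = FRAME_RATE_HZ) -> float:
    """Bitrate = steps per second * tokens per step * bits per token."""
    return frame_rate_hz * tokens_per_step * math.log2(dictionary_size)


if __name__ == "__main__":
    for name, (n_tokens, dict_size) in PRESETS.items():
        print(f"{name}: {bits_per_second(n_tokens, dict_size):.1f} bps")
```

Under that assumption the presets come out at roughly 388, 697, and 951 bps, which the preset names round to 400, 700, and 1000.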

### Normalization

The model is trained with utterances normalized to -20 ±5 LUFS. The `encode` function normalizes its input to -20 LUFS by default; this can be disabled by passing `normalize = False` when calling the function.
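
For audio prepared outside of `encode`, the gain needed to reach the target level is simple decibel arithmetic. This sketch assumes the integrated loudness has already been measured with an ITU-R BS.1770 loudness meter (not shown here); the helper names are illustrative and not part of `stable-codec`.

```python
TARGET_LUFS = -20.0
TOLERANCE_LU = 5.0  # training utterances were normalized to -20 +/- 5 LUFS


def gain_to_target(measured_lufs: float, target_lufs: float = TARGET_LUFS) -> float:
    """Linear amplitude gain that moves a signal from measured_lufs to target_lufs."""
    gain_db = target_lufs - measured_lufs
    return 10.0 ** (gain_db / 20.0)


def in_training_range(measured_lufs: float) -> bool:
    """True if the loudness lies within the range the model was trained on."""
    return abs(measured_lufs - TARGET_LUFS) <= TOLERANCE_LU


if __name__ == "__main__":
    measured = -26.0  # example measurement, in LUFS
    print(f"in training range: {in_training_range(measured)}")
    print(f"linear gain to reach target: {gain_to_target(measured):.3f}")
```

Multiplying the waveform samples by the returned gain shifts the loudness by the corresponding number of decibels.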

## Finetune

To finetune a model from its config and checkpoint, run `train.py`:

```bash
python train.py \
    --project "stable-codec" \
    --name "finetune" \
    --config-file "defaults.ini" \
    --save-dir "<ckpt-save-dir>" \
    --model-config "<path-to-config.json>" \
    --dataset-config "<dataset-config.json>" \
    --val-dataset-config "<dataset-config.json>" \
    --pretrained-ckpt-path "<pretrained-model-ckpt.ckpt>" \
    --ckpt-path "$CKPT_PATH" \
    --num-nodes $SLURM_JOB_NUM_NODES \
    --num-workers 16 --batch-size 10 --precision "16-mixed" \
    --checkpoint-every 10000 \
    --logger "wandb"
```

For dataset configuration, refer to `stable-audio-tools` [dataset docs](https://github.com/Stability-AI/stable-audio-tools/blob/main/docs/datasets.md).


### Using CTC loss

To use [CTC loss](https://pytorch.org/docs/stable/generated/torch.nn.CTCLoss.html)
during training you have to enable it in the training configuration file
and in the training dataset configuration.

1. Modifying training configuration:
    - Enable CTC projection head and set its hidden dimension:
      ```python
      config["model"]["use_proj_head"] = True
      config["model"]["proj_head_dim"] = 81
      ```
    - Enable CTC in the training part of the config:
      ```python
      config["training"]["use_ctc"] = True
      ```
    - And set its loss config:
      ```python
      config["training"]["loss_configs"]["ctc"] = {
        "blank_idx": 80,
        "decay": 1.0,
        "weights": {"ctc": 1.0}
      }
      ```
    - Optionally, you can enable computation of the Phone-Error-Rate (PER) during validation:
      ```python
      config["training"]["eval_loss_configs"]["per"] = {}
      ```

2. Configuring dataset (only WebDataset format is supported for CTC):
   - The dataset configuration needs one additional field (see [dataset docs](https://github.com/Stability-AI/stable-audio-tools/blob/main/docs/datasets.md) for other options):
     ```python
     config["force_align_text"] = True
     ```
   - The JSON metadata file for each sample should contain, besides other metadata, a force-aligned transcript under the `force_aligned_text` entry in the format shown below.
     Here `transcript` is a list of word-level alignments whose `start` and `end` fields give the range of each word **in seconds**.
     ```json
     {
       "normalized_text": "and i feel",
       "force_aligned_text": {
         "transcript": [
           {
             "word": "and",
             "start": 0.2202,
             "end": 0.3403
           },
           {
             "word": "i",
             "start": 0.4604,
             "end": 0.4804
           },
           {
             "word": "feel",
             "start": 0.5204,
             "end": 0.7006
           }
         ]
       }
     }
     ```
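
The configuration changes and metadata format above can be sketched in Python as follows. This is an illustrative sketch only: `enable_ctc` and `check_alignment` are hypothetical helpers, not part of `stable-codec`, and the assumption that `blank_idx` is the last index of the projection head is inferred from the values 81 and 80 above.

```python
def enable_ctc(config: dict, num_classes: int = 81) -> dict:
    """Apply the CTC-related settings listed above to a training config dict."""
    model = config.setdefault("model", {})
    model["use_proj_head"] = True
    model["proj_head_dim"] = num_classes

    training = config.setdefault("training", {})
    training["use_ctc"] = True
    training.setdefault("loss_configs", {})["ctc"] = {
        "blank_idx": num_classes - 1,  # assumed: blank is the last class
        "decay": 1.0,
        "weights": {"ctc": 1.0},
    }
    # Optional: Phone-Error-Rate during validation.
    training.setdefault("eval_loss_configs", {})["per"] = {}
    return config


def check_alignment(metadata: dict) -> list:
    """Return a list of problems found in a sample's `force_aligned_text` entry."""
    aligned = metadata.get("force_aligned_text")
    transcript = aligned.get("transcript") if isinstance(aligned, dict) else None
    if not isinstance(transcript, list):
        return ["missing force_aligned_text.transcript list"]
    problems = []
    for i, entry in enumerate(transcript):
        start, end = entry.get("start"), entry.get("end")
        if not isinstance(entry.get("word"), str):
            problems.append(f"entry {i}: missing word")
        if not isinstance(start, (int, float)) or not isinstance(end, (int, float)):
            problems.append(f"entry {i}: start/end must be numbers (seconds)")
        elif not 0 <= start <= end:
            problems.append(f"entry {i}: expected 0 <= start <= end")
    return problems


if __name__ == "__main__":
    sample = {
        "normalized_text": "and i feel",
        "force_aligned_text": {"transcript": [
            {"word": "and", "start": 0.2202, "end": 0.3403},
            {"word": "i", "start": 0.4604, "end": 0.4804},
            {"word": "feel", "start": 0.5204, "end": 0.7006},
        ]},
    }
    print(enable_ctc({})["training"]["loss_configs"]["ctc"])
    print(check_alignment(sample))
```

Running a check like this over the metadata before training can surface malformed alignments early; the actual requirements of the data loader may be stricter than what is checked here.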

## Objective Metrics

| Model                          | SI-SDR | Mel Distance | STFT Distance | PESQ | STOI |
|--------------------------------|-------:|-------------:|--------------:|-----:|-----:|
| `stable-codec-speech-16k-base` |   4.73 |         0.86 |          1.26 | 3.09 | 0.92 |
| `stable-codec-speech-16k`      |   3.58 |         0.90 |          1.30 | 3.01 | 0.90 |
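
For readers unfamiliar with the first column, SI-SDR (scale-invariant signal-to-distortion ratio) projects the estimate onto the reference and compares the energy of that projection with the energy of the residual. The sketch below implements the standard definition in plain Python for zero-mean signals; it is not the evaluation code behind the table, and the function name is illustrative.

```python
import math


def si_sdr(reference, estimate) -> float:
    """Scale-invariant SDR in dB between two equal-length, zero-mean sample sequences.

    Raises ZeroDivisionError if the reference is silent or the estimate is a
    perfect (scaled) copy of the reference.
    """
    if len(reference) != len(estimate):
        raise ValueError("signals must have the same length")
    dot = sum(r * e for r, e in zip(reference, estimate))
    ref_energy = sum(r * r for r in reference)
    scale = dot / ref_energy  # projection coefficient of the estimate onto the reference
    target = [scale * r for r in reference]
    noise_energy = sum((e - t) ** 2 for e, t in zip(estimate, target))
    target_energy = sum(t * t for t in target)
    return 10.0 * math.log10(target_energy / noise_energy)


if __name__ == "__main__":
    ref = [1.0, 0.0, -1.0, 0.0]
    est = [1.0, 0.1, -1.0, -0.1]
    print(f"{si_sdr(ref, est):.2f} dB")
    # Rescaling the estimate leaves the score unchanged.
    print(f"{si_sdr(ref, [3 * x for x in est]):.2f} dB")
```

Higher SI-SDR, PESQ, and STOI values indicate better reconstruction, while lower Mel and STFT distances are better.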


            
