kani-tts 0.0.4

- Summary: Text-to-speech using neural audio codec and causal language models
- Homepage: https://github.com/nineninesix-ai/kani-tts
- Requires Python: >=3.10
- License: Apache-2.0
- Keywords: tts, text-to-speech, audio, nemo, transformers
- Uploaded: 2025-11-03 02:17:29
# Kani-TTS

A simple and efficient text-to-speech library using neural audio codecs and causal language models.

## Features

- Simple, intuitive API
- Built on Hugging Face Transformers and NVIDIA NeMo
- High-quality audio generation using neural codecs
- GPU acceleration support
- Multi-speaker model support with easy speaker selection

## Installation

### From PyPI

```bash
pip install kani-tts
pip install -U "transformers==4.57.1"  # pinned version required for LFM2 models
```


## Quick Start

```python
from kani_tts import KaniTTS

# Initialize model (replace with your model name)
model = KaniTTS('your-model-name-here')

# Generate audio from text
audio, text = model("Hello, world!")

# Save to file (requires soundfile)
model.save_audio(audio, "output.wav")
```
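For longer inputs, it can help to generate sentence by sentence and stitch the results. A minimal sentence splitter for that purpose (plain Python; `split_sentences` is a hypothetical helper, not part of the Kani-TTS API):

```python
import re

def split_sentences(text):
    """Naive sentence splitter: break on ., !, or ? followed by whitespace."""
    parts = re.split(r'(?<=[.!?])\s+', text.strip())
    return [p for p in parts if p]
```

You could then loop over the pieces, call `model(sentence)` for each, and concatenate the resulting arrays.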

## Advanced Usage

### Working with Multi-Speaker Models

Some models support multiple speakers. You can check if your model supports speakers and select a specific voice:

```python
from kani_tts import KaniTTS

model = KaniTTS('your-multispeaker-model-name')

# Check if model supports multiple speakers
print(f"Model type: {model.status}")  # 'singlspeaker' (sic) or 'multispeaker'

# Display available speakers (pretty formatted)
model.show_speakers()

# Or access the speaker list directly
print(model.speaker_list)  # ['Speaker1', 'Speaker2', ...]

# Generate audio with a specific speaker
audio, text = model.generate("Hello, world!", speaker_id="Speaker1")
model.save_audio(audio, "speaker1_output.wav")

# Or using the shorthand call syntax
audio, text = model("Hello, world!", speaker_id="Speaker1")
```

### Custom Configuration

```python
from kani_tts import KaniTTS

model = KaniTTS(
    'your-model-name',
    temperature=0.7,           # Control randomness (default: 1.0)
    top_p=0.9,                 # Nucleus sampling (default: 0.95)
    max_new_tokens=2000,       # Max audio length (default: 1200)
    repetition_penalty=1.2,    # Prevent repetition (default: 1.1)
    suppress_logs=True,        # Suppress library logs (default: True)
    show_info=True,            # Show model info on init (default: True)
)

audio, text = model("Your text here")
```

When initialized, Kani-TTS displays a beautiful banner with model information:
```
╔════════════════════════════════════════════════════════════╗
║                                                            ║
║                   N I N E N I N E S I X  😼                ║
║                                                            ║
╚════════════════════════════════════════════════════════════╝

              /\_/\
             ( o.o )
              > ^ <

──────────────────────────────────────────────────────────────
  Model: your-model-name
  Device: GPU (CUDA)
  Mode: Multi-speaker (5 speakers)

  Configuration:
    • Sample Rate: 22050 Hz
    • Temperature: 1.0
    • Top-p: 0.95
    • Max Tokens: 1200
    • Repetition Penalty: 1.1
──────────────────────────────────────────────────────────────

  Ready to generate speech! 🎵
```

You can disable this banner by setting `show_info=False`, or show it again anytime with `model.show_model_info()`.

### Controlling Logging Output

By default, Kani-TTS suppresses all logging output from transformers, NeMo, and PyTorch to keep your console clean. Only your `print()` statements will be visible.

```python
from kani_tts import KaniTTS

# Default behavior - logs are suppressed
model = KaniTTS('your-model-name')

# To see all library logs (for debugging)
model = KaniTTS('your-model-name', suppress_logs=False)

# You can also manually suppress logs at any time
from kani_tts import suppress_all_logs
suppress_all_logs()
```

### Working with Audio Output

The generated audio is a NumPy array sampled at 22,050 Hz:

```python
import numpy as np
import soundfile as sf

audio, text = model("Generate speech from this text")

# Audio is a numpy array
print(audio.shape)  # (num_samples,)
print(audio.dtype)  # float32/float64

# Save using soundfile
sf.write('output.wav', audio, 22050)

# Or use the built-in method
model.save_audio(audio, 'output.wav', sample_rate=22050)
```
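Because the output is a plain NumPy array, ordinary array operations apply. For example, several generated clips can be joined with a short pause between them (a sketch using only NumPy; `concat_with_silence` is an illustrative helper, not a library function):

```python
import numpy as np

SAMPLE_RATE = 22050  # Kani-TTS default sample rate

def concat_with_silence(clips, pause_s=0.3, sample_rate=SAMPLE_RATE):
    """Join audio clips, inserting a short silence between consecutive clips."""
    silence = np.zeros(int(pause_s * sample_rate), dtype=np.float32)
    parts = []
    for i, clip in enumerate(clips):
        if i > 0:
            parts.append(silence)
        parts.append(np.asarray(clip, dtype=np.float32))
    return np.concatenate(parts)
```

The combined array can be saved with `sf.write` or `model.save_audio` exactly like a single clip.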



### Playing Audio in Jupyter Notebooks

You can listen to generated audio directly in Jupyter notebooks or IPython:

```python
from kani_tts import KaniTTS
from IPython.display import Audio as aplay

model = KaniTTS('your-model-name')
audio, text = model("Hello, world!")

# Play audio in notebook
aplay(audio, rate=model.sample_rate)
```

## Architecture

Kani-TTS uses a two-stage architecture:

1. **Text → Audio Tokens**: A causal language model generates audio token sequences from text
2. **Audio Tokens → Waveform**: NVIDIA NeMo's NanoCodec decodes tokens into audio waveforms

The system uses special tokens to mark different segments:
- Text boundaries (start/end of text)
- Speech boundaries (start/end of speech)
- Speaker turns (human/AI)

Audio tokens are organized in 4-channel codebooks, with each channel representing different aspects of the audio signal.
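To make the codebook layout concrete, here is an illustrative sketch of de-interleaving a flat token stream into 4 codebook channels before decoding. The actual token layout is internal to the library, so treat the interleaving order here as an assumption:

```python
import numpy as np

NUM_CODEBOOKS = 4  # per the architecture description above

def deinterleave(tokens, num_codebooks=NUM_CODEBOOKS):
    """Reshape a flat interleaved token stream into (num_codebooks, num_frames)."""
    tokens = np.asarray(tokens)
    assert tokens.size % num_codebooks == 0, "stream length must be a multiple of the codebook count"
    # frame-major order assumed: tokens 0..3 are frame 0, tokens 4..7 are frame 1, ...
    return tokens.reshape(-1, num_codebooks).T
```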

## Requirements

- Python 3.10 or higher
- CUDA-capable GPU (recommended) or CPU
- PyTorch 2.0 or higher
- Transformers library
- NeMo Toolkit

## Model Compatibility

This library works with causal language models trained for TTS with the following characteristics:
- Extended vocabulary including audio tokens
- Special tokens for speech/text boundaries
- Compatible with the NeMo NanoCodec (22.05 kHz, 0.6 kbps, 12.5 fps)
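The codec frame rate also gives a back-of-the-envelope estimate of generation length: at 12.5 frames per second with 4 codebook channels per frame, a clip needs roughly duration × 12.5 × 4 audio tokens (the exact count additionally depends on special tokens, so treat this as an approximation):

```python
FRAME_RATE = 12.5   # codec frames per second
NUM_CODEBOOKS = 4   # codebook channels per frame

def estimate_audio_tokens(duration_s):
    """Rough audio-token count for a clip of the given duration in seconds."""
    return int(duration_s * FRAME_RATE * NUM_CODEBOOKS)
```

By this estimate, the default `max_new_tokens=1200` corresponds to roughly 24 seconds of audio.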



## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

## Citation

```
@inproceedings{emilialarge,
  author={He, Haorui and Shang, Zengqiang and Wang, Chaoren and Li, Xuyuan and Gu, Yicheng and Hua, Hua and Liu, Liwei and Yang, Chen and Li, Jiaqi and Shi, Peiyang and Wang, Yuancheng and Chen, Kai and Zhang, Pengyuan and Wu, Zhizheng},
  title={Emilia: A Large-Scale, Extensive, Multilingual, and Diverse Dataset for Speech Generation},
  booktitle={arXiv:2501.15907},
  year={2025}
}
```
```
@article{emonet_voice_2025,
  author={Schuhmann, Christoph and Kaczmarczyk, Robert and Rabby, Gollam and Friedrich, Felix and Kraus, Maurice and Nadi, Kourosh and Nguyen, Huu and Kersting, Kristian and Auer, Sören},
  title={EmoNet-Voice: A Fine-Grained, Expert-Verified Benchmark for Speech Emotion Detection},
  journal={arXiv preprint arXiv:2506.09827},
  year={2025}
}
```

## Acknowledgments

- Built on [Hugging Face Transformers](https://github.com/huggingface/transformers)
- Uses [NVIDIA NeMo](https://github.com/NVIDIA/NeMo) audio codec
- Powered by PyTorch

            
