# Official TensorTract2 Implementation
This repository contains the official implementation of the TensorTract2 model described in the paper "[Precisely Controllable Neural Speech Synthesis](https://ieeexplore.ieee.org/abstract/document/10890772)" by Krug et al., published in Proc. ICASSP 2025.
TensorTract2 is a multi-task model that features:
- State-of-the-art acoustic-to-articulatory inversion
- A neural audio codec with a fully interpretable and disentangled speech latent representation that offers precise control over the speech production process
- High-performance voice conversion
- Speech denoising, pitch detection, and more
## NOTE
This is a stand-alone implementation of the TensorTract2 model, so no articulatory synthesizer is included. If you wish to perform **articulatory synthesis** from the latent representation, or to create **2D vocal tract visualizations and videos**, please use [TensorTractLab](https://github.com/tensortract-dev/tensortractlab), which integrates both TensorTract2 and VocalTractLab-Python.
## Installation
TensorTract2 can be installed via pip:
```bash
pip install tensortract2
```
# Usage
## Loading the Model
```python
from tensortract2 import TensorTract2
# Load the model
tt2 = TensorTract2()
```
By default, the model downloads the weights from [Google Drive](https://drive.google.com/file/d/1UIPlLKXCuVHPj3_nmJflw48TzXI4BtUw/view?usp=sharing) on its first initialization and places them in the user's cache directory (wavlm-large is downloaded from Hugging Face). On subsequent uses, no download is necessary. If you wish to load the weights manually, you can do it like this:
```python
tt2 = TensorTract2(auto_load_weights=False)
tt2.load_weights('path/to/weights')
```
Note that the model is always initialized in eval mode automatically, so you don't need to set it manually.
## Acoustic-to-Articulatory Inversion
You can convert any speech audio file to articulatory parameters using the `speech_to_motor` method. The input `x` can be a string or a list of strings. The output is a list of `MotorSeries` objects; for more information on these objects, see the [target-approximation](https://github.com/paul-krug/target-approximation) package.
```python
# Load speech from an audio file and process it
motor_data = tt2.speech_to_motor(
    x='path/to/audio.wav',
    # Optional parameters:
    msrs_type='tt2',
)

# motor_data is a list of MotorSeries objects.
# Each MotorSeries object contains the articulatory parameters;
# you can plot them like this:
motor_data[0].plot()

# Get the articulatory parameters as a numpy array:
array = motor_data[0].to_numpy()
```
The parameter `msrs_type` sets the format of the returned motor-series data: `'tt2'` means 20 articulatory features at a sampling rate of 50 Hz (the TensorTract2 standard), while `'vtl'` means 30 articulatory features at a sampling rate of 441 Hz (the VocalTractLab-Python standard). Use the `'vtl'` type if you want compatibility with the articulatory synthesizer VocalTractLab-Python.
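As a quick sanity check of the two formats, you can compare the returned arrays. This is a minimal sketch; the array orientation `(n_frames, n_features)` is an assumption based on the slicing example further below:
```python
# Run the same file through both formats and compare shapes
msrs_tt2 = tt2.speech_to_motor(x='path/to/audio.wav', msrs_type='tt2')[0]
msrs_vtl = tt2.speech_to_motor(x='path/to/audio.wav', msrs_type='vtl')[0]

print(msrs_tt2.to_numpy().shape)  # expected: (n_frames, 20), frames at 50 Hz
print(msrs_vtl.to_numpy().shape)  # expected: (n_frames, 30), frames at 441 Hz
```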
## Articulatory Synthesis
This stand-alone implementation of the TensorTract2 model does not include an articulatory synthesizer.
However, you can generate articulatory data that is directly compatible with the articulatory synthesizer [VocalTractLab-Python](https://github.com/paul-krug/VocalTractLab-Python).
```python
# Load speech from an audio file and process it
motor_data = tt2.speech_to_motor(
    x='path/to/audio.wav',
    msrs_type='vtl',
)
# continue to process motor_data with VocalTractLab-Python
```
## Neural Re-synthesis and Voice Conversion
```python
wavs = tt2.speech_to_speech(
    x='path/to/audio.wav',
    # Optional parameters:
    target='path/to/target.wav',
    output='path/to/output.wav',
    time_stretch=None,  # time-stretch factor
    pitch_shift=None,   # pitch shift in semitones
)
wavs  # a list of audio tensors (16 kHz, mono)
```
The parameter `target` is optional. If you provide a target audio file, the model performs voice conversion using the voice characteristics of the target speech; if you don't, the model performs neural re-synthesis. If `output` is provided, the resulting audio is saved to the specified path.
The parameters `x`, `target` and `output` can each be a string or a list of strings. If you provide lists, the model processes each file and saves the resulting audio to the corresponding paths in `output`.
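For example, a batched voice-conversion call, plus manually saving one of the returned tensors, could look like this. This is a sketch: the file names are placeholders, and the reshaping assumes `torchaudio.save` expects a 2-D `(channels, samples)` tensor:
```python
import torchaudio

# Convert two files, each with its own target voice and output path
wavs = tt2.speech_to_speech(
    x=['speaker_a.wav', 'speaker_b.wav'],
    target=['target_a.wav', 'target_b.wav'],
    output=['converted_a.wav', 'converted_b.wav'],
)

# The tensors are also returned, so you can save them manually
wav = wavs[0]
if wav.dim() == 1:
    wav = wav.unsqueeze(0)  # (samples,) -> (1, samples)
torchaudio.save('converted_manual.wav', wav, 16000)
```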
## Fine-grained Speech Manipulation
At the moment, you can only manipulate the articulatory parameters manually, like this:
```python
motor_data = tt2.speech_to_motor(
    x='path/to/audio.wav',
    msrs_type='tt2',
)

m = motor_data[0]  # get the first MotorSeries object

# Manipulate the articulatory parameters (for example TCX):
m['TCX'] *= 1.5  # increase TCX by 50%

# Or directly access the numpy array:
m_np = m.to_numpy()
# m_np[:, v:w] = ...  # any manipulation of feature columns v to w

from target_approximation.tensortract import MotorSeries
m = MotorSeries(m_np, sr=50)  # back to a motor series

# Re-synthesize the audio
wavs = tt2.motor_to_speech(
    msrs=m,
    target='path/to/target.wav',  # provides the voice for synthesis
    # Optional parameters (shown with their default values):
    output=None,        # str or list of str
    time_stretch=None,  # float, time-stretch factor
    pitch_shift=None,   # float, pitch shift in semitones
    msrs_type='tt2',
)
```
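As a concrete, hypothetical manipulation, you could smooth a single parameter trajectory before re-synthesis. This sketch assumes that name-based access on a `MotorSeries` supports assignment as well as reading:
```python
import numpy as np

# Smooth the 'TCX' trajectory with a 5-frame moving average
# (5 frames = 100 ms at the 50 Hz 'tt2' rate)
traj = np.asarray(m['TCX'])
kernel = np.ones(5) / 5.0
m['TCX'] = np.convolve(traj, kernel, mode='same')

# Re-synthesize with the smoothed parameters
wavs = tt2.motor_to_speech(msrs=m, target='path/to/target.wav')
```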
## Speech Denoising
Speech denoising happens automatically: if the input audio is noisy, the model detects the noise and removes it during processing.
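In practice, this means a plain re-synthesis call is enough to obtain a denoised version of a noisy recording (paths are placeholders):
```python
# Re-synthesize a noisy file; the output is the denoised version
clean = tt2.speech_to_speech(
    x='path/to/noisy.wav',
    output='path/to/denoised.wav',
)
```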
# How to cite
If you use this code in your research, please cite the following paper:
```bibtex
@inproceedings{krug2025precisely,
  title={Precisely Controllable Neural Speech Synthesis},
  author={Krug, Paul Konstantin and Wagner, Christoph and Birkholz, Peter and Stich, Timo},
  booktitle={ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  pages={1--5},
  year={2025},
  organization={IEEE}
}
```