| Field | Value |
| --- | --- |
| Name | speech-dataset-parser |
| Version | 0.0.4 |
| Summary | Library to parse speech datasets stored in a generic format based on TextGrids. A tool (CLI) for converting common datasets like LJ Speech into a generic format is included. |
| Upload time | 2023-01-12 13:55:18 |
| Requires Python | <4,>=3.7 |
| License | MIT |
| Keywords | text-to-speech, speech synthesis, corpus, utils, language, linguistics |
# speech-dataset-parser
[![PyPI](https://img.shields.io/pypi/v/speech-dataset-parser.svg)](https://pypi.python.org/pypi/speech-dataset-parser)
[![PyPI](https://img.shields.io/pypi/pyversions/speech-dataset-parser.svg)](https://pypi.python.org/pypi/speech-dataset-parser)
[![MIT](https://img.shields.io/github/license/stefantaubert/speech-dataset-parser.svg)](https://github.com/stefantaubert/speech-dataset-parser/blob/main/LICENSE)
[![PyPI](https://img.shields.io/pypi/wheel/speech-dataset-parser.svg)](https://pypi.python.org/pypi/speech-dataset-parser)
[![PyPI](https://img.shields.io/pypi/implementation/speech-dataset-parser.svg)](https://pypi.python.org/pypi/speech-dataset-parser)
[![PyPI](https://img.shields.io/github/commits-since/stefantaubert/speech-dataset-parser/latest/master.svg)](https://pypi.python.org/pypi/speech-dataset-parser)
[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.7529425.svg)](https://doi.org/10.5281/zenodo.7529425)
Library to parse speech datasets stored in a generic format based on TextGrids. A tool (CLI) for converting common datasets like LJ Speech into a generic format is included.
Speech datasets consist of pairs of .TextGrid and .wav files. Each TextGrid needs to contain a tier in which every symbol occupies its own interval, e.g., `T|h|i|s| |i|s| |a| |t|e|x|t|.`
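The interval marks in such a tier are simply the individual characters of the utterance. Splitting a text into per-interval symbols can be sketched as follows (plain Python, not a library function):

```python
# Sketch: split an utterance into per-interval symbols,
# one character per interval, as in the tier example above.
text = "This is a text."
symbols = list(text)  # one symbol (interval mark) per character
print("|".join(symbols))  # -> T|h|i|s| |i|s| |a| |t|e|x|t|.
```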
## Generic Format
The format is as follows: `{Dataset name}/{Speaker name};{Speaker gender};{Speaker language}[;{Speaker accent}]/[Subfolder(s)]/{Recordings as .wav- and .TextGrid-pairs}`
Example: `LJ Speech/Linda Johnson;2;eng;North American/wavs/...`
Speaker names can be any string (excluding `;` symbols).
Genders are defined via their [ISO/IEC 5218 Code](https://en.wikipedia.org/wiki/ISO/IEC_5218).
Languages are defined via their [ISO 639-2 Code](https://www.loc.gov/standards/iso639-2/php/code_list.php) (bibliographic).
Accents are optional and can be any string (excluding `;` symbols).
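The speaker directory name thus encodes up to four `;`-separated fields. Parsing it can be sketched as below; the helper `parse_speaker_folder` is hypothetical and not part of the library's API:

```python
from typing import Optional, Tuple

def parse_speaker_folder(folder_name: str) -> Tuple[str, int, str, Optional[str]]:
  """Parse a hypothetical '{name};{gender};{language}[;{accent}]' folder name."""
  parts = folder_name.split(";")
  if len(parts) not in (3, 4):
    raise ValueError("expected 'name;gender;language[;accent]'")
  name, gender, language = parts[0], int(parts[1]), parts[2]
  accent = parts[3] if len(parts) == 4 else None  # accent is optional
  return name, gender, language, accent

# The speaker folder from the example "LJ Speech/Linda Johnson;2;eng;North American/wavs/..."
print(parse_speaker_folder("Linda Johnson;2;eng;North American"))
# -> ('Linda Johnson', 2, 'eng', 'North American')
```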
## Installation
```sh
pip install speech-dataset-parser --user
```
## Library Usage
```py
from speech_dataset_parser import parse_dataset
entries = list(parse_dataset({folder}, {grid-tier-name}))
```
The resulting `entries` list contains dataclass instances with these properties:
- `symbols: Tuple[str, ...]`: contains the mark of each interval
- `intervals: Tuple[float, ...]`: contains the max-time of each interval
- `symbols_language: str`: contains the language
- `speaker_name: str`: contains the name of the speaker
- `speaker_accent: str`: contains the accent of the speaker
- `speaker_gender: int`: contains the gender of the speaker
- `audio_file_abs: Path`: contains the absolute path to the speech audio
- `min_time: float`: the min-time of the grid
- `max_time: float`: the max-time of the grid (equal to `intervals[-1]`)
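Since `intervals` stores each interval's max-time, per-symbol durations can be recovered by differencing each boundary against the previous one (starting from `min_time`). A small self-contained sketch; the `Entry` dataclass here is a hypothetical stand-in for the library's entry class, mirroring only the fields listed above:

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass
class Entry:
  # hypothetical stand-in mirroring the fields described above
  symbols: Tuple[str, ...]
  intervals: Tuple[float, ...]  # max-time of each interval
  min_time: float

def symbol_durations(entry: Entry) -> Tuple[float, ...]:
  # each interval starts where the previous one ended (or at min_time)
  starts = (entry.min_time,) + entry.intervals[:-1]
  return tuple(end - start for start, end in zip(starts, entry.intervals))

e = Entry(symbols=("H", "i"), intervals=(0.4, 1.0), min_time=0.0)
print(symbol_durations(e))  # -> (0.4, 0.6)
```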
## CLI Usage
```txt
usage: dataset-converter-cli [-h] [-v] {convert-ljs,convert-l2arctic,convert-thchs,convert-thchs-cslt,restore-structure} ...
This program converts common speech datasets into a generic representation.

positional arguments:
  {convert-ljs,convert-l2arctic,convert-thchs,convert-thchs-cslt,restore-structure}
                        description
    convert-ljs         convert LJ Speech dataset to a generic dataset
    convert-l2arctic    convert L2-ARCTIC dataset to a generic dataset
    convert-thchs       convert THCHS-30 (OpenSLR Version) dataset to a generic dataset
    convert-thchs-cslt  convert THCHS-30 (CSLT Version) dataset to a generic dataset
    restore-structure   restore original dataset structure of generic datasets

optional arguments:
  -h, --help            show this help message and exit
  -v, --version         show program's version number and exit
```
## CLI Example
```sh
# Convert LJ Speech dataset with symbolic links to the audio files
dataset-converter-cli convert-ljs \
  "/data/datasets/LJSpeech-1.1" \
  "/tmp/ljs" \
  --tier "Symbols" \
  --symlink
```
## Dependencies
- `tqdm`
- `TextGrid>=1.5`
- `ordered_set>=4.1.0`
- `importlib_resources; python_version < '3.8'`
## Roadmap
- Supporting conversion of more datasets
- Adding more tests
## Contributing
If you notice an error, please don't hesitate to open an issue.
### Development setup
```sh
# update
sudo apt update
# install Python 3.7, 3.8, 3.9, 3.10 and 3.11 to ensure that the tests can be run on all supported versions
sudo apt install python3-pip \
  python3.7 python3.7-dev python3.7-distutils python3.7-venv \
  python3.8 python3.8-dev python3.8-distutils python3.8-venv \
  python3.9 python3.9-dev python3.9-distutils python3.9-venv \
  python3.10 python3.10-dev python3.10-distutils python3.10-venv \
  python3.11 python3.11-dev python3.11-distutils python3.11-venv
# install pipenv for creation of virtual environments
python3.8 -m pip install pipenv --user
# check out repo
git clone https://github.com/stefantaubert/speech-dataset-parser.git
cd speech-dataset-parser
# create virtual environment
python3.8 -m pipenv install --dev
```
## Running the tests
```sh
# first install the tool like in "Development setup"
# then, navigate into the directory of the repo (if not already done)
cd speech-dataset-parser
# activate environment
python3.8 -m pipenv shell
# run tests
tox
```
Final lines of test result output:
```log
py37: commands succeeded
py38: commands succeeded
py39: commands succeeded
py310: commands succeeded
py311: commands succeeded
congratulations :)
```
## License
MIT License
## Acknowledgments
Funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) – Project-ID 416228727 – CRC 1410
## Citation
If you want to cite this repo, you can use the BibTeX entry generated by GitHub (see *About => Cite this repository*).
## Changelog
- v0.0.4 (2023-01-12)
  - Added:
    - Added support for parsing the [OpenSLR THCHS-30 version](https://www.openslr.org/18/)
    - Added returning of an exit code
  - Changed:
    - Changed the default THCHS-30 command to parse the OpenSLR version; the previous command was renamed to `convert-thchs-cslt`
- v0.0.3 (2023-01-02)
  - Added option to restore the original file structure
  - Added option for THCHS-30 to opt in to adding punctuation
  - Changed file naming format to zero-padded numbers
- v0.0.2 (2022-09-08)
  - Added support for L2-ARCTIC
  - Added support for THCHS-30
- v0.0.1 (2022-06-03)
  - Initial release