**Malaya-Speech** is a speech toolkit for Bahasa Malaysia, powered by TensorFlow and PyTorch.
Documentation
--------------
Documentation for the stable release is available at https://malaya-speech.readthedocs.io/
Installing from PyPI
----------------------------------
::

    $ pip install malaya-speech
This installs all dependencies automatically except for TensorFlow and PyTorch, so you can choose the TensorFlow and PyTorch CPU or GPU builds that suit your machine.
Only **Python >= 3.6.0**, **TensorFlow >= 1.15.0**, and **PyTorch >= 1.10** are supported.
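If you are not sure which builds you already have, here is a quick sanity check you can run before installing. This is plain Python, nothing Malaya-Speech specific, and assumes TensorFlow and PyTorch are already importable:

::

    import sys
    import tensorflow as tf
    import torch

    # Malaya-Speech supports Python >= 3.6.0, TensorFlow >= 1.15.0, PyTorch >= 1.10.
    print(sys.version)
    print('TensorFlow:', tf.__version__)
    print('PyTorch:', torch.__version__)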
Development Release
---------------------------------
Install from the `master` branch:

::

    $ pip install git+https://github.com/huseinzol05/malaya-speech.git
We recommend using **virtualenv** for development.
Development documentation is available at https://malaya-speech.readthedocs.io/en/latest/
Features
--------
- **Age Detection**, detect age in speech using Finetuned Speaker Vector.
- **Speaker Diarization**, diarize speakers using Pretrained Speaker Vector.
- **Emotion Detection**, detect emotions in speech using Finetuned Speaker Vector.
- **Force Alignment**, generate a time-aligned transcription of an audio file using RNNT and CTC.
- **Gender Detection**, detect genders in speech using Finetuned Speaker Vector.
- **Language Detection**, detect hyperlocal languages in speech using Finetuned Speaker Vector.
- **Language Model**, ASR decoder scoring using KenLM, masked language models (BERT and RoBERTa), and GPT2.
- **Multispeaker Separation**, separate multiple speakers using FastSep on 8 kHz audio.
- **Noise Reduction**, reduce multilevel noise using STFT UNET.
- **Speaker Change**, detect changing speakers using Finetuned Speaker Vector.
- **Speaker Overlap**, detect overlapping speakers using Finetuned Speaker Vector.
- **Speaker Vector**, calculate similarity between speakers using Pretrained Speaker Vector.
- **Speech Enhancement**, enhance voice activities using Waveform UNET.
- **SpeechSplit Conversion**, detailed speaking style conversion by disentangling speech into content, timbre, rhythm and pitch using PyWorld and PySPTK.
- **Speech-to-Text**, End-to-End Speech to Text for Malay, Mixed (Malay, Singlish and Mandarin) and Singlish using RNNT, Wav2Vec2, HuBERT and BEST-RQ CTC; see the usage sketch after this list.
- **Super Resolution**, 4x waveform super resolution using ResNet UNET and Neural Vocoder.
- **Text-to-Speech**, Text to Speech for Malay and Singlish using Tacotron2, FastSpeech2, FastPitch, GlowTTS, LightSpeech and VITS.
- **Vocoder**, convert Mel to Waveform using MelGAN, Multiband MelGAN and Universal MelGAN Vocoder.
- **Voice Activity Detection**, detect voice activities using Finetuned Speaker Vector.
- **Voice Conversion**, Many-to-One, One-to-Many, Many-to-Many, and Zero-shot Voice Conversion.
- **Hybrid 8-bit Quantization**, hybrid 8-bit quantization for all models, reducing inference time by up to 2x and model size by up to 4x.
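To give a feel for the API, below is a minimal usage sketch combining Speech-to-Text with the quantized models described above. The names used here (``malaya_speech.load``, ``malaya_speech.stt.deep_transducer``, ``greedy_decoder``, the ``quantized`` flag) are assumptions drawn from the documentation linked above and may differ between releases, so verify them against https://malaya-speech.readthedocs.io/ before relying on this.

::

    import malaya_speech

    # Load a recording; `load` is assumed to return (samples, sample_rate).
    y, sr = malaya_speech.load('speech/example.wav')

    # An RNNT (transducer) Speech-to-Text model; quantized=True is assumed to
    # select the hybrid 8-bit variant (smaller and faster, at a small accuracy cost).
    model = malaya_speech.stt.deep_transducer(model='conformer', quantized=True)

    # Greedy decoding over a batch of utterances.
    print(model.greedy_decoder([y]))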
Pretrained Models
------------------
Malaya-Speech also releases pretrained models; see `malaya-speech/pretrained-model <https://github.com/huseinzol05/malaya-speech/tree/master/pretrained-model>`_.
- **Wave UNET**, Multi-Scale Neural Network for End-to-End Audio Source Separation, https://arxiv.org/abs/1806.03185
- **Wave ResNet UNET**, added ResNet style into Wave UNET, no paper produced.
- **Wave ResNext UNET**, added ResNext style into Wave UNET, no paper produced.
- **Deep Speaker**, An End-to-End Neural Speaker Embedding System, https://arxiv.org/pdf/1705.02304.pdf
- **SpeakerNet**, 1D Depth-wise Separable Convolutional Network for Text-Independent Speaker Recognition and Verification, https://arxiv.org/abs/2010.12653
- **VGGVox**, from VoxCeleb: a large-scale speaker identification dataset, https://arxiv.org/pdf/1706.08612.pdf
- **GhostVLAD**, Utterance-level Aggregation For Speaker Recognition In The Wild, https://arxiv.org/abs/1902.10107
- **Conformer**, Convolution-augmented Transformer for Speech Recognition, https://arxiv.org/abs/2005.08100
- **ALConformer**, a lightweight Conformer, no paper produced.
- **Jasper**, An End-to-End Convolutional Neural Acoustic Model, https://arxiv.org/abs/1904.03288
- **Tacotron2**, Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions, https://arxiv.org/abs/1712.05884
- **FastSpeech2**, Fast and High-Quality End-to-End Text to Speech, https://arxiv.org/abs/2006.04558
- **MelGAN**, Generative Adversarial Networks for Conditional Waveform Synthesis, https://arxiv.org/abs/1910.06711
- **Multi-band MelGAN**, Faster Waveform Generation for High-Quality Text-to-Speech, https://arxiv.org/abs/2005.05106
- **SRGAN**, Modified version of SRGAN to do 1D Convolution, Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network, https://arxiv.org/abs/1609.04802
- **Speech Enhancement UNET**, https://github.com/haoxiangsnr/Wave-U-Net-for-Speech-Enhancement
- **Speech Enhancement ResNet UNET**, Added ResNet style into Speech Enhancement UNET, no paper produced.
- **Speech Enhancement ResNext UNET**, Added ResNext style into Speech Enhancement UNET, no paper produced.
- **Universal MelGAN**, Universal MelGAN: A Robust Neural Vocoder for High-Fidelity Waveform Generation in Multiple Domains, https://arxiv.org/abs/2011.09631
- **FastVC**, Faster and Accurate Voice Conversion using Transformer, no paper produced.
- **FastSep**, Faster and Accurate Speech Separation using Transformer, no paper produced.
- **wav2vec 2.0**, A Framework for Self-Supervised Learning of Speech Representations, https://arxiv.org/abs/2006.11477
- **FastSpeechSplit**, Unsupervised Speech Decomposition Via Triple Information Bottleneck using Transformer, no paper produced.
- **Sepformer**, Attention is All You Need in Speech Separation, https://arxiv.org/abs/2010.13154
- **HuBERT**, Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units, https://arxiv.org/pdf/2106.07447v1.pdf
- **FastPitch**, Parallel Text-to-speech with Pitch Prediction, https://arxiv.org/abs/2006.06873
- **GlowTTS**, A Generative Flow for Text-to-Speech via Monotonic Alignment Search, https://arxiv.org/abs/2005.11129
- **BEST-RQ**, Self-supervised learning with random-projection quantizer for speech recognition, https://arxiv.org/pdf/2202.01855.pdf
- **LightSpeech**, Lightweight and Fast Text to Speech with Neural Architecture Search, https://arxiv.org/abs/2102.04040
- **VITS**, Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech, https://arxiv.org/abs/2106.06103
- **Squeezeformer**, An Efficient Transformer for Automatic Speech Recognition, https://arxiv.org/abs/2206.00888
References
-----------
If you use our software for research, please cite:
::
    @misc{Malaya-Speech,
      author = {Husein, Zolkepli},
      title = {Malaya-Speech, Speech-Toolkit library for bahasa Malaysia, powered by Deep Learning Tensorflow},
      year = {2020},
      publisher = {GitHub},
      journal = {GitHub repository},
      howpublished = {\url{https://github.com/huseinzol05/malaya-speech}}
    }
Acknowledgement
----------------
Thanks to `KeyReply <https://www.keyreply.com/>`_ for the private V100 cloud and `Mesolitica <https://mesolitica.com/>`_ for the private RTX cloud used to train Malaya-Speech models.