# Cued Speech Processing Tools
Python package for decoding and generating cued speech videos with MediaPipe and deep learning.
## Features
- **Decoder**: Convert cued speech videos to text with subtitles using neural networks and language models
- **Generator**: Create cued speech videos from text with automatic hand gesture overlay
- **Automatic Data Management**: Downloads required models and data automatically
## Installation
### Prerequisites
- Python 3.11 (the package requires `>=3.11,<3.12`)
- Pixi (for Montreal Forced Aligner)
### Setup Steps
1. **Install Pixi**
```bash
# macOS/Linux
curl -fsSL https://pixi.sh/install.sh | bash
# Windows PowerShell
irm https://pixi.sh/install.ps1 | iex
```
2. **Create Pixi environment**
```bash
mkdir cued-speech-env && cd cued-speech-env
pixi init
pixi add "python==3.11"
pixi add montreal-forced-aligner=3.3.4
```
3. **Install package**
```bash
pixi run python -m pip install cued-speech
```
4. **Download data and setup MFA models**
```bash
pixi shell
cued-speech download-data
# Still inside the pixi shell, register the downloaded MFA models
mfa models save acoustic download/french_mfa.zip --overwrite
mfa models save dictionary download/french_mfa.dict --overwrite
```
5. **Verify installation**
```bash
cued-speech --help
```
## Quick Start
### Decode Video (Cued Speech → Text)
```bash
# Basic usage with default parameters (uses the provided test video)
cued-speech decode
# Custom video
cued-speech decode --video_path /path/to/video.mp4
```
### Generate Video (Text → Cued Speech)
```bash
# Text extracted automatically from video audio
cued-speech generate input_video.mp4
# Skip Whisper and supply the text manually
cued-speech generate video.mp4 --skip-whisper --text "Votre texte ici"
```
## Command Line Options
### Decoder
**Core Options:**
- `--video_path PATH` - Input video (default: `download/test_decode.mp4`)
- `--output_path PATH` - Output video (default: `output/decoder/decoded_video.mp4`)
- `--right_speaker [True|False]` - `True` if the speaker cues with the right hand (default: `True`)
- `--auto_download [True|False]` - Auto-download data (default: `True`)
**Model Paths (optional):**
- `--model_path PATH` - Neural network model
- `--vocab_path PATH` - Phoneme vocabulary
- `--face_tflite PATH` - Face landmark model
- `--hand_tflite PATH` - Hand landmark model
- `--pose_tflite PATH` - Pose landmark model
### Generator
**Options:**
- `VIDEO_PATH` (required) - Input video file
- `--text TEXT` - Manual text input (optional, otherwise extracted from audio)
- `--output_path PATH` - Output video (default: `output/generator/generated_cued_speech.mp4`)
- `--language LANG` - Language (default: `french`)
- `--skip-whisper` - Skip Whisper transcription (requires `--text`)
- `--easing TYPE` - Animation easing: `linear`, `ease_in_out_cubic`, `ease_out_elastic`, `ease_in_out_back` (sketched below)
- `--morphing/--no-morphing` - Hand shape morphing (default: enabled)
- `--transparency/--no-transparency` - Transparency effects (default: enabled)
- `--curving/--no-curving` - Curved trajectories (default: enabled)
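The easing names follow the standard animation conventions. As a reference point, here is a minimal sketch of what `linear` and `ease_in_out_cubic` compute over normalized time `t` in `[0, 1]`; the package's exact curves may differ:

```python
def linear(t: float) -> float:
    return t

def ease_in_out_cubic(t: float) -> float:
    # Standard cubic in-out: slow start and end, fast middle.
    return 4 * t**3 if t < 0.5 else 1 - ((-2 * t + 2) ** 3) / 2

# At a quarter of the transition, the cubic curve lags the linear one:
print(linear(0.25), ease_in_out_cubic(0.25))  # 0.25 0.0625
```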
## Python API
### Decoder
```python
from cued_speech import decode_video
decode_video(
    video_path="input.mp4",
    right_speaker=True,
    output_path="output/decoder/",
)
```
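Since `decode_video` takes a single path, batch processing is just a loop. A sketch, where the `videos/` directory is an assumption for illustration:

```python
from pathlib import Path
from cued_speech import decode_video

# Decode every .mp4 in a folder with the same settings.
for video in sorted(Path("videos").glob("*.mp4")):
    decode_video(
        video_path=str(video),
        right_speaker=True,
        output_path="output/decoder/",
    )
```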
### Generator
```python
from cued_speech import generate_cue
import whisper
# Automatic text extraction
model = whisper.load_model("medium", download_root="download")
result_path = generate_cue(
    text=None,  # extracted from the video's audio
    video_path="video.mp4",
    output_path="output/generator/",
    config={
        "model": model,  # optional preloaded Whisper model
        "language": "french",
        "easing_function": "ease_in_out_cubic",
        "enable_morphing": True,
        "enable_transparency": True,
        "enable_curving": True,
    },
)

# With manual text
result_path = generate_cue(
    text="Bonjour tout le monde",
    video_path="video.mp4",
    output_path="output/generator/",
    config={"skip_whisper": True},
)
```
## Data Management
```bash
# Download all required data
cued-speech download-data
# List available data
cued-speech list-data
# Clean up data
cued-speech cleanup-data --confirm
```
### Downloaded Files
Data is stored in `./download/`:
**Decoder:**
- `cuedspeech-model.pt` - Neural network model
- `phonelist.csv`, `lexicon.txt` - Vocabularies
- `kenlm_fr.bin`, `kenlm_ipa.binary` - Language models
- `homophones_dico.jsonl` - Homophone dictionary
- `face_landmarker.task` - Face landmarks (478 points, 3.6 MB)
- `hand_landmarker.task` - Hand landmarks (21 points/hand, 7.5 MB)
- `pose_landmarker_full.task` - Pose landmarks (33 points, 9.0 MB)
**Generator:**
- `rotated_images/` - Hand shape images
- `french_mfa.dict`, `french_mfa.zip` - MFA models
**Test Files:**
- `test_decode.mp4`, `test_generate.mp4`
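If a run fails with a missing-file error, a quick presence check over the decoder assets listed above can help. A sketch using only the filenames documented here:

```python
from pathlib import Path

REQUIRED = [
    "cuedspeech-model.pt", "phonelist.csv", "lexicon.txt",
    "kenlm_fr.bin", "kenlm_ipa.binary", "homophones_dico.jsonl",
    "face_landmarker.task", "hand_landmarker.task", "pose_landmarker_full.task",
]

missing = [name for name in REQUIRED if not (Path("download") / name).exists()]
if missing:
    print("Missing files; run `cued-speech download-data`:", ", ".join(missing))
```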
## Architecture
### Decoder
- **MediaPipe Tasks API**: float16 `.task` models for landmark detection
- **Neural Network**: Three-stream fusion encoder (hand shape, position, lips)
- **CTC Decoder**: Phoneme recognition with beam search (the collapsing rule is sketched below)
- **Language Model**: KenLM for French sentence correction
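To make the CTC step concrete, here is a toy greedy ("best path") decode that collapses repeated phonemes and drops blanks. The package itself uses beam search with KenLM scoring; this sketch only illustrates the collapsing rule, and the phoneme vocabulary is made up:

```python
import numpy as np

BLANK = 0  # index of the CTC blank symbol (an assumption for this sketch)

def greedy_ctc_decode(logits: np.ndarray, vocab: list[str]) -> list[str]:
    """logits: (time, num_symbols) array of per-frame scores."""
    best = logits.argmax(axis=1)
    # Collapse consecutive repeats, then remove blanks.
    collapsed = [p for i, p in enumerate(best) if i == 0 or p != best[i - 1]]
    return [vocab[p] for p in collapsed if p != BLANK]

vocab = ["<blank>", "b", "o~"]  # toy phoneme set
logits = np.array([
    [0.1, 0.8, 0.1],    # frame 1: "b"
    [0.1, 0.8, 0.1],    # frame 2: "b" (repeat, collapsed)
    [0.9, 0.05, 0.05],  # frame 3: blank (dropped)
    [0.1, 0.1, 0.8],    # frames 4-5: "o~"
    [0.1, 0.1, 0.8],
])
print(greedy_ctc_decode(logits, vocab))  # ['b', 'o~']
```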
### Generator
- **Whisper**: Speech-to-text transcription
- **MFA**: Montreal Forced Aligner for phoneme-level timing
- **Dynamic Scaling**: Hand size automatically adapts to the detected face width (sketched below)
- **Hand Rendering**: MediaPipe-based hand landmark detection for accurate positioning
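The dynamic scaling amounts to a simple ratio. A minimal sketch, where the 180 px reference width is a hypothetical constant (the package's actual reference and landmark choice may differ):

```python
def hand_scale(face_width_px: float, reference_width_px: float = 180.0) -> float:
    """Factor applied to hand-shape images so hand size tracks the detected face."""
    return face_width_px / reference_width_px

# A face detected at 240 px wide draws hands at ~1.33x the base size.
print(round(hand_scale(240.0), 2))  # 1.33
```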
## Notes
- Models designed for 30 FPS videos
- Hand size automatically scales based on detected face width
- Landmark models load via the MediaPipe Tasks API (`.task` files) and fall back to the TFLite Interpreter (`.tflite`)
- Automatic fallback to MediaPipe Holistic if the TFLite models fail (see the sketch below)
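The loading order in the last two notes reduces to a try-in-sequence pattern. A minimal sketch with hypothetical stand-in loaders; the package's real loaders and exceptions differ:

```python
def load_tasks_model(path: str):
    raise FileNotFoundError(path)  # stand-in: pretend the .task model is missing

def load_tflite_model(path: str):
    return f"tflite:{path}"        # stand-in: pretend TFLite loading succeeds

def load_holistic():
    return "mediapipe-holistic"    # final fallback

def load_landmarker(task_path: str, tflite_path: str):
    # Try the Tasks API first, then the TFLite Interpreter, then Holistic.
    for loader, path in ((load_tasks_model, task_path),
                         (load_tflite_model, tflite_path)):
        try:
            return loader(path)
        except Exception:
            continue
    return load_holistic()

print(load_landmarker("download/hand_landmarker.task", "hand_landmarker.tflite"))
```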
## License
MIT License - see LICENSE file
## Support
Contact: boubasow.pro@gmail.com
Issues: https://github.com/boubacar-sow/cued-speech/issues