ichigo

- Name: ichigo
- Version: 0.0.10
- Summary: Ichigo is an open, ongoing research experiment to extend a text-based LLM to have native listening ability. Think of it as an open-data, open-weight, on-device Siri.
- Upload time: 2025-02-17 02:26:25
- Requires Python: >=3.10
- License: Apache-2.0
- Keywords: asr, llm, tts, ichigo

<div align="center">

# :strawberry: Ichigo: A simple speech package optimised for local inference
<a href='https://homebrew.ltd/blog/llama3-just-got-ears'><img src='https://img.shields.io/badge/Project-Blog-Green'></a>
<a href='https://ichigo.homebrew.ltd/'><img src='https://img.shields.io/badge/Project-Demo-violet'></a>
<a href='https://arxiv.org/pdf/2410.15316'><img src='https://img.shields.io/badge/Paper-Arxiv-red'></a>
<a href='https://huggingface.co/homebrewltd'><img src='https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Models-blue'></a>
<a href='https://huggingface.co/homebrewltd'><img src='https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Data-green'></a>
<a href='https://colab.research.google.com/drive/18IiwN0AzBZaox5o0iidXqWD1xKq11XbZ?usp=sharing'><img src='https://colab.research.google.com/assets/colab-badge.svg'></a>

[**About**](#about) | [**Installation**](#installation) | [**Ichigo-ASR**](#ichigo-asr) | [**Ichigo-LLM**](#ichigo-llm)

<img src="assets/ichigo.jpeg" width="400"/>
</div>

## About

This package does three things:
1) Automatic Speech Recognition: [**Ichigo-ASR**](#ichigo-asr)
2) Text to Speech: Coming Soon
3) Speech Language Model: [**Ichigo-LLM**](#ichigo-llm) (experimental)

It contains only inference code and covers most local inference use cases for these three tasks.

## Installation

To get started, simply install the package.

```bash
pip install ichigo
```

## Ichigo-ASR

Ichigo-ASR is a compact (22M-parameter), open-source speech tokenizer for the `Whisper-medium` model, designed to improve multilingual performance with minimal impact on the model's original English capabilities. Unlike models that output continuous embeddings, Ichigo-ASR compresses speech into discrete tokens, making it more compatible with large language models (LLMs) for immediate speech understanding. The tokenizer was trained on ~400 hours of English data and ~1000 hours of Vietnamese data.
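
To make "discrete tokens" concrete, here is a toy vector-quantization sketch. This is illustrative only, not Ichigo-ASR's actual quantizer; the codebook size and embedding dimension are made up. A continuous frame embedding is snapped to its nearest codebook entry, and that entry's index becomes a sound token:

```python
import numpy as np

# Toy vector quantization; illustrative only, not Ichigo-ASR's real
# quantizer. Codebook size and embedding dimension are invented here.
rng = np.random.default_rng(0)
codebook = rng.normal(size=(2048, 8))  # 2048 hypothetical code vectors
frame = rng.normal(size=8)             # one continuous speech-frame embedding

# Snap the frame to its nearest codebook entry; the index is the token id.
idx = int(np.argmin(np.linalg.norm(codebook - frame, axis=1)))
print(f"<|sound_{idx:04d}|>")          # same token format as the R2T example below
```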

### Batch Processing

The ichigo package can batch-process audio files with a single line of code, with additional parameters available for finer control; a small post-processing sketch follows the examples below.

1. For single files

```python
# Quick one-liner for transcription
from ichigo.asr import transcribe
results = transcribe("path/to/your/file")
# Expected output: {filename: transcription}
```
A `transcription.txt` will also be stored in the same folder as `path/to/your/file`.

2. For multiple files (folder)

```python
# Quick one-liner for transcription
from ichigo.asr import transcribe
results = transcribe("path/to/your/folder")
# Expected output: {filename1: transcription1, filename2: transcription2, ..., filenameN: transcriptionN}
```
A subfolder will be created in `path/to/your/folder` and transcriptions will be stored as `filenameN.txt` in the subfolder.
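
Since `transcribe` returns a plain dict, the results can be post-processed directly. A minimal sketch, assuming the `{filename: transcription}` return shape shown above:

```python
from ichigo.asr import transcribe

# Post-process the returned dict; assumes the {filename: transcription}
# shape shown in the examples above.
results = transcribe("path/to/your/folder")
for name, text in sorted(results.items()):
    print(f"{name}: {len(text.split())} words")
```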

### API

For integration with a frontend, a Python FastAPI server is also available. The API supports batch processing as well. Streaming is currently not supported.

1. Start the server

```bash
# Uvicorn
cd api && uvicorn asr:app --host 0.0.0.0 --port 8000

# or Docker 
docker compose up -d
```

2. Call the endpoints with curl
```bash
# S2T: speech to text (transcription)
curl "http://localhost:8000/v1/audio/transcriptions" \
  -H "accept: application/json" \
  -H "Content-Type: multipart/form-data" \
  -F "file=@sample.wav" -F "model=ichigo"

# S2R: speech to discrete sound-token representation
curl "http://localhost:8000/s2r" \
  -H "accept: application/json" \
  -H "Content-Type: multipart/form-data" \
  -F "file=@sample.wav"

# R2T: sound-token representation back to text
curl "http://localhost:8000/r2t" -X POST \
  -H "accept: application/json" \
  -H "Content-Type: application/json" \
  --data '{"tokens":"<|sound_start|><|sound_1012|><|sound_1508|><|sound_1508|><|sound_0636|><|sound_1090|><|sound_0567|><|sound_0901|><|sound_0901|><|sound_1192|><|sound_1820|><|sound_0547|><|sound_1999|><|sound_0157|><|sound_0157|><|sound_1454|><|sound_1223|><|sound_1223|><|sound_1223|><|sound_1223|><|sound_1808|><|sound_1808|><|sound_1573|><|sound_0065|><|sound_1508|><|sound_1508|><|sound_1268|><|sound_0568|><|sound_1745|><|sound_1508|><|sound_0084|><|sound_1768|><|sound_0192|><|sound_1048|><|sound_0826|><|sound_0192|><|sound_0517|><|sound_0192|><|sound_0826|><|sound_0971|><|sound_1845|><|sound_1694|><|sound_1048|><|sound_0192|><|sound_1048|><|sound_1268|><|sound_end|>"}'
```

You can also access the API documentation at `http://localhost:8000/docs`
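
For programmatic access, the transcription endpoint can also be called from Python. A minimal sketch using `requests` (our choice here; any HTTP client works), assuming the server from step 1 is running on the default host and port:

```python
import requests

# Call the S2T endpoint; mirrors the curl command above. Check /docs for
# the exact response schema.
with open("sample.wav", "rb") as f:
    resp = requests.post(
        "http://localhost:8000/v1/audio/transcriptions",
        files={"file": ("sample.wav", f, "audio/wav")},
        data={"model": "ichigo"},
    )
resp.raise_for_status()
print(resp.json())
```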

## Ichigo-LLM

:strawberry: Ichigo-LLM is an open, ongoing research experiment to extend a text-based LLM to have native "listening" ability. Think of it as an open-data, open-weight, on-device Siri.

It uses an [early fusion](https://medium.com/@raj.pulapakura/multimodal-models-and-fusion-a-complete-guide-225ca91f6861#:~:text=3.3.,-Early%20Fusion&text=Early%20fusion%20refers%20to%20combining,fused%20representation%20through%20the%20model.) technique inspired by [Meta's Chameleon paper](https://arxiv.org/abs/2405.09818).
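
Concretely, early fusion here means that quantized sound tokens (the same `<|sound_XXXX|>` format as the R2T payload above) and ordinary text tokens share a single input sequence. A schematic sketch, not the actual training code:

```python
# Schematic only: with early fusion, discrete sound tokens and text tokens
# live in one token stream (token format taken from the R2T example above).
sound = "<|sound_start|><|sound_1012|><|sound_1508|><|sound_end|>"
prompt = f"{sound} Transcribe the audio above."
# The fused string is tokenized with the LLM's extended vocabulary and
# processed exactly like ordinary text.
print(prompt)
```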

We ~~build~~ train in public:
- [Ichigo v0.3 Checkpoint Writeup](https://homebrew.ltd/blog/llama-learns-to-talk)
- [Ichigo v0.2 Checkpoint Writeup](https://homebrew.ltd/blog/llama3-just-got-ears)
- [Ichigo v0.1 Checkpoint Writeup](https://homebrew.ltd/blog/can-llama-3-listen)

## Ichigo-TTS

Coming Soon
  
## Join Us

:strawberry: Ichigo-LLM and 🍰 Ichigo-ASR are open research projects. We're looking for collaborators and will likely move toward crowdsourcing speech datasets in the future.

## References

```bibtex
@article{dao2024ichigo,
  title={Ichigo: Mixed-Modal Early-Fusion Realtime Voice Assistant},
  author={Dao, Alan and Vu, Dinh Bach and Ha, Huy Hoang},
  journal={arXiv preprint arXiv:2410.15316},
  year={2024}
}

@misc{chameleonteam2024chameleonmixedmodalearlyfusionfoundation,
      title={Chameleon: Mixed-Modal Early-Fusion Foundation Models},
      author={Chameleon Team},
      year={2024},
      eprint={2405.09818},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

@misc{WhisperSpeech,
      title={WhisperSpeech: An Open Source Text-to-Speech System Built by Inverting Whisper}, 
      author={Collabora and LAION},
      year={2024},
      url={https://github.com/collabora/WhisperSpeech},
      note={GitHub repository}
}
```

## Acknowledgement

- [torchtune](https://github.com/pytorch/torchtune): the codebase we built upon
- [WhisperSpeech](https://github.com/collabora/WhisperSpeech): text-to-speech model used for synthetic audio generation
- [llama3](https://huggingface.co/collections/meta-llama/meta-llama-3-66214712577ca38149ebb2b6): the family of models we built on for its strong language capabilities

            
