# LLMVision
Visualize how LLMs tokenize text.
```python
from llmvision import tokenize_and_visualize, GPT4Tokenizer
text = "Hello world! 👋🌍"
print(tokenize_and_visualize(text, GPT4Tokenizer()))
# Output: Hello│ world│!│<bytes:20f09f>│<bytes:91>│<bytes:8b>│<bytes:f09f>│<bytes:8c>│<bytes:8d>
```
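The `<bytes:...>` markers in the output above are fragments of the UTF-8 encoding of each emoji, split across several tokens. The following is a small standard-library sketch (it does not use llmvision; `join_fragments` is a hypothetical helper name) showing that the fragments, once concatenated, decode back to the original characters:

```python
# Hex fragments copied from the <bytes:...> markers in the example output.
wave_parts = ["20f09f", "91", "8b"]   # leading space plus the four bytes of U+1F44B
globe_parts = ["f09f", "8c", "8d"]    # the four bytes of U+1F30D


def join_fragments(hex_parts: list[str]) -> str:
    """Concatenate hex-encoded byte fragments and decode the result as UTF-8."""
    raw = b"".join(bytes.fromhex(part) for part in hex_parts)
    return raw.decode("utf-8")


wave = join_fragments(wave_parts)
globe = join_fragments(globe_parts)
print(repr(wave), repr(globe))

# A single fragment is not valid UTF-8 on its own, which is presumably
# why it is displayed as raw bytes rather than as text.
try:
    bytes.fromhex("91").decode("utf-8")
except UnicodeDecodeError:
    print("fragment '91' cannot be decoded by itself")
```

Each emoji occupies four bytes in UTF-8, and in this example no single token covers all four.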
## Features
- Multiple tokenizers: GPT-2, GPT-4, byte-level, character-level
- Visual token boundaries
- Unicode/emoji handling
- Uses the actual tokenization of OpenAI models (GPT-2 and GPT-4, via tiktoken)
## Installation
```bash
pip install llmvision
```
## Usage
```bash
llmvision "Hello world!"
llmvision "Hello world!" --tokenizer gpt4
llmvision "Hello world!" --indices
```
```python
from llmvision import tokenize_and_visualize, GPT4Tokenizer
# Default tokenizer
print(tokenize_and_visualize("Hello world!"))
# Specific tokenizer
print(tokenize_and_visualize("Hello world!", GPT4Tokenizer()))
```
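To make the visualized format concrete, here is a minimal sketch of how a list of byte-string tokens could be rendered with boundary markers. It is an illustration only, not llmvision's implementation: `render_tokens` and `SEPARATOR` are hypothetical names, and the input tokens are assumed to be raw `bytes`.

```python
SEPARATOR = "\u2502"  # box-drawing vertical bar used as the token boundary


def render_tokens(tokens: list[bytes], separator: str = SEPARATOR) -> str:
    """Join tokens with a separator, showing undecodable tokens as hex."""
    parts = []
    for token in tokens:
        try:
            parts.append(token.decode("utf-8"))
        except UnicodeDecodeError:
            # Partial UTF-8 sequences cannot be shown as text.
            parts.append(f"<bytes:{token.hex()}>")
    return separator.join(parts)


print(render_tokens([b"Hello", b" world", b"!"]))
print(render_tokens([b" \xf0\x9f", b"\x91", b"\x8b"]))
```

Falling back to a hex dump keeps every token visible even when a multi-byte character is split across token boundaries.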
## Examples
```python
from llmvision import GPT4Tokenizer
tokenizer = GPT4Tokenizer()
tokens = tokenizer.tokenize("Hello world!")
print(tokens) # ['Hello', ' world', '!']
```
## Tokenizers
- `SimpleTokenizer` - word/punctuation/space
- `WordTokenizer` - whitespace-based
- `CharTokenizer` - character-level
- `GraphemeTokenizer` - Unicode grapheme clusters
- `ByteLevelTokenizer` - UTF-8 bytes
- `GPT2Tokenizer` - GPT-2 (via tiktoken)
- `GPT4Tokenizer` - GPT-4 (via tiktoken)
- `SubwordTokenizer` - basic subword splitting
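The granularities in this list differ mainly in what counts as one unit. The sketch below approximates three of them with the standard library alone. The function names are hypothetical and the behavior is an assumption about what "whitespace-based", "character-level", and "UTF-8 bytes" mean, so the llmvision classes may differ in detail (for example, in whether whitespace is kept). Grapheme clusters are left out because the standard library has no segmentation routine for them.

```python
def whitespace_tokens(text: str) -> list[str]:
    """Split on runs of whitespace, discarding the whitespace itself."""
    return text.split()


def char_tokens(text: str) -> list[str]:
    """One token per Unicode code point."""
    return list(text)


def byte_tokens(text: str) -> list[int]:
    """One token per byte of the UTF-8 encoding."""
    return list(text.encode("utf-8"))


sample = "Hi \u4e16\u754c"  # "Hi" followed by two CJK characters
print(whitespace_tokens(sample))
print(char_tokens(sample))
print(byte_tokens(sample))
```

The same five-character string yields 2, 5, and 9 units respectively, because each CJK character takes three bytes in UTF-8.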
## Token Costs
```python
from llmvision import GPT4Tokenizer

tokenizer = GPT4Tokenizer()
examples = [
    "Hello world!",  # 3 tokens
    "Hello 世界!",  # 5 tokens
    "Hello 👋🌍!",  # 8 tokens
    "👨‍👩‍👧‍👦",  # 18 tokens
]
for text in examples:
    print(f"{text:15} → {len(tokenizer.tokenize(text))} tokens")
```
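One way to read these counts: non-ASCII text costs more because it occupies more UTF-8 bytes, and a byte-level BPE tokenizer (assumed here to be what tiktoken implements for GPT-4) never needs more tokens than there are bytes. The sketch below uses only the standard library and compares the token counts quoted in the comments above against code point and byte counts. `size_report` is a hypothetical helper, and the token counts are copied from this README rather than computed.

```python
# Token counts as quoted in the README comments (not recomputed here).
readme_token_counts = {
    "Hello world!": 3,
    "Hello \u4e16\u754c!": 5,
    "Hello \U0001F44B\U0001F30D!": 8,
    # Family emoji: four people joined by three zero-width joiners (U+200D).
    "\U0001F468\u200d\U0001F469\u200d\U0001F467\u200d\U0001F466": 18,
}


def size_report(text: str) -> tuple[int, int]:
    """Return (number of code points, number of UTF-8 bytes) for text."""
    return len(text), len(text.encode("utf-8"))


for text, tokens in readme_token_counts.items():
    code_points, utf8_bytes = size_report(text)
    print(f"{code_points:2} code points | {utf8_bytes:2} UTF-8 bytes | {tokens:2} tokens")
```

The family emoji renders as a single glyph but consists of 7 code points and 25 bytes, which is consistent with its comparatively high token count.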
## License
MIT
Raw data
{
"_id": null,
"home_page": null,
"name": "llmvision",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.10",
"maintainer_email": null,
"keywords": "gpt, llm, nlp, tokenization, tokenizer, visualization",
"author": null,
"author_email": "Alok Singh <alokbeniwal@gmail.com>",
"download_url": "https://files.pythonhosted.org/packages/0b/4b/f829e91e1b8e0a5fd4fd416b69df6b0dbe710da62e964f87838621eee731/llmvision-0.1.1.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "MIT",
"summary": "Visualize how LLMs tokenize text",
"version": "0.1.1",
"project_urls": {
"Documentation": "https://github.com/alokbeniwal/llmvision#readme",
"Homepage": "https://github.com/alokbeniwal/llmvision",
"Issues": "https://github.com/alokbeniwal/llmvision/issues",
"Repository": "https://github.com/alokbeniwal/llmvision"
},
"split_keywords": [
"gpt",
" llm",
" nlp",
" tokenization",
" tokenizer",
" visualization"
],
"urls": [
{
"comment_text": null,
"digests": {
"blake2b_256": "987565cc6dbb92346b6c400d19d0873253e0fd3b0e80d32dd658443f5045285c",
"md5": "d6f3b553362e0e5be91db143a779bf8c",
"sha256": "230d261fa1402e9c63be1cd01d16377c26c59fa1e32366ed1f95a6b921317a2a"
},
"downloads": -1,
"filename": "llmvision-0.1.1-py3-none-any.whl",
"has_sig": false,
"md5_digest": "d6f3b553362e0e5be91db143a779bf8c",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.10",
"size": 10432,
"upload_time": "2025-07-26T07:34:53",
"upload_time_iso_8601": "2025-07-26T07:34:53.526659Z",
"url": "https://files.pythonhosted.org/packages/98/75/65cc6dbb92346b6c400d19d0873253e0fd3b0e80d32dd658443f5045285c/llmvision-0.1.1-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "0b4bf829e91e1b8e0a5fd4fd416b69df6b0dbe710da62e964f87838621eee731",
"md5": "aee95464fc7c0d360b885b857336f462",
"sha256": "8eede57c89d8960e2a16cfb65d65a0841a697076f64ad2bbc6288f0a762a990b"
},
"downloads": -1,
"filename": "llmvision-0.1.1.tar.gz",
"has_sig": false,
"md5_digest": "aee95464fc7c0d360b885b857336f462",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.10",
"size": 3597810,
"upload_time": "2025-07-26T07:34:54",
"upload_time_iso_8601": "2025-07-26T07:34:54.863189Z",
"url": "https://files.pythonhosted.org/packages/0b/4b/f829e91e1b8e0a5fd4fd416b69df6b0dbe710da62e964f87838621eee731/llmvision-0.1.1.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2025-07-26 07:34:54",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "alokbeniwal",
"github_project": "llmvision#readme",
"github_not_found": true,
"lcname": "llmvision"
}