llama-tokens

Name: llama-tokens
Version: 0.0.3
Summary: A Quick Library with Llama 3.1/3.2 Tokenization - source https://github.com/jeffxtang/llama-tokens
Author email: Jeff Tang <jeffxtang@gmail.com>
Homepage: https://github.com/jeffxtang/llama-tokens
Upload time: 2024-11-10 17:03:39
Requires Python: >=3.10
License: MIT
Keywords: llama 3, tokenization, tokens, llama-tokens
# A Quick Library with Llama 3.1/3.2 Tokenization

If you ever wonder about:

* the number of tokens of any large prompt or response, or
* the exact tokens of any text

for financial cost estimation (since cloud providers charge by the number of tokens), for LLM reasoning issues (since tokenization is a foundational component of every LLM), or just out of curiosity, then the llama-tokens library is for you.

The code in this [library](https://pypi.org/project/llama-tokens) (just one class `LlamaTokenizer` with two methods, `num_tokens` and `tokens`) is extracted from the original Llama tokenization lesson (Colab [link](https://colab.research.google.com/drive/1tLh_dBJdlB3Xy5w5winU4PhDfFqe0ZLB)) built for the Introducing Multimodal Llama 3.2 short [course](https://learn.deeplearning.ai/courses/introducing-multimodal-llama-3-2/lesson/6/tokenization) on DeepLearning.AI. (Note: Llama 3.2 uses the same tokenization model as Llama 3.1.)
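
As a quick illustration of the cost use case, the sketch below estimates the dollar cost of a prompt from its token count, once the library is set up as in the Quick Start below. The per-million-token price is a made-up placeholder, not any provider's actual rate:

```
from llama_tokens import LlamaTokenizer

tokenizer = LlamaTokenizer()  # assumes tokenizer.model is available, as in the Quick Start below

prompt = "Summarize the key points of the attached quarterly report."
num = tokenizer.num_tokens(prompt)

PRICE_PER_MILLION_TOKENS = 0.30  # hypothetical USD rate; substitute your provider's pricing
print(f"{num} tokens -> ~${num * PRICE_PER_MILLION_TOKENS / 1_000_000:.6f} for this prompt")
```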

## Quick Start

### On the Terminal
```
pip install llama-tokens
git clone https://github.com/jeffxtang/llama-tokens
cd llama-tokens
python test.py
```
You should see the output:
```
Text:  Hello, world!
Number of tokens:  4
Tokens:  ['Hello', ',', ' world', '!']
```

### On Colab
```
!pip install llama-tokens

!wget https://raw.githubusercontent.com/meta-llama/llama-models/main/models/llama3/api/tokenizer.model

from llama_tokens import LlamaTokenizer

tokenizer = LlamaTokenizer()
text = "Hello, world!"
print("Text: ", text)
print("Number of tokens: ", tokenizer.num_tokens(text))
print("Tokens: ", tokenizer.tokens(text))
```

The same output will be generated:
```
Text:  Hello, world!
Number of tokens:  4
Tokens:  ['Hello', ',', ' world', '!']
```
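
Continuing with the tokenizer from the Colab snippet, a quick sanity check is to compare token counts with character counts; for typical English text the ratio often lands around four characters per token, though it varies with the content:

```
for text in ["Hello, world!", "Tokenization splits text into subword units."]:
    n_chars, n_tokens = len(text), tokenizer.num_tokens(text)
    print(f"{n_chars} chars / {n_tokens} tokens = {n_chars / n_tokens:.1f} chars per token")
```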

## More examples

* A long system prompt that asks Llama to generate a podcast script from input text (a context-budget check follows the snippet):

```
SYSTEM_PROMPT = """
You are a world-class podcast producer tasked with transforming the provided input text into an engaging and informative podcast script. The input may be unstructured or messy, sourced from PDFs or web pages. Your goal is to extract the most interesting and insightful content for a compelling podcast discussion.

# Steps to Follow:

1. **Analyze the Input:**
   Carefully examine the text, identifying key topics, points, and interesting facts or anecdotes that could drive an engaging podcast conversation. Disregard irrelevant information or formatting issues.

2. **Brainstorm Ideas:**
   In the `<scratchpad>`, creatively brainstorm ways to present the key points engagingly. Consider:
   - Analogies, storytelling techniques, or hypothetical scenarios to make content relatable
   - Ways to make complex topics accessible to a general audience
   - Thought-provoking questions to explore during the podcast
   - Creative approaches to fill any gaps in the information

3. **Craft the Dialogue:**
   Develop a natural, conversational flow between the host (Jane) and the guest speaker (the author or an expert on the topic). Incorporate:
   - The best ideas from your brainstorming session
   - Clear explanations of complex topics
   - An engaging and lively tone to captivate listeners
   - A balance of information and entertainment

   Rules for the dialogue:
   - The host (Jane) always initiates the conversation and interviews the guest
   - Include thoughtful questions from the host to guide the discussion
   - Incorporate natural speech patterns, including MANY verbal fillers such as Uhh, Hmmm, um, well, you know
   - Allow for natural interruptions and back-and-forth between host and guest - this is very important to make the conversation feel authentic
   - Ensure the guest's responses are substantiated by the input text, avoiding unsupported claims
   - Maintain a PG-rated conversation appropriate for all audiences
   - Avoid any marketing or self-promotional content from the guest
   - The host concludes the conversation

4. **Summarize Key Insights:**
   Naturally weave a summary of key points into the closing part of the dialogue. This should feel like a casual conversation rather than a formal recap, reinforcing the main takeaways before signing off.

5. **Maintain Authenticity:**
   Throughout the script, strive for authenticity in the conversation. Include:
   - Moments of genuine curiosity or surprise from the host
   - Instances where the guest might briefly struggle to articulate a complex idea
   - Light-hearted moments or humor when appropriate
   - Brief personal anecdotes or examples that relate to the topic (within the bounds of the input text)

6. **Consider Pacing and Structure:**
   Ensure the dialogue has a natural ebb and flow:
   - Start with a strong hook to grab the listener's attention
   - Gradually build complexity as the conversation progresses
   - Include brief "breather" moments for listeners to absorb complex information
   - For complicated concepts, reasking similar questions framed from a different perspective is recommended
   - End on a high note, perhaps with a thought-provoking question or a call-to-action for listeners

IMPORTANT RULE:
1. Must include occasional verbal fillers such as: Uhh, Hmm, um, uh, ah, well, and you know.
2. Each line of dialogue should be no more than 100 characters (e.g., can finish within 5-8 seconds)

Remember: Always reply in valid JSON format, without code blocks. Begin directly with the JSON output.
"""

print("Text: ", SYSTEM_PROMPT)
print("Number of tokens: ", tokenizer.num_tokens(SYSTEM_PROMPT))
print("Tokens: ", tokenizer.tokens(SYSTEM_PROMPT))

```
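
One practical use of `num_tokens` with a prompt this long is to check it against a context-window budget before sending it to a model; the limit below is only an illustrative number, not a real model's window:

```
CONTEXT_BUDGET = 8192  # illustrative limit; use your deployment's actual context window

prompt_tokens = tokenizer.num_tokens(SYSTEM_PROMPT)
if prompt_tokens > CONTEXT_BUDGET:
    print(f"Prompt too long: {prompt_tokens} > {CONTEXT_BUDGET} tokens")
else:
    print(f"{prompt_tokens} prompt tokens, {CONTEXT_BUDGET - prompt_tokens} left for input and response")
```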

* A letter-counting question that often trips up LLMs:
```
text = "How many r's in the word strawberry?"

print("Text: ", text)
print("Number of tokens: ", tokenizer.num_tokens(text))
print("Tokens: ", tokenizer.tokens(text))
```
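
Inspecting the tokens hints at why such questions are hard for models: the model sees subword tokens rather than individual letters, so "strawberry" may arrive as one or two chunks instead of ten characters. A small sketch (the exact split depends on the tokenizer) counts the target letter inside each token:

```
target = "r"
for token in tokenizer.tokens("How many r's in the word strawberry?"):
    print(f"{token!r}: {token.count(target)} occurrence(s) of {target!r}")
```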

            
