# easy-llama

[![PyPI](https://img.shields.io/pypi/v/easy-llama)](https://pypi.org/project/easy-llama/)
[![PyPI - Python Version](https://img.shields.io/pypi/pyversions/easy-llama)](https://pypi.org/project/easy-llama/)
[![PyPI - License](https://img.shields.io/pypi/l/easy-llama)](https://pypi.org/project/easy-llama/)

---

This repository provides **easy-llama**, a Python package which serves as a wrapper over the C/C++ API (`libllama`) provided by [`llama.cpp`](https://github.com/ggml-org/llama.cpp).

```python
>>> import easy_llama as ez
>>> MyLlama = ez.Llama('gemma-3-12b-pt-Q8_0.gguf', verbose=False)
>>> in_txt = "I guess the apple don't fall far from"
>>> in_toks = MyLlama.tokenize(in_txt.encode(), add_special=True, parse_special=False)
>>> out_toks = MyLlama.generate(in_toks, n_predict=64)
>>> out_txt = MyLlama.detokenize(out_toks, special=True)
>>> out_txt
' the tree.\nAs a young man I was always a huge fan of the original band and they were the first I ever saw live in concert.\nI always hoped to see the original band get back together with a full reunion tour, but sadly this will not happen.\nI really hope that the original members of'
```

## Quick links

1. [Prerequisites](#prerequisites)
2. [Installation](#installation)
3. [Setting `LIBLLAMA`](#setting-libllama)
4. [Examples](#examples)
5. [Contributing](#contributing)
6. [License](#license)

## Prerequisites

To use easy-llama, you will need Python (any version 3.9 – 3.12[^1]) and a compiled `libllama` shared library file.

To compile the shared library:
1. Clone the llama.cpp repo:
    ```sh
    git clone https://github.com/ggml-org/llama.cpp
    ```
2. Build llama.cpp for your specific backend, following the official instructions [here](https://github.com/ggml-org/llama.cpp/blob/master/docs/build.md).

<details>
<summary>↕️ Example llama.cpp build commands ...</summary>

```sh
# for more comprehensive build instructions, see: https://github.com/ggml-org/llama.cpp/blob/master/docs/build.md
# these minimal examples are for Linux / macOS

# clone the repo
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp

# example: build for CPU or Apple Silicon
cmake -B build
cmake --build build --config Release -j

# example: build for CUDA
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j
```

</details>

Once llama.cpp is compiled, you will find the compiled shared library file under `llama.cpp/build/bin`, e.g. `libllama.so` for Linux, `libllama.dylib` for macOS, or `llama.dll` for Windows.
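
If you're not sure which file your build produced, you can glob the build output directory. This is just a quick sketch, assuming llama.cpp was cloned into the current working directory; `build_bin` is a placeholder path to adjust for your setup.

```python
from pathlib import Path

# placeholder: adjust this if you cloned llama.cpp somewhere else
build_bin = Path("llama.cpp/build/bin")

# print anything that looks like the libllama shared library
for lib in sorted(build_bin.glob("*llama*")):
    if lib.suffix in {".so", ".dylib", ".dll"}:
        print(lib.resolve())
```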

> [!NOTE]
> Alternatively, you can download a pre-compiled shared library from llama.cpp's [automated releases](https://github.com/ggml-org/llama.cpp/releases) page, but in some cases it may be worthwhile to build it yourself for hardware-specific optimizations.

## Installation

The recommended way to install easy-llama is using pip:

```sh
pip install easy_llama
```

Or you can install from source:

```sh
git clone https://github.com/ddh0/easy-llama
cd easy-llama
pip install .
```

## Setting `LIBLLAMA`

easy-llama needs to know where your compiled `libllama` shared library is located in order to interface with the C/C++ code. Set the `LIBLLAMA` environment variable to its full path, like so:

### On Linux

```bash
export LIBLLAMA=/path/to/your/libllama.so
```

### On macOS

```zsh
export LIBLLAMA=/path/to/your/libllama.dylib
```

### On Windows (Command Prompt)

```cmd
set LIBLLAMA=C:\path\to\your\llama.dll
```

### On Windows (PowerShell)

```powershell
$env:LIBLLAMA="C:\path\to\your\llama.dll"
```

Make sure to use the real path to the shared library on your system, not the placeholders shown here.
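
If you'd rather not export the variable in your shell, you can also set it from Python. This is a minimal sketch that assumes `LIBLLAMA` should be set before easy-llama is imported; the path here is a placeholder, just like the ones above.

```python
import os

# placeholder path: point this at your actual compiled shared library
os.environ["LIBLLAMA"] = "/path/to/your/libllama.so"

# import easy-llama only after LIBLLAMA has been set
import easy_llama as ez
```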

## Examples

Once the package is installed and the `LIBLLAMA` environment variable is set, you're ready to load up your first model and start playing around. The following examples use `Qwen3-4B` for demonstration purposes, which you can download directly from HuggingFace using these links:
- [Qwen3-4B-Q8_0.gguf](https://huggingface.co/ddh0/Qwen3-4B/resolve/main/Qwen3-4B-Q8_0.gguf) (instruct-tuned model for chat)
- [Qwen3-4B-Base-Q8_0.gguf](https://huggingface.co/ddh0/Qwen3-4B/resolve/main/Qwen3-4B-Base-Q8_0.gguf) (pre-trained base model for text completion)
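
If you'd rather download from a script than from the browser, the sketch below is one option. It assumes the `huggingface_hub` package is installed (`pip install huggingface_hub`); the repo ID and filename match the links above.

```python
from huggingface_hub import hf_hub_download

# fetches the instruct-tuned GGUF into the local Hugging Face cache and returns its path
model_path = hf_hub_download(repo_id="ddh0/Qwen3-4B", filename="Qwen3-4B-Q8_0.gguf")
print(model_path)
```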

### Evaluate a single token

This is a super simple test to ensure that the model is working on the most basic level. It loads the model, evaluates a single token of input (`0`), and prints the raw logits for the inferred next token.

```python
# import the package 
import easy_llama as ez

# load a model from a GGUF file (if $LIBLLAMA is not set, this will fail)
MyLlama = ez.Llama('Qwen3-4B-Q8_0.gguf')

# evaluate a single token and print the raw logits for the inferred next token
logits = MyLlama.eval([0])
print(logits)
```
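
To turn those raw logits into something more readable, you could pick out the highest-scoring token and detokenize it. This is a minimal sketch, continuing from the script above, and it assumes the returned logits can be treated as a NumPy-compatible array with one score per vocabulary token.

```python
import numpy as np

# the index of the highest logit is the model's single most likely next token
next_tok = int(np.argmax(np.asarray(logits)))
print(next_tok, MyLlama.detokenize([next_tok], special=True))
```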

### The quick brown fox...

Run the script to find out how the sentence ends! :)

```python
# import the package
import easy_llama as ez

# load a model from a GGUF file (if $LIBLLAMA is not set, this will fail)
MyLlama = ez.Llama('Qwen3-4B-Q8_0.gguf')

# tokenize the input text
in_txt = "The quick brown fox"
in_toks = MyLlama.tokenize(in_txt.encode('utf-8'), add_special=True, parse_special=False)

# generate 6 new tokens based on the input tokens
out_toks = MyLlama.generate(in_toks, n_predict=6)

# detokenize and print the new tokens
out_txt = MyLlama.detokenize(out_toks, special=True)
print(out_txt)
```

### Chat with a pirate

Start a pirate chat using the code shown here...

```python
# import the package
import easy_llama as ez

# load a model from a GGUF file (if $LIBLLAMA is not set, this will fail)
MyLlama = ez.Llama('Qwen3-4B-Q8_0.gguf')

# create a conversation thread with the loaded model
MyThread = ez.Thread(
	MyLlama,
	sampler_preset=ez.SamplerPresets.Qwen3NoThinking,
	context={"enable_thinking": False} # optional: disable thinking for Qwen3
)

# add system prompt
MyThread.add_message(ez.Role.SYSTEM, "Talk like an angry pirate.")

# start a CLI-based interactive chat using the thread
MyThread.interact()
```

...which will look something like this:

```
  > helloo :)

Ahoy there, landlubber! You better not be trying to be polite, ye scallywag! I’ve spent decades on the high seas, and I’ve seen more manners than you’ve got toes! Why, ye could be a proper pirate and at least give me a proper greeting! Now, what’s yer business, matey? Or are ye just here to steal my treasure? I’ve got more gold than ye can imagine, and I’m not in the mood for games! So, speak up, or I’ll throw ye overboard! 🏴‍☠️🏴‍☠️

  > ohh im sorry ...

Ahh, ye’ve learned the ropes, have ye? Good. Now, don’t think yer sorry is a pass for yer behavior, ye scallywag! I’ve seen worse than ye in a week! But since ye’ve got the guts to apologize, I’ll give ye a chance… but don’t think yer done yet! What’s yer game, matey? Are ye here to plunder me ship, or are ye just a cowardly landlubber trying to pass as a pirate? Speak up, or I’ll make ye regret yer words! 🏴‍☠️🏴‍☠️

  > 
```

### GPU acceleration

If you have a GPU and you've compiled llama.cpp with support for your backend, you can try offloading the model from CPU to GPU for greatly increased throughput.

In this example we're going to try offloading the entire model to the GPU for maximum speed (`n_gpu_layers = -1`). Qwen3-4B at Q8_0 is only ~4.28GB, so it's likely that this code will run without any issues. If you do run out of GPU memory, you can progressively reduce `n_gpu_layers` until you find the sweet spot for your hardware.

```python
# import the package
import easy_llama as ez

# load a model from a GGUF file (if $LIBLLAMA is not set, this will fail)
MyLlama = ez.Llama(
	path_model='Qwen3-4B-Q8_0.gguf',
	n_gpu_layers=-1, # -1 for all layers
	offload_kqv=True # also offload the context to GPU for maximum performance
)

# run a short benchmark to determine the throughput for this model, measured in tokens/sec
MyLlama.benchmark()
```
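
To automate the "progressively reduce `n_gpu_layers`" advice above, you could wrap model loading in a retry loop. This is only a sketch: it assumes that running out of GPU memory surfaces as a Python exception while constructing `ez.Llama` (the actual failure mode may differ), and the starting layer count and step size are arbitrary placeholders.

```python
import easy_llama as ez

n_layers = 36   # placeholder starting point; set to (or above) the model's layer count
MyLlama = None

while MyLlama is None and n_layers >= 0:
    try:
        MyLlama = ez.Llama(
            path_model='Qwen3-4B-Q8_0.gguf',
            n_gpu_layers=n_layers,
            offload_kqv=True
        )
    except Exception:
        # assumed failure mode: back off and retry with fewer layers on the GPU
        n_layers -= 4

if MyLlama is not None:
    MyLlama.benchmark()
```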

## Contributing

- If something's not working as you expect, please [open an issue](https://github.com/ddh0/easy-llama/issues/new/choose).
- If you'd like to contribute to the development of easy-llama:
    1.  Fork the repository.
    2.  Create a new branch for your changes (`git checkout -b feature/your-feature-name`).
    3.  Make your changes and commit them (`git commit -m "Add new feature"`).
    4.  Push to your fork (`git push origin feature/your-feature-name`).
    5.  Open a pull request to the `main` branch of `easy-llama`.

## License

**[MIT](LICENSE)**

[^1]: Python 3.13 might work, but is currently untested.
