mlx-textgen

Name: mlx-textgen
Version: 0.1.0
Summary: A python package for serving LLM on OpenAI-compatible API endpoints with prompt caching using MLX.
Homepage: https://github.com/nath1295/MLX-Textgen
Author email: Nathan Tam <nathan1295@gmail.com>
Upload time: 2024-10-20 22:35:46
Requires Python: >=3.9
License: MIT

# MLX-Textgen
[![PyPI](https://img.shields.io/pypi/v/mlx-textgen)](https://pypi.org/project/mlx-textgen/)
[![PyPI - License](https://img.shields.io/pypi/l/mlx-textgen)](https://pypi.org/project/mlx-textgen/)
[![GitHub Repo stars](https://img.shields.io/github/stars/nath1295/mlx-textgen)](https://pypi.org/project/mlx-textgen/)

## A Python package for serving LLMs on OpenAI-compatible API endpoints with prompt caching using MLX

MLX-Textgen is a lightweight LLM serving engine that uses MLX and a smart KV cache management system to make LLM generation more seamless on your Apple silicon machine. It features:
- Multiple KV-cache slots to reduce the need for repeated prompt processing
- Serving multiple models with FastAPI
- Common OpenAI API endpoints: `/v1/models`, `/v1/completions`, `/v1/chat/completions`

## Updates
**2024-10-20** - Batch inference and function calling are now supported. There are breaking changes in the save format for KV caches; run `mlx_textgen.clear_cache` after updating to avoid issues.  
**2024-10-07** - Guided decoding is supported with the [Outlines](https://github.com/dottxt-ai/outlines) backend.

## Installing MLX-Textgen
MLX-Textgen can be installed easily with `pip`:
```bash
pip install mlx-textgen
```

## Features
### 1. Multiple KV cache slots support
All KV caches are stored on disk, so unlike in other LLM serving engines, a newly created KV cache does not overwrite an existing one. This works better for agentic workflows where different types of prompts are used frequently, as the cache for a long prompt is not lost when another prompt is processed.
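As an illustration, consider an agentic loop that alternates between two long prompts. The sketch below only shows the request structure the cache slots help with; the model name and prompts are placeholders, and no server call is made:

```python
# Two long prompts alternated in an agentic loop. With a single cache slot,
# each switch would evict the other prompt's cache and force full prompt
# processing; with multiple on-disk slots, both caches survive.
long_plan_prompt = "You are a planner. " + "Project context... " * 200
long_code_prompt = "You are a coder. " + "Codebase context... " * 200

requests = [
    {"model": "my_llama_model", "prompt": long_plan_prompt + "What is step 1?"},
    {"model": "my_llama_model", "prompt": long_code_prompt + "Implement step 1."},
    # Returning to the planner prompt can reuse its cached prefix:
    {"model": "my_llama_model", "prompt": long_plan_prompt + "What is step 2?"},
]
print(len(requests))
```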

### 2. Guided decoding with Regex, Json schema, and Grammar
You can pass the guided decoding arguments `guided_json`, `guided_choice`, `guided_regex`, or `guided_grammar` as extra arguments to create structured text generation in a similar fashion to [vllm](https://github.com/vllm-project/vllm).
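For example, a request body using `guided_choice` to constrain the output to a fixed set of labels might look like this. The sketch only builds the JSON body; the model name is a placeholder, and you would POST it to `/v1/chat/completions` on your running server:

```python
import json

# Chat completion request body with a vLLM-style guided decoding argument.
payload = {
    "model": "my_llama_model",  # placeholder model name
    "messages": [
        {"role": "user", "content": "Is this review positive or negative? 'Great product!'"}
    ],
    "max_tokens": 10,
    # Constrain the generated text to one of these exact strings:
    "guided_choice": ["positive", "negative"],
}
body = json.dumps(payload)
print(body)
```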

### 3. Batch inference support
Batch inference is supported for multiple prompts or multiple generations for a single prompt. Pass a list of prompts to the `prompt` argument of the `/v1/completions` endpoint, or set `n=2` (or higher) on the `/v1/chat/completions` or `/v1/completions` endpoints for batch inferencing.
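The two batching styles can be sketched as request bodies (model name and prompts are illustrative placeholders; no server call is made here):

```python
import json

# Style 1: batch over several prompts in one /v1/completions request.
batch_payload = {
    "model": "my_llama_model",
    "prompt": ["Q: capital of France? A:", "Q: capital of Japan? A:"],
    "max_tokens": 20,
}

# Style 2: several generations for a single prompt via n.
multi_payload = {
    "model": "my_llama_model",
    "prompt": "Write a haiku about autumn.",
    "n": 3,
    "max_tokens": 60,
}
print(json.dumps(batch_payload))
print(json.dumps(multi_payload))
```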

### 4. Function calling support
Function calling with `/v1/chat/completions` is supported. Simply use the `tools` and `tool_choice` arguments to supply a list of tools. Note that instead of the schema format suggested by OpenAI, Pydantic model JSON schemas should be provided as the list of JSON schemas for `tools`, as Outlines is used to generate the arguments with 100% schema compliance. There are three modes of function calling:
1. `tool_choice="auto"`: The model decides whether tool calling is needed based on the conversation. If a tool is needed, it picks the appropriate tool and generates the arguments; otherwise, it responds with normal text.
2. `tool_choice="required"`: One of the given tools must be selected by the model. The model picks the appropriate tool and generates the arguments.
3. `tool_choice={"type": "function", "function": {"name": "<selected tool name>"}}`: The model generates the arguments for the selected tool.  

If function calling is triggered, the call arguments are contained in the `tool_calls` attribute of the `choices` element in the response, and the `finish_reason` will be `tool_call`.  

If `tool_choice="none"` is passed, the provided tools are ignored and the model only generates normal text.
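A sketch of building such a request: the tool below is a hypothetical example, the model name is a placeholder, and only the request body is constructed (no server call):

```python
from pydantic import BaseModel

class GetWeather(BaseModel):
    """Hypothetical tool: look up the current weather for a city."""
    city: str
    unit: str

# Per the note above, tools are supplied as Pydantic model JSON schemas:
tools = [GetWeather.model_json_schema()]

# The three tool_choice modes:
auto_choice = "auto"
required_choice = "required"
named_choice = {"type": "function", "function": {"name": "GetWeather"}}

# A /v1/chat/completions request body forcing the named tool:
payload = {
    "model": "my_llama_model",
    "messages": [{"role": "user", "content": "Weather in Paris, in celsius?"}],
    "tools": tools,
    "tool_choice": named_choice,
}
print(tools[0]["title"])
```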

### 5. Multiple LLMs serving
Only one model is loaded in RAM at a time, but the engine leverages MLX's fast module loading to spin up another model when it is requested. This allows serving multiple models from a single endpoint.

### 6. Automatic model quantisation
When configuring your model, you can specify a quantisation level to increase inference speed and lower memory usage. The original model is converted to the MLX quantised model format when the serving engine is initialised.

```python
from pydantic import BaseModel
from openai import Client

client = Client(api_key='Your API Key', base_url='http://localhost:5001/v1/')

class Customer(BaseModel):
    first_name: str
    last_name: str
    age: int

prompt = """Extract the customer information from the following text in json format:
"...The customer David Stone joined our membership in 2023, his current age is thirty five years old...."
"""
for chunk in client.chat.completions.create(
    model='my_llama_model',
    messages=[dict(role='user', content=prompt)],
    max_tokens=200,
    stream=True,
    extra_body=dict(
        guided_json=Customer.model_json_schema()
    )
):
    # The final streamed chunk may carry no content, hence the `or ''`.
    print(chunk.choices[0].delta.content or '', end='')

# Output: {"first_name": "David", "last_name": "Stone", "age": 35}
```

## Usage
### 1. Serving a single model
You can quickly set up an OpenAI-compatible API server with a single command.

```bash
mlx_textgen.server --model NousResearch/Hermes-3-Llama-3.1-8B --quantize q8 --port 5001
```

### 2. Serving multiple models
Create a config file template and add as many models as you like.
```bash
mlx_textgen.create_config --num-models 2
```

This generates a file called `model_config.yaml`. Edit it to configure the models you want to serve.
```yaml
- model_id_or_path: NousResearch/Hermes-3-Llama-3.1-8B
  tokenizer_id_or_path: null
  adapter_path: null
  quant: q8
  revision: null
  model_name: null
  model_config: null
  tokenizer_config: null
- model_id_or_path: mlx-community/Llama-3.2-3B-Instruct-4bit
  tokenizer_id_or_path: null
  adapter_path: null
  quant: q4
  revision: null
  model_name: llama-3.2-3b-instruct
  model_config: null
  tokenizer_config: null
```

Then start the engine:
```bash
mlx_textgen.server --config-file ./model_config.yaml --port 5001
```

### 3. More engine arguments
You can check the details of other engine arguments by running:
```bash
mlx_textgen.server --help
```

You can specify the number of cache slots for each model, the minimum number of tokens required to create a cache file, API keys, and more.

## License
This project is licensed under the terms of the MIT license.
