# mlx-lm

- **Version:** 0.19.3
- **Homepage:** https://github.com/ml-explore/mlx-examples
- **Summary:** LLMs on Apple silicon with MLX and the Hugging Face Hub
- **Upload time:** 2024-11-04 15:23:46
- **Author:** MLX Contributors
- **Requires Python:** >=3.8
- **License:** MIT
## Generate Text with LLMs and MLX

The easiest way to get started is to install the `mlx-lm` package:

**With `pip`**:

```sh
pip install mlx-lm
```

**With `conda`**:

```sh
conda install -c conda-forge mlx-lm
```

The `mlx-lm` package also has:

- [LoRA, QLoRA, and full fine-tuning](https://github.com/ml-explore/mlx-examples/blob/main/llms/mlx_lm/LORA.md)
- [Merging models](https://github.com/ml-explore/mlx-examples/blob/main/llms/mlx_lm/MERGE.md)
- [HTTP model serving](https://github.com/ml-explore/mlx-examples/blob/main/llms/mlx_lm/SERVER.md)
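
For instance, a minimal serving sketch (the model repo, port, and request body below are illustrative; see the linked SERVER.md for the options your version supports):

```bash
# Start the OpenAI-compatible HTTP server (defaults to localhost:8080).
mlx_lm.server --model mlx-community/Mistral-7B-Instruct-v0.3-4bit

# In another shell, send a chat completion request.
curl localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Hello"}], "max_tokens": 100}'
```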

### Quick Start

To generate text with an LLM use:

```bash
mlx_lm.generate --prompt "Hi!"
```

To chat with an LLM use:

```bash
mlx_lm.chat
```

This will give you a chat REPL that you can use to interact with the LLM. The
chat context is preserved during the lifetime of the REPL.
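
Like `mlx_lm.generate`, the chat command accepts a `--model` option if you want something other than the default; the repo below is just an example:

```bash
mlx_lm.chat --model mlx-community/Mistral-7B-Instruct-v0.3-4bit
```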

Commands in `mlx-lm` typically take command line options which let you specify
the model, sampling parameters, and more. Use `-h` to see a list of available
options for a command, e.g.:

```bash
mlx_lm.generate -h
```

### Python API

You can use `mlx-lm` as a module:

```python
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Mistral-7B-Instruct-v0.3-4bit")

prompt = "Write a story about Einstein"

messages = [{"role": "user", "content": prompt}]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

response = generate(model, tokenizer, prompt=prompt, verbose=True)
```

To see a description of all the arguments, run:

```
>>> help(generate)
```

Check out the [generation
example](https://github.com/ml-explore/mlx-examples/tree/main/llms/mlx_lm/examples/generate_response.py)
to see how to use the API in more detail.

The `mlx-lm` package also comes with functionality to quantize and optionally
upload models to the Hugging Face Hub.

You can convert models in the Python API with:

```python
from mlx_lm import convert

repo = "mistralai/Mistral-7B-Instruct-v0.3"
upload_repo = "mlx-community/My-Mistral-7B-Instruct-v0.3-4bit"

convert(repo, quantize=True, upload_repo=upload_repo)
```

This will generate a 4-bit quantized Mistral 7B and upload it to the repo
`mlx-community/My-Mistral-7B-Instruct-v0.3-4bit`. It will also save the
converted model in the path `mlx_model` by default.
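
Assuming the default output location, the locally converted weights in `mlx_model` can then be loaded by path just like a Hub repo (a minimal sketch):

```python
from mlx_lm import load, generate

# Load the converted model from the local output directory.
model, tokenizer = load("mlx_model")
print(generate(model, tokenizer, prompt="Hello"))
```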

To see a description of all the arguments, run:

```
>>> help(convert)
```

#### Streaming

For streaming generation, use the `stream_generate` function. This returns a
generator object which streams the output text. For example,

```python
from mlx_lm import load, stream_generate

repo = "mlx-community/Mistral-7B-Instruct-v0.3-4bit"
model, tokenizer = load(repo)

prompt = "Write a story about Einstein"

messages = [{"role": "user", "content": prompt}]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

for t in stream_generate(model, tokenizer, prompt, max_tokens=512):
    print(t, end="", flush=True)
print()
```

### Command Line

You can also use `mlx-lm` from the command line with:

```
mlx_lm.generate --model mistralai/Mistral-7B-Instruct-v0.3 --prompt "hello"
```

This will download a Mistral 7B model from the Hugging Face Hub and generate
text using the given prompt.

For a full list of options run:

```
mlx_lm.generate --help
```

To quantize a model from the command line run:

```
mlx_lm.convert --hf-path mistralai/Mistral-7B-Instruct-v0.3 -q
```

For more options run:

```
mlx_lm.convert --help
```

You can upload new models to Hugging Face by specifying `--upload-repo` to
`convert`. For example, to upload a quantized Mistral-7B model to the
[MLX Hugging Face community](https://huggingface.co/mlx-community) you can do:

```
mlx_lm.convert \
    --hf-path mistralai/Mistral-7B-Instruct-v0.3 \
    -q \
    --upload-repo mlx-community/my-4bit-mistral
```

### Long Prompts and Generations 

`mlx-lm` has some tools to scale efficiently to long prompts and generations:

- A rotating fixed-size key-value cache
- Prompt caching

To use the rotating key-value cache, pass the argument `--max-kv-size n`, where
`n` can be any integer. Smaller values like `512` use very little RAM but can
reduce quality; larger values like `4096` or higher use more RAM but preserve
more context and give better quality.
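
For example, a generation with the cache capped at 1024 entries might look like this (the model and prompt are placeholders):

```bash
mlx_lm.generate \
  --model mlx-community/Mistral-7B-Instruct-v0.3-4bit \
  --max-kv-size 1024 \
  --prompt "Write a detailed history of general relativity."
```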

Caching prompts can substantially speed up reusing the same long context with
different queries. To cache a prompt, use `mlx_lm.cache_prompt`. For example:

```bash
cat prompt.txt | mlx_lm.cache_prompt \
  --model mistralai/Mistral-7B-Instruct-v0.3 \
  --prompt - \
  --prompt-cache-file mistral_prompt.safetensors
``` 

Then use the cached prompt with `mlx_lm.generate`:

```
mlx_lm.generate \
    --prompt-cache-file mistral_prompt.safetensors \
    --prompt "\nSummarize the above text."
```

The cached prompt is treated as a prefix to the supplied prompt. Note also that
when using a cached prompt, the model is read from the cache and need not be
supplied explicitly.

Prompt caching can also be used in the Python API to avoid recomputing the
prompt. This is useful in multi-turn dialogues or across
requests that use the same context. See the
[example](https://github.com/ml-explore/mlx-examples/blob/main/llms/mlx_lm/examples/chat.py)
for more usage details.
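
A minimal sketch of the idea, assuming the `make_prompt_cache` helper and the `prompt_cache` keyword used by recent releases (the exact names may differ in your version, so defer to the linked example):

```python
from mlx_lm import load, generate
from mlx_lm.models.cache import make_prompt_cache

model, tokenizer = load("mlx-community/Mistral-7B-Instruct-v0.3-4bit")

# The cache is filled on the first call and reused on later ones,
# so the shared context is only processed once.
prompt_cache = make_prompt_cache(model)

first = generate(model, tokenizer, prompt="<long shared context>\nQuestion 1:",
                 prompt_cache=prompt_cache)
second = generate(model, tokenizer, prompt="Question 2:",
                  prompt_cache=prompt_cache)
```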

### Supported Models

`mlx-lm` supports thousands of Hugging Face format LLMs. If the model you want to
run is not supported, file an
[issue](https://github.com/ml-explore/mlx-examples/issues/new) or better yet,
submit a pull request.

Here are a few examples of Hugging Face models that work with `mlx-lm`:

- [mistralai/Mistral-7B-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1)
- [meta-llama/Llama-2-7b-hf](https://huggingface.co/meta-llama/Llama-2-7b-hf)
- [deepseek-ai/deepseek-coder-6.7b-instruct](https://huggingface.co/deepseek-ai/deepseek-coder-6.7b-instruct)
- [01-ai/Yi-6B-Chat](https://huggingface.co/01-ai/Yi-6B-Chat)
- [microsoft/phi-2](https://huggingface.co/microsoft/phi-2)
- [mistralai/Mixtral-8x7B-Instruct-v0.1](https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1)
- [Qwen/Qwen-7B](https://huggingface.co/Qwen/Qwen-7B)
- [pfnet/plamo-13b](https://huggingface.co/pfnet/plamo-13b)
- [pfnet/plamo-13b-instruct](https://huggingface.co/pfnet/plamo-13b-instruct)
- [stabilityai/stablelm-2-zephyr-1_6b](https://huggingface.co/stabilityai/stablelm-2-zephyr-1_6b)
- [internlm/internlm2-7b](https://huggingface.co/internlm/internlm2-7b)

Most
[Mistral](https://huggingface.co/models?library=transformers,safetensors&other=mistral&sort=trending),
[Llama](https://huggingface.co/models?library=transformers,safetensors&other=llama&sort=trending),
[Phi-2](https://huggingface.co/models?library=transformers,safetensors&other=phi&sort=trending),
and
[Mixtral](https://huggingface.co/models?library=transformers,safetensors&other=mixtral&sort=trending)
style models should work out of the box.

For some models (such as `Qwen` and `plamo`) the tokenizer requires you to
enable the `trust_remote_code` option. You can do this by passing
`--trust-remote-code` in the command line. If you don't specify the flag
explicitly, you will be prompted to trust remote code in the terminal when
running the model. 

For `Qwen` models you must also specify the `eos_token`. You can do this by
passing `--eos-token "<|endoftext|>"` in the command
line. 
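
Putting both flags together on the command line looks like this (the prompt is just a placeholder):

```bash
mlx_lm.generate \
  --model Qwen/Qwen-7B \
  --trust-remote-code \
  --eos-token "<|endoftext|>" \
  --prompt "Hello"
```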

These options can also be set in the Python API. For example:

```python
model, tokenizer = load(
    "qwen/Qwen-7B",
    tokenizer_config={"eos_token": "<|endoftext|>", "trust_remote_code": True},
)
```

### Large Models

> [!NOTE]
> This requires macOS 15.0 or higher to work.

Models that are large relative to the total RAM available on the machine can
be slow. `mlx-lm` will attempt to make them faster by wiring the memory
occupied by the model and cache.

If you see the following warning message:

> [WARNING] Generating with a model that requires ...

then the model will likely be slow on the given machine. If the model fits in
RAM then it can often be sped up by increasing the system wired memory limit.
To increase the limit, set the following `sysctl`:

```bash
sudo sysctl iogpu.wired_limit_mb=N
```

The value `N` should be larger than the size of the model in megabytes but
smaller than the memory size of the machine.
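
As a rough, illustrative example, on a 64 GB machine running a model that occupies about 40 GB, a limit between those two figures is a reasonable starting point:

```bash
# Allow up to ~48 GB of wired memory; adjust to your model and machine.
sudo sysctl iogpu.wired_limit_mb=49152
```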

            
