# mlx-lm

- **Version:** 0.18.1
- **Home page:** https://github.com/ml-explore/mlx-examples
- **Summary:** LLMs on Apple silicon with MLX and the Hugging Face Hub
- **Author:** MLX Contributors
- **License:** MIT
- **Requires Python:** >=3.8
- **Upload time:** 2024-08-30 12:57:12
## Generate Text with LLMs and MLX

The easiest way to get started is to install the `mlx-lm` package:

**With `pip`**:

```sh
pip install mlx-lm
```

**With `conda`**:

```sh
conda install -c conda-forge mlx-lm
```

The `mlx-lm` package also has:

- [LoRA and QLoRA fine-tuning](https://github.com/ml-explore/mlx-examples/blob/main/llms/mlx_lm/LORA.md)
- [Merging models](https://github.com/ml-explore/mlx-examples/blob/main/llms/mlx_lm/MERGE.md)
- [HTTP model serving](https://github.com/ml-explore/mlx-examples/blob/main/llms/mlx_lm/SERVER.md)
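
For example, the HTTP server can be started directly from the command line. This is a minimal sketch assuming the `mlx_lm.server` entry point and its `--model` flag described in the serving documentation linked above; check `mlx_lm.server --help` for the full set of options:

```sh
# Serve a model over an HTTP API (see the SERVER.md doc linked above for
# the available endpoints and options).
mlx_lm.server --model mlx-community/Mistral-7B-Instruct-v0.3-4bit
```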

### Python API

You can use `mlx-lm` as a module:

```python
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Mistral-7B-Instruct-v0.3-4bit")

response = generate(model, tokenizer, prompt="hello", verbose=True)
```
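
Many instruct-tuned models expect their chat template to be applied to the prompt. A hedged sketch, assuming the returned tokenizer exposes the standard Hugging Face `apply_chat_template` method:

```python
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Mistral-7B-Instruct-v0.3-4bit")

# Format the conversation with the model's chat template before generating.
messages = [{"role": "user", "content": "Write a haiku about the ocean."}]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

response = generate(model, tokenizer, prompt=prompt, verbose=True)
```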

To see a description of all of `generate`'s arguments, run:

```
>>> help(generate)
```

Check out the [generation
example](https://github.com/ml-explore/mlx-examples/tree/main/llms/mlx_lm/examples/generate_response.py)
to see how to use the API in more detail.

The `mlx-lm` package also comes with functionality to quantize and optionally
upload models to the Hugging Face Hub.

You can convert models in the Python API with:

```python
from mlx_lm import convert

repo = "mistralai/Mistral-7B-Instruct-v0.3"
upload_repo = "mlx-community/My-Mistral-7B-Instruct-v0.3-4bit"

convert(repo, quantize=True, upload_repo=upload_repo)
```

This will generate a 4-bit quantized Mistral 7B and upload it to the repo
`mlx-community/My-Mistral-7B-Instruct-v0.3-4bit`. It will also save the
converted model in the path `mlx_model` by default.
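
The converted weights can then be loaded directly from that local path. A minimal sketch, assuming the default `mlx_model` output directory:

```python
from mlx_lm import load, generate

# Load the locally converted 4-bit model instead of fetching it from the Hub.
model, tokenizer = load("mlx_model")
response = generate(model, tokenizer, prompt="hello", verbose=True)
```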

To see a description of all of `convert`'s arguments, run:

```
>>> help(convert)
```

#### Streaming

For streaming generation, use the `stream_generate` function. This returns a
generator object which streams the output text. For example,

```python
from mlx_lm import load, stream_generate

repo = "mlx-community/Mistral-7B-Instruct-v0.3-4bit"
model, tokenizer = load(repo)

prompt = "Write a story about Einstein"

for t in stream_generate(model, tokenizer, prompt, max_tokens=512):
    print(t, end="", flush=True)
print()
```

### Command Line

You can also use `mlx-lm` from the command line with:

```
mlx_lm.generate --model mistralai/Mistral-7B-Instruct-v0.3 --prompt "hello"
```

This will download a Mistral 7B model from the Hugging Face Hub and generate
text using the given prompt.

For a full list of options run:

```
mlx_lm.generate --help
```

To quantize a model from the command line run:

```
mlx_lm.convert --hf-path mistralai/Mistral-7B-Instruct-v0.3 -q
```

For more options run:

```
mlx_lm.convert --help
```

You can upload new models to Hugging Face by specifying `--upload-repo` to
`convert`. For example, to upload a quantized Mistral-7B model to the
[MLX Hugging Face community](https://huggingface.co/mlx-community) you can do:

```
mlx_lm.convert \
    --hf-path mistralai/Mistral-7B-Instruct-v0.3 \
    -q \
    --upload-repo mlx-community/my-4bit-mistral
```

### Long Prompts and Generations 

MLX LM has some tools to scale efficiently to long prompts and generations:

- A rotating, fixed-size key-value cache
- Prompt caching

To use the rotating key-value cache, pass `--max-kv-size n`, where `n` can be
any integer. Smaller values such as `512` use very little RAM but give
lower-quality results; larger values such as `4096` or higher use more RAM but
give better quality.
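
For example, to cap the key-value cache during generation (an illustrative invocation; the model and prompt are placeholders):

```
mlx_lm.generate \
    --model mlx-community/Mistral-7B-Instruct-v0.3-4bit \
    --max-kv-size 1024 \
    --prompt "Write a long story about Einstein."
```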

Caching a prompt can substantially speed up reusing the same long context with
different queries. To cache a prompt, use `mlx_lm.cache_prompt`. For example:

```bash
cat prompt.txt | mlx_lm.cache_prompt \
  --model mistralai/Mistral-7B-Instruct-v0.3 \
  --prompt - \
  --kv-cache-file mistral_prompt.safetensors
``` 

Then use the cached prompt with `mlx_lm.generate`:

```
mlx_lm.generate \
    --kv-cache-file mistral_prompt.safetensors \
    --prompt "\nSummarize the above text."
```

The cached prompt is treated as a prefix to the supplied prompt. Also note that
when using a cached prompt, the model is read from the cache and does not need
to be supplied explicitly.

### Supported Models

MLX LM supports thousands of Hugging Face format LLMs. If the model you want to
run is not supported, file an
[issue](https://github.com/ml-explore/mlx-examples/issues/new) or, better yet,
submit a pull request.

Here are a few examples of Hugging Face models that work with `mlx-lm`:

- [mistralai/Mistral-7B-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1)
- [meta-llama/Llama-2-7b-hf](https://huggingface.co/meta-llama/Llama-2-7b-hf)
- [deepseek-ai/deepseek-coder-6.7b-instruct](https://huggingface.co/deepseek-ai/deepseek-coder-6.7b-instruct)
- [01-ai/Yi-6B-Chat](https://huggingface.co/01-ai/Yi-6B-Chat)
- [microsoft/phi-2](https://huggingface.co/microsoft/phi-2)
- [mistralai/Mixtral-8x7B-Instruct-v0.1](https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1)
- [Qwen/Qwen-7B](https://huggingface.co/Qwen/Qwen-7B)
- [pfnet/plamo-13b](https://huggingface.co/pfnet/plamo-13b)
- [pfnet/plamo-13b-instruct](https://huggingface.co/pfnet/plamo-13b-instruct)
- [stabilityai/stablelm-2-zephyr-1_6b](https://huggingface.co/stabilityai/stablelm-2-zephyr-1_6b)
- [internlm/internlm2-7b](https://huggingface.co/internlm/internlm2-7b)

Most
[Mistral](https://huggingface.co/models?library=transformers,safetensors&other=mistral&sort=trending),
[Llama](https://huggingface.co/models?library=transformers,safetensors&other=llama&sort=trending),
[Phi-2](https://huggingface.co/models?library=transformers,safetensors&other=phi&sort=trending),
and
[Mixtral](https://huggingface.co/models?library=transformers,safetensors&other=mixtral&sort=trending)
style models should work out of the box.

For some models (such as `Qwen` and `plamo`) the tokenizer requires you to
enable the `trust_remote_code` option. You can do this by passing
`--trust-remote-code` in the command line. If you don't specify the flag
explicitly, you will be prompted to trust remote code in the terminal when
running the model. 

For `Qwen` models you must also specify the `eos_token`. You can do this by
passing `--eos-token "<|endoftext|>"` in the command
line. 
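
Putting both flags together on the command line (an illustrative example built from the options above):

```
mlx_lm.generate \
    --model Qwen/Qwen-7B \
    --trust-remote-code \
    --eos-token "<|endoftext|>" \
    --prompt "hello"
```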

These options can also be set in the Python API. For example:

```python
model, tokenizer = load(
    "qwen/Qwen-7B",
    tokenizer_config={"eos_token": "<|endoftext|>", "trust_remote_code": True},
)
```

            
