## Generate Text with LLMs and MLX
The easiest way to get started is to install the `mlx-lm` package:
**With `pip`**:
```sh
pip install mlx-lm
```
**With `conda`**:
```sh
conda install -c conda-forge mlx-lm
```
The `mlx-lm` package also includes:
- [LoRA and QLoRA fine-tuning](https://github.com/ml-explore/mlx-examples/blob/main/llms/mlx_lm/LORA.md)
- [Merging models](https://github.com/ml-explore/mlx-examples/blob/main/llms/mlx_lm/MERGE.md)
- [HTTP model serving](https://github.com/ml-explore/mlx-examples/blob/main/llms/mlx_lm/SERVER.md)
### Python API
You can use `mlx-lm` as a module:
```python
from mlx_lm import load, generate
model, tokenizer = load("mlx-community/Mistral-7B-Instruct-v0.3-4bit")
response = generate(model, tokenizer, prompt="hello", verbose=True)
```
To see a description of all the arguments, run:
```
>>> help(generate)
```
Check out the [generation
example](https://github.com/ml-explore/mlx-examples/tree/main/llms/mlx_lm/examples/generate_response.py)
to see how to use the API in more detail.
The `mlx-lm` package also comes with functionality to quantize and optionally
upload models to the Hugging Face Hub.
You can convert models in the Python API with:
```python
from mlx_lm import convert
repo = "mistralai/Mistral-7B-Instruct-v0.3"
upload_repo = "mlx-community/My-Mistral-7B-Instruct-v0.3-4bit"
convert(repo, quantize=True, upload_repo=upload_repo)
```
This will generate a 4-bit quantized Mistral 7B model and upload it to the repo
`mlx-community/My-Mistral-7B-Instruct-v0.3-4bit`. It will also save the
converted model locally in the `mlx_model` directory by default.
To see a description of all the arguments, run:
```
>>> help(convert)
```
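To get a rough intuition for what 4-bit quantization does to the weights, here is a toy affine quantizer. This is illustrative only: MLX's actual scheme is group-wise (per-group scales and biases), and the function names here are made up for the sketch.

```python
def quantize_4bit(values):
    """Toy affine 4-bit quantization: map each float to an integer code in [0, 15]."""
    lo, hi = min(values), max(values)
    scale = (hi - lo) / 15 or 1.0  # avoid division by zero for constant inputs
    codes = [round((v - lo) / scale) for v in values]
    return codes, scale, lo

def dequantize(codes, scale, lo):
    """Recover approximate float values from the 4-bit codes."""
    return [c * scale + lo for c in codes]

weights = [-1.0, -0.5, 0.0, 0.5, 1.0]
codes, scale, lo = quantize_4bit(weights)
approx = dequantize(codes, scale, lo)

# Each code fits in 4 bits instead of 32, at the cost of a bounded rounding
# error of at most half a quantization step.
assert all(abs(w - a) <= scale / 2 + 1e-9 for w, a in zip(weights, approx))
```

The memory saving is what makes a 7B model comfortable on consumer Apple silicon: 4 bits per weight (plus small per-group metadata) instead of 16 or 32.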
#### Streaming
For streaming generation, use the `stream_generate` function. It returns a
generator that yields the output text as it is produced. For example:
```python
from mlx_lm import load, stream_generate
repo = "mlx-community/Mistral-7B-Instruct-v0.3-4bit"
model, tokenizer = load(repo)
prompt = "Write a story about Einstein"
for t in stream_generate(model, tokenizer, prompt, max_tokens=512):
    print(t, end="", flush=True)
print()
```
### Command Line
You can also use `mlx-lm` from the command line with:
```
mlx_lm.generate --model mistralai/Mistral-7B-Instruct-v0.3 --prompt "hello"
```
This will download a Mistral 7B model from the Hugging Face Hub and generate
text using the given prompt.
For a full list of options run:
```
mlx_lm.generate --help
```
To quantize a model from the command line run:
```
mlx_lm.convert --hf-path mistralai/Mistral-7B-Instruct-v0.3 -q
```
For more options run:
```
mlx_lm.convert --help
```
You can upload new models to Hugging Face by specifying `--upload-repo` to
`convert`. For example, to upload a quantized Mistral-7B model to the
[MLX Hugging Face community](https://huggingface.co/mlx-community) you can do:
```
mlx_lm.convert \
    --hf-path mistralai/Mistral-7B-Instruct-v0.3 \
    -q \
    --upload-repo mlx-community/my-4bit-mistral
```
### Long Prompts and Generations
MLX LM has some tools to scale efficiently to long prompts and generations:
- A rotating fixed-size key-value cache.
- Prompt caching.

To use the rotating key-value cache, pass `--max-kv-size n`, where `n` is the
maximum cache size in tokens. Smaller values such as `512` use very little RAM
but degrade quality on long contexts; larger values such as `4096` or more use
more RAM but preserve more context and give better quality.
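The trade-off is easy to see with a toy sketch: a rotating cache simply evicts the oldest entries once it is full, so attention only sees the most recent context. (MLX's real cache differs in the details; this is just the idea, independent of MLX.)

```python
from collections import deque

# Toy model of a rotating fixed-size KV cache: a bounded deque that drops
# the oldest entries once the size limit is reached.
max_kv_size = 4
cache = deque(maxlen=max_kv_size)

for token_id in range(6):  # stand-ins for per-token key/value pairs
    cache.append(token_id)

# The first two tokens were evicted; only the most recent 4 remain.
print(list(cache))  # [2, 3, 4, 5]
```

This is why small `--max-kv-size` values hurt quality: the model can no longer attend to the evicted earlier context.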
Caching a prompt can substantially speed up reusing the same long context with
different queries. To cache a prompt, use `mlx_lm.cache_prompt`. For example:
```bash
cat prompt.txt | mlx_lm.cache_prompt \
    --model mistralai/Mistral-7B-Instruct-v0.3 \
    --prompt - \
    --kv-cache-file mistral_prompt.safetensors
```
Then use the cached prompt with `mlx_lm.generate`:
```
mlx_lm.generate \
    --kv-cache-file mistral_prompt.safetensors \
    --prompt "\nSummarize the above text."
```
The cached prompt is treated as a prefix to the supplied prompt. Note also that
when using a cached prompt, the model is read from the cache file and need not
be supplied explicitly.
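The benefit is that the expensive forward pass over the long shared prefix happens once instead of once per query. A toy sketch of the idea (the names here are illustrative, not the mlx-lm API):

```python
# Toy sketch of prompt caching: "processing" tokens is expensive, so the
# state after the long shared prefix is computed once and reused.
calls = 0

def process(tokens, state=()):
    """Stand-in for a transformer forward pass; returns an updated KV state."""
    global calls
    calls += len(tokens)  # count per-token work
    return state + tuple(tokens)

prefix = ("a", "very", "long", "document")
cached = process(prefix)  # done once, analogous to mlx_lm.cache_prompt

for query in (("summarize",), ("translate",)):
    state = process(query, state=cached)  # only the new tokens are processed

# 4 prefix tokens + 2 query tokens of work, instead of 2 * (4 + 1) without caching.
print(calls)  # 6
```

With a real multi-thousand-token document as the prefix, the savings dominate the cost of each short query.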
### Supported Models
MLX LM supports thousands of Hugging Face format LLMs. If the model you want to
run is not supported, file an
[issue](https://github.com/ml-explore/mlx-examples/issues/new) or better yet,
submit a pull request.
Here are a few examples of Hugging Face models that work with `mlx-lm`:
- [mistralai/Mistral-7B-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1)
- [meta-llama/Llama-2-7b-hf](https://huggingface.co/meta-llama/Llama-2-7b-hf)
- [deepseek-ai/deepseek-coder-6.7b-instruct](https://huggingface.co/deepseek-ai/deepseek-coder-6.7b-instruct)
- [01-ai/Yi-6B-Chat](https://huggingface.co/01-ai/Yi-6B-Chat)
- [microsoft/phi-2](https://huggingface.co/microsoft/phi-2)
- [mistralai/Mixtral-8x7B-Instruct-v0.1](https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1)
- [Qwen/Qwen-7B](https://huggingface.co/Qwen/Qwen-7B)
- [pfnet/plamo-13b](https://huggingface.co/pfnet/plamo-13b)
- [pfnet/plamo-13b-instruct](https://huggingface.co/pfnet/plamo-13b-instruct)
- [stabilityai/stablelm-2-zephyr-1_6b](https://huggingface.co/stabilityai/stablelm-2-zephyr-1_6b)
- [internlm/internlm2-7b](https://huggingface.co/internlm/internlm2-7b)
Most
[Mistral](https://huggingface.co/models?library=transformers,safetensors&other=mistral&sort=trending),
[Llama](https://huggingface.co/models?library=transformers,safetensors&other=llama&sort=trending),
[Phi-2](https://huggingface.co/models?library=transformers,safetensors&other=phi&sort=trending),
and
[Mixtral](https://huggingface.co/models?library=transformers,safetensors&other=mixtral&sort=trending)
style models should work out of the box.
For some models (such as `Qwen` and `plamo`) the tokenizer requires you to
enable the `trust_remote_code` option. You can do this by passing
`--trust-remote-code` on the command line. If you don't specify the flag
explicitly, you will be prompted in the terminal to trust remote code when
running the model.
For `Qwen` models you must also specify the `eos_token`. You can do this by
passing `--eos-token "<|endoftext|>"` on the command line.
These options can also be set in the Python API. For example:
```python
model, tokenizer = load(
    "qwen/Qwen-7B",
    tokenizer_config={"eos_token": "<|endoftext|>", "trust_remote_code": True},
)
```