# Neural Speed
Neural Speed is an innovative library designed to support the efficient inference of large language models (LLMs) on Intel platforms through state-of-the-art (SOTA) low-bit quantization powered by [Intel Neural Compressor](https://github.com/intel/neural-compressor). The work is inspired by [llama.cpp](https://github.com/ggerganov/llama.cpp) and further optimized for Intel platforms with our innovations presented at [NeurIPS' 2023](https://arxiv.org/abs/2311.00502).
## Key Features
- Highly optimized low-precision kernels on CPUs with ISAs (AMX, VNNI, AVX512F, AVX_VNNI and AVX2). See [details](neural_speed/core/README.md)
- Up to 40x performance speedup on popular LLMs compared with llama.cpp. See [details](https://medium.com/@NeuralCompressor/llm-performance-of-intel-extension-for-transformers-f7d061556176)
- Tensor parallelism across sockets/nodes on CPUs. See [details](./docs/tensor_parallelism.md)
> Neural Speed is under active development, so APIs are subject to change.
## Supported Hardware
| Hardware | Supported |
|-------------|:-------------:|
|Intel Xeon Scalable Processors | ✔ |
|Intel Xeon CPU Max Series | ✔ |
|Intel Core Processors | ✔ |
## Supported Models
Neural Speed supports almost all LLMs in PyTorch format from Hugging Face, such as Llama2, ChatGLM2, Baichuan2, Qwen, Mistral, and Whisper. File an [issue](https://github.com/intel/neural-speed/issues) if your favorite LLM does not work.
Typical LLMs in GGUF format, such as Llama2, Falcon, MPT, and Bloom, are also supported, with more on the way. Check out the [details](./docs/supported_models.md).
## Installation
### Install from binary
```shell
pip install -r requirements.txt
pip install neural-speed
```
### Build from Source
```shell
pip install -r requirements.txt
pip install .
```
>**Note**: GCC version 10 or newer is required.
## Quick Start (Transformer-like usage)
Install [Intel Extension for Transformers](https://github.com/intel/intel-extension-for-transformers/blob/main/docs/installation.md) to use Transformer-like APIs.
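A minimal install sketch, assuming the package published on PyPI under the name `intel-extension-for-transformers` (the linked installation guide remains the authoritative reference):
```shell
# Install Intel Extension for Transformers from PyPI (assumed package name;
# see the installation guide above for platform-specific instructions)
pip install intel-extension-for-transformers
```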
### PyTorch Model from Hugging Face
```python
from transformers import AutoTokenizer, TextStreamer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM
model_name = "Intel/neural-chat-7b-v3-1" # Hugging Face model_id or local model
prompt = "Once upon a time, there existed a little girl,"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
inputs = tokenizer(prompt, return_tensors="pt").input_ids
streamer = TextStreamer(tokenizer)
model = AutoModelForCausalLM.from_pretrained(model_name, load_in_4bit=True)
outputs = model.generate(inputs, streamer=streamer, max_new_tokens=300)
```
### GGUF Model from Hugging Face
```python
from transformers import AutoTokenizer, TextStreamer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM
# Specify the GGUF repo on Hugging Face
model_name = "TheBloke/Llama-2-7B-Chat-GGUF"
# Download the specific GGUF model file from the above repo
model_file = "llama-2-7b-chat.Q4_0.gguf"
# Make sure you have been granted access to this model on Hugging Face.
tokenizer_name = "meta-llama/Llama-2-7b-chat-hf"
prompt = "Once upon a time"
tokenizer = AutoTokenizer.from_pretrained(tokenizer_name, trust_remote_code=True)
inputs = tokenizer(prompt, return_tensors="pt").input_ids
streamer = TextStreamer(tokenizer)
model = AutoModelForCausalLM.from_pretrained(model_name, model_file=model_file)
outputs = model.generate(inputs, streamer=streamer, max_new_tokens=300)
```
### PyTorch Model from Modelscope
```python
from transformers import TextStreamer
from modelscope import AutoTokenizer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM
model_name = "qwen/Qwen-7B" # Modelscope model_id or local model
prompt = "Once upon a time, there existed a little girl,"
model = AutoModelForCausalLM.from_pretrained(model_name, load_in_4bit=True, model_hub="modelscope")
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
inputs = tokenizer(prompt, return_tensors="pt").input_ids
streamer = TextStreamer(tokenizer)
outputs = model.generate(inputs, streamer=streamer, max_new_tokens=300)
```
## Quick Start (llama.cpp-like usage)
### Single (One-click) Step
```bash
python scripts/run.py model-path --weight_dtype int4 -p "She opened the door and see"
```
### Multiple Steps
#### Convert and Quantize
```bash
# Skip this step if the GGUF model is from Hugging Face or was generated by llama.cpp
python scripts/convert.py --outtype f32 --outfile ne-f32.bin EleutherAI/gpt-j-6b
```
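The inference step below reads a quantized file (`ne-q4_j.bin`), which is produced from the f32 file by a quantization step. A sketch under the assumption that the repository's `scripts/quantize.py` accepts these flags (see [Advanced Usage](./docs/advanced_usage.md) for the authoritative options):
```bash
# Quantize the f32 model to 4-bit weights; flag names are illustrative,
# consult the Advanced Usage doc for the exact options in your version
python scripts/quantize.py --model_name llama --model_file ne-f32.bin --out_file ne-q4_j.bin --weight_dtype int4
```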
#### Inference
```bash
# Linux and WSL
OMP_NUM_THREADS=<physical_cores> numactl -m 0 -C 0-<physical_cores-1> python scripts/inference.py --model_name llama -m ne-q4_j.bin -c 512 -b 1024 -n 256 -t <physical_cores> --color -p "She opened the door and see"
```
```bash
# Windows
python scripts/inference.py --model_name llama -m ne-q4_j.bin -c 512 -b 1024 -n 256 -t <physical_cores|P-cores> --color -p "She opened the door and see"
```
> Please refer to [Advanced Usage](./docs/advanced_usage.md) for more details.
## Advanced Topics
### New model enabling
To add your own models, please follow the [graph developer document](./developer_document.md).
### Performance profiling
Set the `NEURAL_SPEED_VERBOSE` environment variable to enable performance profiling; an example follows the list below.
Available modes:
- 0: Print full information: evaluation time and operator profiling. Requires `NS_PROFILING` to be set to ON and a recompile.
- 1: Print evaluation time only (the time taken for each evaluation).
- 2: Profile individual operators to identify performance bottlenecks within the model. Requires `NS_PROFILING` to be set to ON and a recompile.
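For example, a minimal profiling run assuming mode `1` (per-evaluation timing, no recompilation needed), reusing the inference command from the Quick Start above:
```bash
# Print the time taken for each evaluation during generation
NEURAL_SPEED_VERBOSE=1 python scripts/inference.py --model_name llama -m ne-q4_j.bin -c 512 -b 1024 -n 256 -p "She opened the door and see"
```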