neural-speed

Name: neural-speed
Version: 1.0
Home page: https://github.com/intel/neural-speed
Summary: Repository of Intel® Intel Extension for Transformers
Upload time: 2024-03-29 11:42:42
Author: Intel AISE/AIPC Team
Requires Python: >=3.7.0
License: Apache 2.0
Keywords: large language model, LLM, sub-byte
# Neural Speed

Neural Speed is an innovative library designed to support efficient inference of large language models (LLMs) on Intel platforms through state-of-the-art (SOTA) low-bit quantization powered by [Intel Neural Compressor](https://github.com/intel/neural-compressor). The work is inspired by [llama.cpp](https://github.com/ggerganov/llama.cpp) and further optimized for Intel platforms with our innovations presented at [NeurIPS 2023](https://arxiv.org/abs/2311.00502).

## Key Features
- Highly optimized low-precision kernels on CPUs, leveraging ISA extensions (AMX, VNNI, AVX512F, AVX_VNNI, and AVX2). See [details](neural_speed/core/README.md).
- Up to 40x performance speedup on popular LLMs compared with llama.cpp. See [details](https://medium.com/@NeuralCompressor/llm-performance-of-intel-extension-for-transformers-f7d061556176).
- Tensor parallelism across sockets/nodes on CPUs. See [details](./docs/tensor_parallelism.md).

> Neural Speed is under active development, so APIs are subject to change.

## Supported Hardware
| Hardware | Supported |
|-------------|:-------------:|
|Intel Xeon Scalable Processors | ✔ |
|Intel Xeon CPU Max Series | ✔ |
|Intel Core Processors | ✔ |

## Supported Models
Neural Speed supports almost all LLMs in PyTorch format from Hugging Face, such as Llama2, ChatGLM2, Baichuan2, Qwen, Mistral, and Whisper. File an [issue](https://github.com/intel/neural-speed/issues) if your favorite LLM does not work.

It also supports typical LLMs in GGUF format, such as Llama2, Falcon, MPT, and Bloom, with more on the way. Check out the [details](./docs/supported_models.md).

## Installation

### Install from binary
```shell
pip install -r requirements.txt
pip install neural-speed
```

### Build from Source
```shell
pip install -r requirements.txt
pip install .
```

>**Note**: Building from source requires GCC 10 or newer.
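
A quick sanity check after either installation path (a minimal sketch; `pip show` reports the installed version, and the import confirms the native extension loads):

```shell
pip show neural-speed
python -c "import neural_speed"
```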


## Quick Start (Transformer-like usage)

Install [Intel Extension for Transformers](https://github.com/intel/intel-extension-for-transformers/blob/main/docs/installation.md) to use Transformer-like APIs.


### PyTorch Model from Hugging Face

```python
from transformers import AutoTokenizer, TextStreamer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM
model_name = "Intel/neural-chat-7b-v3-1"     # Hugging Face model_id or local model
prompt = "Once upon a time, there existed a little girl,"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
inputs = tokenizer(prompt, return_tensors="pt").input_ids
streamer = TextStreamer(tokenizer)

model = AutoModelForCausalLM.from_pretrained(model_name, load_in_4bit=True)
outputs = model.generate(inputs, streamer=streamer, max_new_tokens=300)
```

### GGUF Model from Hugging Face

```python
from transformers import AutoTokenizer, TextStreamer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM

# Specify the GGUF repo on Hugging Face
model_name = "TheBloke/Llama-2-7B-Chat-GGUF"
# Download the specific GGUF model file from the above repo
model_file = "llama-2-7b-chat.Q4_0.gguf"
# Make sure you have been granted access to this model on Hugging Face
tokenizer_name = "meta-llama/Llama-2-7b-chat-hf"

prompt = "Once upon a time"
tokenizer = AutoTokenizer.from_pretrained(tokenizer_name, trust_remote_code=True)
inputs = tokenizer(prompt, return_tensors="pt").input_ids
streamer = TextStreamer(tokenizer)
model = AutoModelForCausalLM.from_pretrained(model_name, model_file=model_file)
outputs = model.generate(inputs, streamer=streamer, max_new_tokens=300)
```
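
Note that the `meta-llama/Llama-2-7b-chat-hf` tokenizer repo is gated, so you typically need to authenticate with Hugging Face first, e.g. with the `huggingface_hub` CLI:

```shell
huggingface-cli login
```
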
### PyTorch Model from ModelScope
```python
from transformers import TextStreamer
from modelscope import AutoTokenizer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM
model_name = "qwen/Qwen-7B"     # Modelscope model_id or local model
prompt = "Once upon a time, there existed a little girl,"

model = AutoModelForCausalLM.from_pretrained(model_name, load_in_4bit=True, model_hub="modelscope")
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
inputs = tokenizer(prompt, return_tensors="pt").input_ids
streamer = TextStreamer(tokenizer)
outputs = model.generate(inputs, streamer=streamer, max_new_tokens=300)
```

## Quick Start (llama.cpp-like usage)

### Single (One-click) Step

```bash
python scripts/run.py model-path --weight_dtype int4 -p "She opened the door and see"
```

### Multiple Steps

#### Convert and Quantize

```bash
# Skip this step if the GGUF model comes from Hugging Face or was generated by llama.cpp
python scripts/convert.py --outtype f32 --outfile ne-f32.bin EleutherAI/gpt-j-6b
```
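
The converted FP32 file is then quantized to the low-bit file used for inference below (e.g. `ne-q4_j.bin`). A minimal sketch, assuming `scripts/quantize.py` takes flags mirroring those of `run.py` and `convert.py` (verify with `python scripts/quantize.py --help`):

```bash
# Quantize the FP32 model to 4-bit weights (flag names are an assumption)
python scripts/quantize.py --model_name gptj --model_file ne-f32.bin --out_file ne-q4_j.bin --weight_dtype int4
```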

#### Inference

```bash
# Linux and WSL
OMP_NUM_THREADS=<physical_cores> numactl -m 0 -C 0-<physical_cores-1> python scripts/inference.py --model_name llama -m ne-q4_j.bin -c 512 -b 1024 -n 256 -t <physical_cores> --color -p "She opened the door and see"
```
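
To fill in `<physical_cores>` on Linux, one option is to count unique physical cores with standard tooling:

```bash
# Count unique (core, socket) pairs, ignoring hyper-threaded siblings
lscpu -p=Core,Socket | grep -v '^#' | sort -u | wc -l
```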

```bash
# Windows
python scripts/inference.py --model_name llama -m ne-q4_j.bin -c 512 -b 1024 -n 256 -t <physical_cores|P-cores> --color -p "She opened the door and see"
```

> Please refer to [Advanced Usage](./docs/advanced_usage.md) for more details.

## Advanced Topics

### New model enabling
To add support for your own models, please follow the [graph developer document](./developer_document.md).

### Performance profiling
Set the `NEURAL_SPEED_VERBOSE` environment variable to enable performance profiling; see the sketch after the mode list below.

Available modes:
- 0: Print full information: evaluation time and operator profiling. Requires building with `NS_PROFILING` set to ON.
- 1: Print evaluation time, i.e., the time taken for each evaluation.
- 2: Profile individual operators to identify performance bottlenecks within the model. Requires building with `NS_PROFILING` set to ON.
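
For example, mode 1 works with a standard build, while modes 0 and 2 need a rebuild with profiling enabled. A minimal sketch, assuming your build forwards CMake options via `CMAKE_ARGS` (the exact mechanism may differ in your setup):

```bash
# Print the time taken for each evaluation (no rebuild needed)
NEURAL_SPEED_VERBOSE=1 python scripts/inference.py --model_name llama -m ne-q4_j.bin -n 32 -p "She opened the door and see"

# Rebuild with profiling enabled, then profile individual operators
CMAKE_ARGS="-DNS_PROFILING=ON" pip install .
NEURAL_SPEED_VERBOSE=2 python scripts/inference.py --model_name llama -m ne-q4_j.bin -n 32 -p "She opened the door and see"
```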

            
