xfastertransformer

Name	xfastertransformer JSON
Version	1.8.1 JSON
	download
home_page	https://github.com/intel/xFasterTransformer
Summary	Boost large language model inference performance on CPU platform.
upload_time	2024-07-31 08:34:50
maintainer	None
docs_url	None
author	xFasterTransformer
requires_python	>=3.8
license	Apache 2.0
keywords	llm
VCS
bugtrack_url
requirements	No requirements were recorded.
Travis-CI	No Travis.
coveralls test coverage	No coveralls.

            # xFasterTransformer

<p align="center">
  <a href="./README.md">English</a> |
  <a href="./README_CN.md">简体中文</a>
</p>

xFasterTransformer is an exceptionally optimized solution for large language models (LLM) on the X86 platform, which is similar to FasterTransformer on the GPU platform. xFasterTransformer is able to operate in distributed mode across multiple sockets and nodes to support inference on larger models. Additionally, it provides both C++ and Python APIs, spanning from high-level to low-level interfaces, making it easy to adopt and integrate.

## Table of Contents
- [xFasterTransformer](#xfastertransformer)
  - [Table of Contents](#table-of-contents)
  - [Models overview](#models-overview)
    - [Model support matrix](#model-support-matrix)
    - [DataType support list](#datatype-support-list)
  - [Documents](#documents)
  - [Installation](#installation)
    - [From PyPI](#from-pypi)
    - [Using Docker](#using-docker)
    - [Built from source](#built-from-source)
      - [Prepare Environment](#prepare-environment)
        - [Manually](#manually)
        - [Install dependent libraries](#install-dependent-libraries)
        - [How to build](#how-to-build)
  - [Models Preparation](#models-preparation)
  - [API usage](#api-usage)
    - [Python API(PyTorch)](#python-apipytorch)
    - [C++ API](#c-api)
  - [How to run](#how-to-run)
    - [Single rank](#single-rank)
    - [Multi ranks](#multi-ranks)
      - [Command line](#command-line)
      - [Code](#code)
        - [Python](#python)
        - [C++](#c)
  - [Web Demo](#web-demo)
  - [Serving](#serving)
    - [vLLM](#vllm)
      - [Install](#install)
      - [OpenAI Compatible Server](#openai-compatible-server)
    - [FastChat](#fastchat)
    - [MLServer](#mlserver)
  - [Benchmark](#benchmark)
  - [Support](#support)
  - [Accepted Papers](#accepted-papers)
  - [Q\&A](#qa)

## Models overview
Large Language Models (LLMs) develops very fast and are more widely used in many AI scenarios. xFasterTransformer is an optimized solution for LLM inference using the mainstream and popular LLM models on Xeon. xFasterTransformer fully leverages the hardware capabilities of Xeon platforms to achieve the high performance and high scalability of LLM inference both on single socket and multiple sockets/multiple nodes.

xFasterTransformer provides a series of APIs, both of C++ and Python, for end users to integrate xFasterTransformer into their own solutions or services directly. Many kinds of example codes are also provided to demonstrate the usage. Benchmark codes and scripts are provided for users to show the performance. Web demos for popular LLM models are also provided.


### Model support matrix

|       Models       | Framework |          | Distribution |
| :----------------: | :-------: | :------: | :----------: |
|                    |  PyTorch  |   C++    |              |
|      ChatGLM       | &#10004;  | &#10004; |   &#10004;   |
|      ChatGLM2      | &#10004;  | &#10004; |   &#10004;   |
|      ChatGLM3      | &#10004;  | &#10004; |   &#10004;   |
|        GLM4        | &#10004;  | &#10004; |   &#10004;   |
|       Llama        | &#10004;  | &#10004; |   &#10004;   |
|       Llama2       | &#10004;  | &#10004; |   &#10004;   |
|       Llama3       | &#10004;  | &#10004; |   &#10004;   |
|     Baichuan       | &#10004;  | &#10004; |   &#10004;   |
|     Baichuan2      | &#10004;  | &#10004; |   &#10004;   |
|        QWen        | &#10004;  | &#10004; |   &#10004;   |
|        QWen2       | &#10004;  | &#10004; |   &#10004;   |
| SecLLM(YaRN-Llama) | &#10004;  | &#10004; |   &#10004;   |
|        Opt         | &#10004;  | &#10004; |   &#10004;   |
|   Deepseek-coder   | &#10004;  | &#10004; |   &#10004;   |
|      gemma         | &#10004;  | &#10004; |   &#10004;   |
|     gemma-1.1      | &#10004;  | &#10004; |   &#10004;   |
|     codegemma      | &#10004;  | &#10004; |   &#10004;   |

### DataType support list

- FP16
- BF16
- INT8
- W8A8
- INT4
- NF4
- BF16_FP16
- BF16_INT8
- BF16_W8A8
- BF16_INT4
- BF16_NF4
- W8A8_INT8
- W8A8_int4
- W8A8_NF4

## Documents
xFasterTransformer Documents and [Wiki](https://github.com/intel/xFasterTransformer/wiki) provides the following resources:
- An introduction to xFasterTransformer.
- Comprehensive API references for both high-level and low-level interfaces in C++ and PyTorch.
- Practical API usage examples for xFasterTransformer in both C++ and PyTorch.

## Installation
### From PyPI
```bash
pip install xfastertransformer
```

### Using Docker
```bash
docker pull intel/xfastertransformer:latest
```
Run the docker with the command (Assume model files are in `/data/` directory):  
```bash
docker run -it \
    --name xfastertransformer \
    --privileged \
    --shm-size=16g \
    -v /data/:/data/ \
    -e "http_proxy=$http_proxy" \
    -e "https_proxy=$https_proxy" \
    intel/xfastertransformer:latest
```
**Notice!!!**: Please enlarge `--shm-size` if  **bus error** occurred while running in the multi-ranks mode. The default docker limits the shared memory size to 64MB and our implementation uses many shared memories to achieve a  better performance.

### Built from source
#### Prepare Environment
##### Manually
- [PyTorch](https://pytorch.org/get-started/locally/) v2.3 (When using the PyTorch API, it's required, but it's not needed when using the C++ API.)
  ```bash 
  pip install torch --index-url https://download.pytorch.org/whl/cpu
  ```

- For GPU, xFT needs ABI=1 from [torch==2.3.0+cpu.cxx11.abi](https://download.pytorch.org/whl/cpu-cxx11-abi/torch-2.3.0%2Bcpu.cxx11.abi-cp38-cp38-linux_x86_64.whl#sha256=c34512c3e07efe9b7fb5c3a918fef1a7c6eb8969c6b2eea92ee5c16a0583fe12) in [torch-whl-list](https://download.pytorch.org/whl/torch/) due to DPC++ need ABI=1.

##### Install dependent libraries

Please install libnuma package:
- CentOS: yum install libnuma-devel
- Ubuntu: apt-get install libnuma-dev

##### How to build
- Using 'CMake'
  ```bash
  # Build xFasterTransformer
  git clone https://github.com/intel/xFasterTransformer.git xFasterTransformer
  cd xFasterTransformer
  git checkout <latest-tag>
  # Please make sure torch is installed when run python example
  mkdir build && cd build
  cmake ..
  make -j
  ```
- Using `python setup.py`
  ```bash
  # Build xFasterTransformer library and C++ example.
  python setup.py build

  # Install xFasterTransformer into pip environment.
  # Notice: Run `python setup.py build` before installation!
  python setup.py install
  ```

## [Models Preparation](tools/README.md)
xFasterTransformer supports a different model format from Huggingface, but it's compatible with FasterTransformer's format.
1. Download the huggingface format model firstly.
2. After that, convert the model into xFasterTransformer format by using model convert module in xfastertransformer. If output directory is not provided, converted model will be placed into `${HF_DATASET_DIR}-xft`.
    ```
    python -c 'import xfastertransformer as xft; xft.LlamaConvert().convert("${HF_DATASET_DIR}","${OUTPUT_DIR}")'
    ```
    ***PS: Due to the potential compatibility issues between the model file and the `transformers` version, please select the appropriate `transformers` version.***
    
    Supported model convert list:
    - LlamaConvert
    - YiConvert
    - GemmaConvert
    - ChatGLMConvert
    - ChatGLM2Convert
    - ChatGLM4Convert
    - OPTConvert
    - BaichuanConvert
    - Baichuan2Convert
    - QwenConvert
    - Qwen2Convert
    - DeepseekConvert

## API usage
For more details, please see API document and [examples](examples/README.md).
### Python API(PyTorch)
Firstly, please install the dependencies.
- Python dependencies
  ```bash
  pip install -r requirements.txt
  ```
  ***PS: Due to the potential compatibility issues between the model file and the `transformers` version, please select the appropriate `transformers` version.***
- oneCCL (For multi ranks)  
  Install oneCCL and setup the environment. Please refer to [Prepare Environment](#prepare-environment).


xFasterTransformer's Python API is similar to transformers and also supports transformers's streamer to achieve the streaming output. In the example, we use transformers to encode input prompts to token ids. 
```Python
import xfastertransformer
from transformers import AutoTokenizer, TextStreamer
# Assume huggingface model dir is `/data/chatglm-6b-hf` and converted model dir is `/data/chatglm-6b-xft`.
MODEL_PATH="/data/chatglm-6b-xft"
TOKEN_PATH="/data/chatglm-6b-hf"

INPUT_PROMPT = "Once upon a time, there existed a little girl who liked to have adventures."
tokenizer = AutoTokenizer.from_pretrained(TOKEN_PATH, use_fast=False, padding_side="left", trust_remote_code=True)
streamer = TextStreamer(tokenizer, skip_special_tokens=True, skip_prompt=False)

input_ids = tokenizer(INPUT_PROMPT, return_tensors="pt", padding=False).input_ids
model = xfastertransformer.AutoModel.from_pretrained(MODEL_PATH, dtype="bf16")
generated_ids = model.generate(input_ids, max_length=200, streamer=streamer)
```

### C++ API
[SentencePiece](https://github.com/google/sentencepiece) can be used to tokenizer and detokenizer text.
```C++
#include <vector>
#include <iostream>
#include "xfastertransformer.h"
// ChatGLM token ids for prompt "Once upon a time, there existed a little girl who liked to have adventures."
std::vector<int> input(
        {3393, 955, 104, 163, 6, 173, 9166, 104, 486, 2511, 172, 7599, 103, 127, 17163, 7, 130001, 130004});

// Assume converted model dir is `/data/chatglm-6b-xft`.
xft::AutoModel model("/data/chatglm-6b-xft", xft::DataType::bf16);

model.config(/*max length*/ 100, /*num beams*/ 1);
model.input(/*input token ids*/ input, /*batch size*/ 1);

while (!model.isDone()) {
    std::vector<int> nextIds = model.generate();
}

std::vector<int> result = model.finalize();
for (auto id : result) {
    std::cout << id << " ";
}
std::cout << std::endl;
```

## How to run
Recommend preloading `libiomp5.so` to get a better performance. 
- ***[Recommended]*** Run `export $(python -c 'import xfastertransformer as xft; print(xft.get_env())')` if xfastertransformer's python wheel package is installed.
- `libiomp5.so` file will be in `3rdparty/mkl/lib` directory after building xFasterTransformer successfully if building from source code.

### Single rank
FasterTransformer will automatically check the MPI environment, or you can use the `SINGLE_INSTANCE=1` environment variable to forcefully deactivate MPI.  

### Multi ranks
#### Command line
Use MPI to run in the multi-ranks mode, please install oneCCL firstly.
- [oneCCL Installation](https://github.com/oneapi-src/oneCCL)
  - If you have built xfastertransformer from source, oneCCL is installed in 3rdparty when compilation.
    ```
    source ./3rdparty/oneccl/build/_install/env/setvars.sh
    ```
  - ***[Recommended]*** Use provided scripts to build it from source code. 
    ```bash
    cd 3rdparty
    sh prepare_oneccl.sh
    source ./oneccl/build/_install/env/setvars.sh
    ```
  - Install oneCCL through installing [Intel® oneAPI Base Toolkit](https://www.intel.com/content/www/us/en/developer/tools/oneapi/base-toolkit-download.html).***(Notice:It is recommended to use versions 2023.x and below.)*** And source the enviroment by:
    ```
    source /opt/intel/oneapi/setvars.sh
    ```

- Here is a example on local. 
  ```bash
  # or export LD_PRELOAD=libiomp5.so manually
  export $(python -c 'import xfastertransformer as xft; print(xft.get_env())')
  OMP_NUM_THREADS=48 mpirun \
    -n 1 numactl -N 0  -m 0 ${RUN_WORKLOAD} : \
    -n 1 numactl -N 1  -m 1 ${RUN_WORKLOAD} 
  ```

#### Code
For more details, please refer to examples.
##### Python
`model.rank` can get the process's rank, `model.rank == 0` is the Master.  
For Slaves, after loading the model, the only thing needs to do is `model.generate()`. The input and generation configuration will be auto synced.
```Python
model = xfastertransformer.AutoModel.from_pretrained("/data/chatglm-6b-xft", dtype="bf16")

# Slave
while True:
    model.generate()
```
##### C++
`model.getRank()` can get the process's rank, `model.getRank() == 0` is the Master.  
For Slaves, any value can be input to `model.config()` and `model.input` since Master's value will be synced.
```C++
xft::AutoModel model("/data/chatglm-6b-xft", xft::DataType::bf16);

// Slave
while (1) {
    model.config();
    std::vector<int> input_ids;
    model.input(/*input token ids*/ input_ids, /*batch size*/ 1);

    while (!model.isDone()) {
        model.generate();
    }
}
```

## [Web Demo](examples/web_demo/README.md)
A web demo based on [Gradio](https://www.gradio.app/) is provided in repo. Now support ChatGLM, ChatGLM2 and Llama2 models.
- [Perpare the model](#prepare-model).
- Install the dependencies
  ```bash
  pip install -r examples/web_demo/requirements.txt
  ```
  ***PS: Due to the potential compatibility issues between the model file and the `transformers` version, please select the appropriate `transformers` version.***
- Run the script corresponding to the model. After the web server started, open the output URL in the browser to use the demo. Please specify the paths of model and tokenizer directory, and data type. `transformer`'s tokenizer is used to encode and decode text so `${TOKEN_PATH}` means the huggingface model directory. This demo also support multi-rank.
```bash
# Recommend preloading `libiomp5.so` to get a better performance.
# or LD_PRELOAD=libiomp5.so manually, `libiomp5.so` file will be in `3rdparty/mkl/lib` directory after build xFasterTransformer.
export $(python -c 'import xfastertransformer as xft; print(xft.get_env())')
python examples/web_demo/ChatGLM.py \
                      --dtype=bf16 \
                      --token_path=${TOKEN_PATH} \
                      --model_path=${MODEL_PATH}
```

## Serving
### vLLM
A fork of vLLM has been created to integrate the xFasterTransformer backend, maintaining compatibility with most of the official vLLM's features. Refer [this link](serving/vllm-xft.md) for more detail.

#### Install
```bash
pip install vllm-xft
```
***Notice: Please do not install both `vllm-xft` and `vllm` simultaneously in the environment. Although the package names are different, they will actually overwrite each other.***

#### OpenAI Compatible Server
***Notice: Preload libiomp5.so is required!***
```bash
# Preload libiomp5.so by following cmd or LD_PRELOAD=libiomp5.so manually
export $(python -c 'import xfastertransformer as xft; print(xft.get_env())')

python -m vllm.entrypoints.openai.api_server \
        --model ${MODEL_PATH} \
        --tokenizer ${TOKEN_PATH} \
        --dtype bf16 \
        --kv-cache-dtype fp16 \
        --served-model-name xft \
        --port 8000 \
        --trust-remote-code
```
For multi-rank mode, please use `python -m vllm.entrypoints.slave` as slave and keep params of slaves align with master.
```bash
# Preload libiomp5.so by following cmd or LD_PRELOAD=libiomp5.so manually
export $(python -c 'import xfastertransformer as xft; print(xft.get_env())')

OMP_NUM_THREADS=48 mpirun \
        -n 1 numactl --all -C 0-47 -m 0 \
          python -m vllm.entrypoints.openai.api_server \
            --model ${MODEL_PATH} \
            --tokenizer ${TOKEN_PATH} \
            --dtype bf16 \
            --kv-cache-dtype fp16 \
            --served-model-name xft \
            --port 8000 \
            --trust-remote-code \
        : -n 1 numactl --all -C 48-95 -m 1 \
          python -m vllm.entrypoints.slave \
            --dtype bf16 \
            --model ${MODEL_PATH} \
            --kv-cache-dtype fp16
```

### FastChat
xFasterTransformer is an official inference backend of [FastChat](https://github.com/lm-sys/FastChat). Please refer to [xFasterTransformer in FastChat](https://github.com/lm-sys/FastChat/blob/main/docs/xFasterTransformer.md) and [FastChat's serving](https://github.com/lm-sys/FastChat/blob/main/docs/openai_api.md) for more details.

### MLServer
[A example serving of MLServer](serving/mlserver/README.md) is provided which supports REST and gRPC interface and adaptive batching feature to group inference requests together on the fly.

## [Benchmark](benchmark/README.md)

Benchmark scripts are provided to get the model inference performance quickly.
- [Prepare the model](#prepare-model).
- Install the dependencies, including oneCCL and python dependencies.
- Enter the `benchmark` folder and run `run_benchmark.sh`. Please refer to [Benchmark README](benchmark/README.md) for more information.

**Notes!!!**: The system and CPU configuration may be different. For the best performance, please try to modify OMP_NUM_THREADS, datatype and the memory nodes number (check the memory nodes using `numactl -H`) according to your test environment.

## Support

- xFasterTransformer email: xft.maintainer@intel.com
- xFasterTransformer [wechat](https://github.com/intel/xFasterTransformer/wiki)

## Accepted Papers
- ICLR'2024 on practical ML for limited/low resource settings: [Distributed Inference Performance Optimization for LLMs on CPUs](https://arxiv.org/abs/2407.00029)
- ICML'2024 on Foundation Models in the Wild: [Inference Performance Optimization for Large Language Models on CPUs](https://arxiv.org/abs/2407.07304)
- IEEE ICSESS 2024: All-in-one Approach for Large Language Models Inference

If xFT is useful for your research, please cite:
```latex
@article{he2024distributed,
  title={Distributed Inference Performance Optimization for LLMs on CPUs},
  author={He, Pujiang and Zhou, Shan and Li, Changqing and Huang, Wenhuan and Yu, Weifei and Wang, Duyi and Meng, Chen and Gui, Sheng},
  journal={arXiv preprint arXiv:2407.00029},
  year={2024}
}
```
and
```latex
@inproceedings{he2024inference,
  title={Inference Performance Optimization for Large Language Models on CPUs},
  author={He, Pujiang and Zhou, Shan and Huang, Wenhuan and Li, Changqing and Wang, Duyi and Guo, Bin and Meng, Chen and Gui, Sheng and Yu, Weifei and Xie, Yi},
  booktitle={ICML 2024 Workshop on Foundation Models in the Wild}
}
```

## Q&A

- ***Q***: Can xFasterTransformer run on a Intel® Core™ CPU?  
***A***: No. xFasterTransformer requires support for the AMX and AVX512 instruction sets, which are not available on Intel® Core™ CPUs.

- ***Q***: Can xFasterTransformer run on the Windows system?  
***A***: There is no native support for Windows, and all compatibility tests are only conducted on Linux, so Linux is recommended.

- ***Q***: Why does the program freeze or exit with errors when running in multi-rank mode after installing the latest version of oneCCL through oneAPI?  
***A***: Please try downgrading oneAPI to version 2023.x or below, or use the provided script to install oneCCL from source code.

- ***Q***: Why does running the program using two CPU sockets result in much lower performance compared to running on a single CPU socket?  
***A***: Running in this way causes the program to engage in many unnecessary cross-socket communications, significantly impacting performance. If there is a need for cross-socket deployment, consider running in a multi-rank mode with one rank on each socket.

- ***Q***:The performance is normal when running in a single rank, but why is the performance very slow and the CPU utilization very low when using MPI to run multiple ranks?   
***A***:This is because the program launched through MPI reads `OMP_NUM_THREADS=1`, which cannot correctly retrieve the appropriate value from the environment. It is necessary to manually set the value of `OMP_NUM_THREADS` based on the actual situation.

- ***Q***: Why do I still encounter errors when converting already supported models?  
***A***: Try downgrading `transformer` to an appropriate version, such as the version specified in the `requirements.txt`. This is because different versions of Transformer may change the names of certain variables.

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/intel/xFasterTransformer",
    "name": "xfastertransformer",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.8",
    "maintainer_email": null,
    "keywords": "LLM",
    "author": "xFasterTransformer",
    "author_email": "xft.maintainer@intel.com",
    "download_url": null,
    "platform": "x86_64",
    "description": "# xFasterTransformer\n\n<p align=\"center\">\n  <a href=\"./README.md\">English</a> |\n  <a href=\"./README_CN.md\">\u7b80\u4f53\u4e2d\u6587</a>\n</p>\n\nxFasterTransformer is an exceptionally optimized solution for large language models (LLM) on the X86 platform, which is similar to FasterTransformer on the GPU platform. xFasterTransformer is able to operate in distributed mode across multiple sockets and nodes to support inference on larger models. Additionally, it provides both C++ and Python APIs, spanning from high-level to low-level interfaces, making it easy to adopt and integrate.\n\n## Table of Contents\n- [xFasterTransformer](#xfastertransformer)\n  - [Table of Contents](#table-of-contents)\n  - [Models overview](#models-overview)\n    - [Model support matrix](#model-support-matrix)\n    - [DataType support list](#datatype-support-list)\n  - [Documents](#documents)\n  - [Installation](#installation)\n    - [From PyPI](#from-pypi)\n    - [Using Docker](#using-docker)\n    - [Built from source](#built-from-source)\n      - [Prepare Environment](#prepare-environment)\n        - [Manually](#manually)\n        - [Install dependent libraries](#install-dependent-libraries)\n        - [How to build](#how-to-build)\n  - [Models Preparation](#models-preparation)\n  - [API usage](#api-usage)\n    - [Python API(PyTorch)](#python-apipytorch)\n    - [C++ API](#c-api)\n  - [How to run](#how-to-run)\n    - [Single rank](#single-rank)\n    - [Multi ranks](#multi-ranks)\n      - [Command line](#command-line)\n      - [Code](#code)\n        - [Python](#python)\n        - [C++](#c)\n  - [Web Demo](#web-demo)\n  - [Serving](#serving)\n    - [vLLM](#vllm)\n      - [Install](#install)\n      - [OpenAI Compatible Server](#openai-compatible-server)\n    - [FastChat](#fastchat)\n    - [MLServer](#mlserver)\n  - [Benchmark](#benchmark)\n  - [Support](#support)\n  - [Accepted Papers](#accepted-papers)\n  - [Q\\&A](#qa)\n\n## Models overview\nLarge Language Models (LLMs) develops very fast and are more widely used in many AI scenarios. xFasterTransformer is an optimized solution for LLM inference using the mainstream and popular LLM models on Xeon. xFasterTransformer fully leverages the hardware capabilities of Xeon platforms to achieve the high performance and high scalability of LLM inference both on single socket and multiple sockets/multiple nodes.\n\nxFasterTransformer provides a series of APIs, both of C++ and Python, for end users to integrate xFasterTransformer into their own solutions or services directly. Many kinds of example codes are also provided to demonstrate the usage. Benchmark codes and scripts are provided for users to show the performance. Web demos for popular LLM models are also provided.\n\n\n### Model support matrix\n\n|       Models       | Framework |          | Distribution |\n| :----------------: | :-------: | :------: | :----------: |\n|                    |  PyTorch  |   C++    |              |\n|      ChatGLM       | &#10004;  | &#10004; |   &#10004;   |\n|      ChatGLM2      | &#10004;  | &#10004; |   &#10004;   |\n|      ChatGLM3      | &#10004;  | &#10004; |   &#10004;   |\n|        GLM4        | &#10004;  | &#10004; |   &#10004;   |\n|       Llama        | &#10004;  | &#10004; |   &#10004;   |\n|       Llama2       | &#10004;  | &#10004; |   &#10004;   |\n|       Llama3       | &#10004;  | &#10004; |   &#10004;   |\n|     Baichuan       | &#10004;  | &#10004; |   &#10004;   |\n|     Baichuan2      | &#10004;  | &#10004; |   &#10004;   |\n|        QWen        | &#10004;  | &#10004; |   &#10004;   |\n|        QWen2       | &#10004;  | &#10004; |   &#10004;   |\n| SecLLM(YaRN-Llama) | &#10004;  | &#10004; |   &#10004;   |\n|        Opt         | &#10004;  | &#10004; |   &#10004;   |\n|   Deepseek-coder   | &#10004;  | &#10004; |   &#10004;   |\n|      gemma         | &#10004;  | &#10004; |   &#10004;   |\n|     gemma-1.1      | &#10004;  | &#10004; |   &#10004;   |\n|     codegemma      | &#10004;  | &#10004; |   &#10004;   |\n\n### DataType support list\n\n- FP16\n- BF16\n- INT8\n- W8A8\n- INT4\n- NF4\n- BF16_FP16\n- BF16_INT8\n- BF16_W8A8\n- BF16_INT4\n- BF16_NF4\n- W8A8_INT8\n- W8A8_int4\n- W8A8_NF4\n\n## Documents\nxFasterTransformer Documents and [Wiki](https://github.com/intel/xFasterTransformer/wiki) provides the following resources:\n- An introduction to xFasterTransformer.\n- Comprehensive API references for both high-level and low-level interfaces in C++ and PyTorch.\n- Practical API usage examples for xFasterTransformer in both C++ and PyTorch.\n\n## Installation\n### From PyPI\n```bash\npip install xfastertransformer\n```\n\n### Using Docker\n```bash\ndocker pull intel/xfastertransformer:latest\n```\nRun the docker with the command (Assume model files are in `/data/` directory):  \n```bash\ndocker run -it \\\n    --name xfastertransformer \\\n    --privileged \\\n    --shm-size=16g \\\n    -v /data/:/data/ \\\n    -e \"http_proxy=$http_proxy\" \\\n    -e \"https_proxy=$https_proxy\" \\\n    intel/xfastertransformer:latest\n```\n**Notice!!!**: Please enlarge `--shm-size` if  **bus error** occurred while running in the multi-ranks mode. The default docker limits the shared memory size to 64MB and our implementation uses many shared memories to achieve a  better performance.\n\n### Built from source\n#### Prepare Environment\n##### Manually\n- [PyTorch](https://pytorch.org/get-started/locally/) v2.3 (When using the PyTorch API, it's required, but it's not needed when using the C++ API.)\n  ```bash \n  pip install torch --index-url https://download.pytorch.org/whl/cpu\n  ```\n\n- For GPU, xFT needs ABI=1 from [torch==2.3.0+cpu.cxx11.abi](https://download.pytorch.org/whl/cpu-cxx11-abi/torch-2.3.0%2Bcpu.cxx11.abi-cp38-cp38-linux_x86_64.whl#sha256=c34512c3e07efe9b7fb5c3a918fef1a7c6eb8969c6b2eea92ee5c16a0583fe12) in [torch-whl-list](https://download.pytorch.org/whl/torch/) due to DPC++ need ABI=1.\n\n##### Install dependent libraries\n\nPlease install libnuma package:\n- CentOS: yum install libnuma-devel\n- Ubuntu: apt-get install libnuma-dev\n\n##### How to build\n- Using 'CMake'\n  ```bash\n  # Build xFasterTransformer\n  git clone https://github.com/intel/xFasterTransformer.git xFasterTransformer\n  cd xFasterTransformer\n  git checkout <latest-tag>\n  # Please make sure torch is installed when run python example\n  mkdir build && cd build\n  cmake ..\n  make -j\n  ```\n- Using `python setup.py`\n  ```bash\n  # Build xFasterTransformer library and C++ example.\n  python setup.py build\n\n  # Install xFasterTransformer into pip environment.\n  # Notice: Run `python setup.py build` before installation!\n  python setup.py install\n  ```\n\n## [Models Preparation](tools/README.md)\nxFasterTransformer supports a different model format from Huggingface, but it's compatible with FasterTransformer's format.\n1. Download the huggingface format model firstly.\n2. After that, convert the model into xFasterTransformer format by using model convert module in xfastertransformer. If output directory is not provided, converted model will be placed into `${HF_DATASET_DIR}-xft`.\n    ```\n    python -c 'import xfastertransformer as xft; xft.LlamaConvert().convert(\"${HF_DATASET_DIR}\",\"${OUTPUT_DIR}\")'\n    ```\n    ***PS: Due to the potential compatibility issues between the model file and the `transformers` version, please select the appropriate `transformers` version.***\n    \n    Supported model convert list:\n    - LlamaConvert\n    - YiConvert\n    - GemmaConvert\n    - ChatGLMConvert\n    - ChatGLM2Convert\n    - ChatGLM4Convert\n    - OPTConvert\n    - BaichuanConvert\n    - Baichuan2Convert\n    - QwenConvert\n    - Qwen2Convert\n    - DeepseekConvert\n\n## API usage\nFor more details, please see API document and [examples](examples/README.md).\n### Python API(PyTorch)\nFirstly, please install the dependencies.\n- Python dependencies\n  ```bash\n  pip install -r requirements.txt\n  ```\n  ***PS: Due to the potential compatibility issues between the model file and the `transformers` version, please select the appropriate `transformers` version.***\n- oneCCL (For multi ranks)  \n  Install oneCCL and setup the environment. Please refer to [Prepare Environment](#prepare-environment).\n\n\nxFasterTransformer's Python API is similar to transformers and also supports transformers's streamer to achieve the streaming output. In the example, we use transformers to encode input prompts to token ids. \n```Python\nimport xfastertransformer\nfrom transformers import AutoTokenizer, TextStreamer\n# Assume huggingface model dir is `/data/chatglm-6b-hf` and converted model dir is `/data/chatglm-6b-xft`.\nMODEL_PATH=\"/data/chatglm-6b-xft\"\nTOKEN_PATH=\"/data/chatglm-6b-hf\"\n\nINPUT_PROMPT = \"Once upon a time, there existed a little girl who liked to have adventures.\"\ntokenizer = AutoTokenizer.from_pretrained(TOKEN_PATH, use_fast=False, padding_side=\"left\", trust_remote_code=True)\nstreamer = TextStreamer(tokenizer, skip_special_tokens=True, skip_prompt=False)\n\ninput_ids = tokenizer(INPUT_PROMPT, return_tensors=\"pt\", padding=False).input_ids\nmodel = xfastertransformer.AutoModel.from_pretrained(MODEL_PATH, dtype=\"bf16\")\ngenerated_ids = model.generate(input_ids, max_length=200, streamer=streamer)\n```\n\n### C++ API\n[SentencePiece](https://github.com/google/sentencepiece) can be used to tokenizer and detokenizer text.\n```C++\n#include <vector>\n#include <iostream>\n#include \"xfastertransformer.h\"\n// ChatGLM token ids for prompt \"Once upon a time, there existed a little girl who liked to have adventures.\"\nstd::vector<int> input(\n        {3393, 955, 104, 163, 6, 173, 9166, 104, 486, 2511, 172, 7599, 103, 127, 17163, 7, 130001, 130004});\n\n// Assume converted model dir is `/data/chatglm-6b-xft`.\nxft::AutoModel model(\"/data/chatglm-6b-xft\", xft::DataType::bf16);\n\nmodel.config(/*max length*/ 100, /*num beams*/ 1);\nmodel.input(/*input token ids*/ input, /*batch size*/ 1);\n\nwhile (!model.isDone()) {\n    std::vector<int> nextIds = model.generate();\n}\n\nstd::vector<int> result = model.finalize();\nfor (auto id : result) {\n    std::cout << id << \" \";\n}\nstd::cout << std::endl;\n```\n\n## How to run\nRecommend preloading `libiomp5.so` to get a better performance. \n- ***[Recommended]*** Run `export $(python -c 'import xfastertransformer as xft; print(xft.get_env())')` if xfastertransformer's python wheel package is installed.\n- `libiomp5.so` file will be in `3rdparty/mkl/lib` directory after building xFasterTransformer successfully if building from source code.\n\n### Single rank\nFasterTransformer will automatically check the MPI environment, or you can use the `SINGLE_INSTANCE=1` environment variable to forcefully deactivate MPI.  \n\n### Multi ranks\n#### Command line\nUse MPI to run in the multi-ranks mode, please install oneCCL firstly.\n- [oneCCL Installation](https://github.com/oneapi-src/oneCCL)\n  - If you have built xfastertransformer from source, oneCCL is installed in 3rdparty when compilation.\n    ```\n    source ./3rdparty/oneccl/build/_install/env/setvars.sh\n    ```\n  - ***[Recommended]*** Use provided scripts to build it from source code. \n    ```bash\n    cd 3rdparty\n    sh prepare_oneccl.sh\n    source ./oneccl/build/_install/env/setvars.sh\n    ```\n  - Install oneCCL through installing [Intel\u00ae oneAPI Base Toolkit](https://www.intel.com/content/www/us/en/developer/tools/oneapi/base-toolkit-download.html).***(Notice:It is recommended to use versions 2023.x and below.)*** And source the enviroment by:\n    ```\n    source /opt/intel/oneapi/setvars.sh\n    ```\n\n- Here is a example on local. \n  ```bash\n  # or export LD_PRELOAD=libiomp5.so manually\n  export $(python -c 'import xfastertransformer as xft; print(xft.get_env())')\n  OMP_NUM_THREADS=48 mpirun \\\n    -n 1 numactl -N 0  -m 0 ${RUN_WORKLOAD} : \\\n    -n 1 numactl -N 1  -m 1 ${RUN_WORKLOAD} \n  ```\n\n#### Code\nFor more details, please refer to examples.\n##### Python\n`model.rank` can get the process's rank, `model.rank == 0` is the Master.  \nFor Slaves, after loading the model, the only thing needs to do is `model.generate()`. The input and generation configuration will be auto synced.\n```Python\nmodel = xfastertransformer.AutoModel.from_pretrained(\"/data/chatglm-6b-xft\", dtype=\"bf16\")\n\n# Slave\nwhile True:\n    model.generate()\n```\n##### C++\n`model.getRank()` can get the process's rank, `model.getRank() == 0` is the Master.  \nFor Slaves, any value can be input to `model.config()` and `model.input` since Master's value will be synced.\n```C++\nxft::AutoModel model(\"/data/chatglm-6b-xft\", xft::DataType::bf16);\n\n// Slave\nwhile (1) {\n    model.config();\n    std::vector<int> input_ids;\n    model.input(/*input token ids*/ input_ids, /*batch size*/ 1);\n\n    while (!model.isDone()) {\n        model.generate();\n    }\n}\n```\n\n## [Web Demo](examples/web_demo/README.md)\nA web demo based on [Gradio](https://www.gradio.app/) is provided in repo. Now support ChatGLM, ChatGLM2 and Llama2 models.\n- [Perpare the model](#prepare-model).\n- Install the dependencies\n  ```bash\n  pip install -r examples/web_demo/requirements.txt\n  ```\n  ***PS: Due to the potential compatibility issues between the model file and the `transformers` version, please select the appropriate `transformers` version.***\n- Run the script corresponding to the model. After the web server started, open the output URL in the browser to use the demo. Please specify the paths of model and tokenizer directory, and data type. `transformer`'s tokenizer is used to encode and decode text so `${TOKEN_PATH}` means the huggingface model directory. This demo also support multi-rank.\n```bash\n# Recommend preloading `libiomp5.so` to get a better performance.\n# or LD_PRELOAD=libiomp5.so manually, `libiomp5.so` file will be in `3rdparty/mkl/lib` directory after build xFasterTransformer.\nexport $(python -c 'import xfastertransformer as xft; print(xft.get_env())')\npython examples/web_demo/ChatGLM.py \\\n                      --dtype=bf16 \\\n                      --token_path=${TOKEN_PATH} \\\n                      --model_path=${MODEL_PATH}\n```\n\n## Serving\n### vLLM\nA fork of vLLM has been created to integrate the xFasterTransformer backend, maintaining compatibility with most of the official vLLM's features. Refer [this link](serving/vllm-xft.md) for more detail.\n\n#### Install\n```bash\npip install vllm-xft\n```\n***Notice: Please do not install both `vllm-xft` and `vllm` simultaneously in the environment. Although the package names are different, they will actually overwrite each other.***\n\n#### OpenAI Compatible Server\n***Notice: Preload libiomp5.so is required!***\n```bash\n# Preload libiomp5.so by following cmd or LD_PRELOAD=libiomp5.so manually\nexport $(python -c 'import xfastertransformer as xft; print(xft.get_env())')\n\npython -m vllm.entrypoints.openai.api_server \\\n        --model ${MODEL_PATH} \\\n        --tokenizer ${TOKEN_PATH} \\\n        --dtype bf16 \\\n        --kv-cache-dtype fp16 \\\n        --served-model-name xft \\\n        --port 8000 \\\n        --trust-remote-code\n```\nFor multi-rank mode, please use `python -m vllm.entrypoints.slave` as slave and keep params of slaves align with master.\n```bash\n# Preload libiomp5.so by following cmd or LD_PRELOAD=libiomp5.so manually\nexport $(python -c 'import xfastertransformer as xft; print(xft.get_env())')\n\nOMP_NUM_THREADS=48 mpirun \\\n        -n 1 numactl --all -C 0-47 -m 0 \\\n          python -m vllm.entrypoints.openai.api_server \\\n            --model ${MODEL_PATH} \\\n            --tokenizer ${TOKEN_PATH} \\\n            --dtype bf16 \\\n            --kv-cache-dtype fp16 \\\n            --served-model-name xft \\\n            --port 8000 \\\n            --trust-remote-code \\\n        : -n 1 numactl --all -C 48-95 -m 1 \\\n          python -m vllm.entrypoints.slave \\\n            --dtype bf16 \\\n            --model ${MODEL_PATH} \\\n            --kv-cache-dtype fp16\n```\n\n### FastChat\nxFasterTransformer is an official inference backend of [FastChat](https://github.com/lm-sys/FastChat). Please refer to [xFasterTransformer in FastChat](https://github.com/lm-sys/FastChat/blob/main/docs/xFasterTransformer.md) and [FastChat's serving](https://github.com/lm-sys/FastChat/blob/main/docs/openai_api.md) for more details.\n\n### MLServer\n[A example serving of MLServer](serving/mlserver/README.md) is provided which supports REST and gRPC interface and adaptive batching feature to group inference requests together on the fly.\n\n## [Benchmark](benchmark/README.md)\n\nBenchmark scripts are provided to get the model inference performance quickly.\n- [Prepare the model](#prepare-model).\n- Install the dependencies, including oneCCL and python dependencies.\n- Enter the `benchmark` folder and run `run_benchmark.sh`. Please refer to [Benchmark README](benchmark/README.md) for more information.\n\n**Notes!!!**: The system and CPU configuration may be different. For the best performance, please try to modify OMP_NUM_THREADS, datatype and the memory nodes number (check the memory nodes using `numactl -H`) according to your test environment.\n\n## Support\n\n- xFasterTransformer email: xft.maintainer@intel.com\n- xFasterTransformer [wechat](https://github.com/intel/xFasterTransformer/wiki)\n\n## Accepted Papers\n- ICLR'2024 on practical ML for limited/low resource settings: [Distributed Inference Performance Optimization for LLMs on CPUs](https://arxiv.org/abs/2407.00029)\n- ICML'2024 on Foundation Models in the Wild: [Inference Performance Optimization for Large Language Models on CPUs](https://arxiv.org/abs/2407.07304)\n- IEEE ICSESS 2024: All-in-one Approach for Large Language Models Inference\n\nIf xFT is useful for your research, please cite:\n```latex\n@article{he2024distributed,\n  title={Distributed Inference Performance Optimization for LLMs on CPUs},\n  author={He, Pujiang and Zhou, Shan and Li, Changqing and Huang, Wenhuan and Yu, Weifei and Wang, Duyi and Meng, Chen and Gui, Sheng},\n  journal={arXiv preprint arXiv:2407.00029},\n  year={2024}\n}\n```\nand\n```latex\n@inproceedings{he2024inference,\n  title={Inference Performance Optimization for Large Language Models on CPUs},\n  author={He, Pujiang and Zhou, Shan and Huang, Wenhuan and Li, Changqing and Wang, Duyi and Guo, Bin and Meng, Chen and Gui, Sheng and Yu, Weifei and Xie, Yi},\n  booktitle={ICML 2024 Workshop on Foundation Models in the Wild}\n}\n```\n\n## Q&A\n\n- ***Q***: Can xFasterTransformer run on a Intel\u00ae Core\u2122 CPU?  \n***A***: No. xFasterTransformer requires support for the AMX and AVX512 instruction sets, which are not available on Intel\u00ae Core\u2122 CPUs.\n\n- ***Q***: Can xFasterTransformer run on the Windows system?  \n***A***: There is no native support for Windows, and all compatibility tests are only conducted on Linux, so Linux is recommended.\n\n- ***Q***: Why does the program freeze or exit with errors when running in multi-rank mode after installing the latest version of oneCCL through oneAPI?  \n***A***: Please try downgrading oneAPI to version 2023.x or below, or use the provided script to install oneCCL from source code.\n\n- ***Q***: Why does running the program using two CPU sockets result in much lower performance compared to running on a single CPU socket?  \n***A***: Running in this way causes the program to engage in many unnecessary cross-socket communications, significantly impacting performance. If there is a need for cross-socket deployment, consider running in a multi-rank mode with one rank on each socket.\n\n- ***Q***:The performance is normal when running in a single rank, but why is the performance very slow and the CPU utilization very low when using MPI to run multiple ranks?   \n***A***:This is because the program launched through MPI reads `OMP_NUM_THREADS=1`, which cannot correctly retrieve the appropriate value from the environment. It is necessary to manually set the value of `OMP_NUM_THREADS` based on the actual situation.\n\n- ***Q***: Why do I still encounter errors when converting already supported models?  \n***A***: Try downgrading `transformer` to an appropriate version, such as the version specified in the `requirements.txt`. This is because different versions of Transformer may change the names of certain variables.\n",
    "bugtrack_url": null,
    "license": "Apache 2.0",
    "summary": "Boost large language model inference performance on CPU platform.",
    "version": "1.8.1",
    "project_urls": {
        "Homepage": "https://github.com/intel/xFasterTransformer"
    },
    "split_keywords": [
        "llm"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "3017925fb9063151f96093c2e9bb47b538d712e45e9640d21e11ccd673090b66",
                "md5": "a1f371bbaba0611a338d3e08c7cf88ba",
                "sha256": "8c0d9e719ef810368ec4f40c0de5855a6beeb4c0fe2d07b6b8d06b8b2bea256d"
            },
            "downloads": -1,
            "filename": "xfastertransformer-1.8.1-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "a1f371bbaba0611a338d3e08c7cf88ba",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.8",
            "size": 39541767,
            "upload_time": "2024-07-31T08:34:50",
            "upload_time_iso_8601": "2024-07-31T08:34:50.149993Z",
            "url": "https://files.pythonhosted.org/packages/30/17/925fb9063151f96093c2e9bb47b538d712e45e9640d21e11ccd673090b66/xfastertransformer-1.8.1-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-07-31 08:34:50",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "intel",
    "github_project": "xFasterTransformer",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "requirements": [],
    "lcname": "xfastertransformer"
}

xFasterTransformer