| Name | qwen-cpp |
| --- | --- |
| Version | 0.1.3 |
| home_page | |
| Summary | C++ implementation of qwen & tiktoken |
| upload_time | 2023-11-17 10:17:49 |
| maintainer | |
| docs_url | None |
| author | Shijie Wang |
| requires_python | >=3.7 |
| license | MIT License |
| keywords | qwen, tiktoken |
| VCS | |
| bugtrack_url | |
| requirements | No requirements were recorded. |
| Travis-CI | No Travis. |
| coveralls test coverage | No coveralls. |

# qwen.cpp
C++ implementation of [Qwen-LM](https://github.com/QwenLM/Qwen) for real-time chatting on your MacBook.
## Features
Highlights:
* [x] Pure C++ implementation based on [ggml](https://github.com/ggerganov/ggml), working in the same way as [llama.cpp](https://github.com/ggerganov/llama.cpp).
* [x] Pure C++ tiktoken implementation.
* [x] Streaming generation with typewriter effect.
* [x] Python binding.
Support Matrix:
* Hardware: x86/ARM CPU, NVIDIA GPU
* Platforms: Linux, macOS
* Models: [Qwen-LM](https://github.com/QwenLM/Qwen)
## Getting Started
**Preparation**
Clone the qwen.cpp repository to your local machine:
```sh
git clone --recursive https://github.com/QwenLM/qwen.cpp && cd qwen.cpp
```
If you forgot the `--recursive` flag when cloning the repository, run the following command in the `qwen.cpp` folder:
```sh
git submodule update --init --recursive
```
Download the qwen.tiktoken file from [Hugging Face](https://huggingface.co/Qwen/Qwen-7B-Chat/blob/main/qwen.tiktoken) or [ModelScope](https://modelscope.cn/models/qwen/Qwen-7B-Chat/files).
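If you would rather fetch the file from a script than through the browser, a minimal sketch using the `huggingface_hub` package (not a qwen.cpp dependency; shown only as an optional convenience) could look like this:
```python
# Optional: download qwen.tiktoken programmatically with huggingface_hub.
# Assumes `pip install huggingface_hub`; this is not required by qwen.cpp itself.
from huggingface_hub import hf_hub_download

tiktoken_path = hf_hub_download(repo_id="Qwen/Qwen-7B-Chat", filename="qwen.tiktoken")
print(tiktoken_path)  # local cache path; pass this to --tiktoken later
```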
**Quantize Model**
Use `convert.py` to transform Qwen-LM into quantized GGML format. For example, to convert the original fp16 model to a q4_0 (4-bit quantized) GGML model, run:
```sh
python3 qwen_cpp/convert.py -i Qwen/Qwen-7B-Chat -t q4_0 -o qwen7b-ggml.bin
```
The original model (`-i <model_name_or_path>`) can be a Hugging Face model name or a local path to your pre-downloaded model. Currently supported models are:
* Qwen-7B: `Qwen/Qwen-7B-Chat`
* Qwen-14B: `Qwen/Qwen-14B-Chat`
You are free to try any of the quantization types below by specifying `-t <type>` (see the sketch after this list for an illustration of this style of block quantization):
* `q4_0`: 4-bit integer quantization with fp16 scales.
* `q4_1`: 4-bit integer quantization with fp16 scales and minimum values.
* `q5_0`: 5-bit integer quantization with fp16 scales.
* `q5_1`: 5-bit integer quantization with fp16 scales and minimum values.
* `q8_0`: 8-bit integer quantization with fp16 scales.
* `f16`: half precision floating point weights without quantization.
* `f32`: single precision floating point weights without quantization.
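For intuition, here is a simplified, illustrative sketch of what q4_0-style block quantization does: weights are grouped into small blocks, each block stores one fp16 scale, and each weight is kept as a 4-bit integer. This is not the actual ggml kernel, whose memory layout and rounding rules differ.
```python
# Illustrative only: simplified q4_0-style block quantization, not ggml's implementation.
import numpy as np

BLOCK_SIZE = 32  # weights per block; ggml's q4_0 also groups weights in blocks of 32

def quantize_blocks(weights: np.ndarray):
    blocks = weights.reshape(-1, BLOCK_SIZE)
    # one scale per block so the largest magnitude maps into the signed 4-bit range [-8, 7]
    scales = np.maximum(np.abs(blocks).max(axis=1, keepdims=True) / 7.0, 1e-8)
    quants = np.clip(np.round(blocks / scales), -8, 7).astype(np.int8)
    return scales.astype(np.float16), quants  # fp16 scale + int4-range values per block

def dequantize_blocks(scales: np.ndarray, quants: np.ndarray) -> np.ndarray:
    return (quants.astype(np.float32) * scales.astype(np.float32)).ravel()

weights = np.random.randn(4 * BLOCK_SIZE).astype(np.float32)
scales, quants = quantize_blocks(weights)
error = np.abs(dequantize_blocks(scales, quants) - weights).max()
print(f"max reconstruction error: {error:.4f}")
```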
**Build & Run**
Compile the project using CMake:
```sh
cmake -B build
cmake --build build -j --config Release
```
Now you may chat with the quantized Qwen-7B-Chat model by running:
```sh
./build/bin/main -m qwen7b-ggml.bin --tiktoken Qwen-7B-Chat/qwen.tiktoken -p 你好
# 你好!很高兴为你提供帮助。
```
To run the model in interactive mode, add the `-i` flag. For example:
```sh
./build/bin/main -m qwen7b-ggml.bin --tiktoken Qwen-7B-Chat/qwen.tiktoken -i
```
In interactive mode, your chat history serves as the context for the next round of the conversation.
Run `./build/bin/main -h` to explore more options!
## Using BLAS
**OpenBLAS**
OpenBLAS provides acceleration on CPU. Add the CMake flag `-DGGML_OPENBLAS=ON` to enable it.
```sh
cmake -B build -DGGML_OPENBLAS=ON && cmake --build build -j
```
**cuBLAS**
cuBLAS uses NVIDIA GPU to accelerate BLAS. Add the CMake flag `-DGGML_CUBLAS=ON` to enable it.
```sh
cmake -B build -DGGML_CUBLAS=ON && cmake --build build -j
```
**Metal**
MPS (Metal Performance Shaders) allows computation to run on the Apple Silicon GPU. Add the CMake flag `-DGGML_METAL=ON` to enable it.
```sh
cmake -B build -DGGML_METAL=ON && cmake --build build -j
```
## Python Binding
The Python binding provides high-level `chat` and `stream_chat` interfaces similar to those of the original Hugging Face Qwen-7B.
**Installation**
Install from PyPI (recommended); this will trigger compilation on your platform.
```sh
pip install -U qwen-cpp
```
You may also install from source.
```sh
# install from the latest source hosted on GitHub
pip install git+https://github.com/QwenLM/qwen.cpp.git@master
# or install from your local source after git cloning the repo
pip install .
```
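As a rough illustration of the `chat` and `stream_chat` interfaces mentioned above, a session might look like the sketch below. The `Pipeline` constructor arguments and method signatures here are assumptions made for illustration, not the verified API; check the package's own documentation or `help(qwen_cpp)` for the exact interface.
```python
# Hypothetical usage sketch of the qwen_cpp binding; constructor and method
# signatures are assumptions, not the verified API.
import qwen_cpp

pipeline = qwen_cpp.Pipeline("./qwen7b-ggml.bin", "./qwen.tiktoken")

# one-shot chat: pass the conversation history, get the reply as a string
print(pipeline.chat(["你好"]))

# streaming chat: iterate over generated pieces for a typewriter effect
for piece in pipeline.stream_chat(["你好"]):
    print(piece, end="", flush=True)
print()
```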
## tiktoken.cpp
We provide a pure C++ tiktoken implementation. After installation, the usage is the same as OpenAI tiktoken:
```python
import tiktoken_cpp as tiktoken
enc = tiktoken.get_encoding("cl100k_base")
assert enc.decode(enc.encode("hello world")) == "hello world"
```
**Benchmark**
The speed of tiktoken.cpp is on par with OpenAI tiktoken:
```sh
cd tests
RAYON_NUM_THREADS=1 python benchmark.py
```
## Development
**Unit Test**
To run the unit tests, add the CMake flag `-DQWEN_ENABLE_TESTING=ON`, then recompile and run the test binary (the benchmark is included).
```sh
mkdir -p build && cd build
cmake .. -DQWEN_ENABLE_TESTING=ON && make -j
./bin/qwen_test
```
**Lint**
To format the code, run `make lint` inside the `build` folder. You should have `clang-format`, `black` and `isort` pre-installed.
## Acknowledgements
* This project is greatly inspired by [llama.cpp](https://github.com/ggerganov/llama.cpp), [chatglm.cpp](https://github.com/li-plus/chatglm.cpp), [ggml](https://github.com/ggerganov/ggml), [tiktoken](https://github.com/openai/tiktoken), [tokenizer](https://github.com/sewenew/tokenizer), [cpp-base64](https://github.com/ReneNyffenegger/cpp-base64), [re2](https://github.com/google/re2) and [unordered_dense](https://github.com/martinus/unordered_dense).
Raw data

```json
{
    "_id": null,
    "home_page": "",
    "name": "qwen-cpp",
    "maintainer": "",
    "docs_url": null,
    "requires_python": ">=3.7",
    "maintainer_email": "Shijie Wang <saimeng.wsj@alibaba-inc.com>",
    "keywords": "qwen,tiktoken",
    "author": "Shijie Wang",
    "author_email": "Shijie Wang <saimeng.wsj@alibaba-inc.com>",
    "download_url": "https://files.pythonhosted.org/packages/a4/76/61e947717636072018ce25a5929af05b0e47538a564cc0c3298935b38a49/qwen-cpp-0.1.3.tar.gz",
    "platform": null,
"description": "# qwen.cpp\n\nC++ implementation of [Qwen-LM](https://github.com/QwenLM/Qwen) for real-time chatting on your MacBook.\n\n## Features\n\nHighlights:\n* [x] Pure C++ implementation based on [ggml](https://github.com/ggerganov/ggml), working in the same way as [llama.cpp](https://github.com/ggerganov/llama.cpp).\n* [x] Pure C++ tiktoken implementation.\n* [x] Streaming generation with typewriter effect.\n* [x] Python binding.\n\nSupport Matrix:\n* Hardwares: x86/arm CPU, NVIDIA GPU\n* Platforms: Linux, MacOS\n* Models: [Qwen-LM](https://github.com/QwenLM/Qwen)\n\n## Getting Started\n\n**Preparation**\n\nClone the qwen.cpp repository into your local machine:\n```sh\ngit clone --recursive https://github.com/QwenLM/qwen.cpp && cd qwen.cpp\n```\n\nIf you forgot the `--recursive` flag when cloning the repository, run the following command in the `qwen.cpp` folder:\n```sh\ngit submodule update --init --recursive\n```\n\nDownload the qwen.tiktoken file from [Hugging Face](https://huggingface.co/Qwen/Qwen-7B-Chat/blob/main/qwen.tiktoken) or [modelscope](https://modelscope.cn/models/qwen/Qwen-7B-Chat/files).\n\n**Quantize Model**\n\nUse `convert.py` to transform Qwen-LM into quantized GGML format. For example, to convert the fp16 original model to q4_0 (quantized int4) GGML model, run:\n```sh\npython3 qwen_cpp/convert.py -i Qwen/Qwen-7B-Chat -t q4_0 -o qwen7b-ggml.bin\n```\n\nThe original model (`-i <model_name_or_path>`) can be a HuggingFace model name or a local path to your pre-downloaded model. Currently supported models are:\n* Qwen-7B: `Qwen/Qwen-7B-Chat`\n* Qwen-14B: `Qwen/Qwen-14B-Chat`\n\nYou are free to try any of the below quantization types by specifying `-t <type>`:\n* `q4_0`: 4-bit integer quantization with fp16 scales.\n* `q4_1`: 4-bit integer quantization with fp16 scales and minimum values.\n* `q5_0`: 5-bit integer quantization with fp16 scales.\n* `q5_1`: 5-bit integer quantization with fp16 scales and minimum values.\n* `q8_0`: 8-bit integer quantization with fp16 scales.\n* `f16`: half precision floating point weights without quantization.\n* `f32`: single precision floating point weights without quantization.\n\n**Build & Run**\n\nCompile the project using CMake:\n```sh\ncmake -B build\ncmake --build build -j --config Release\n```\n\nNow you may chat with the quantized Qwen-7B-Chat model by running:\n```sh\n./build/bin/main -m qwen7b-ggml.bin --tiktoken Qwen-7B-Chat/qwen.tiktoken -p \u4f60\u597d\n# \u4f60\u597d\uff01\u5f88\u9ad8\u5174\u4e3a\u4f60\u63d0\u4f9b\u5e2e\u52a9\u3002\n```\n\nTo run the model in interactive mode, add the `-i` flag. For example:\n```sh\n./build/bin/main -m qwen7b-ggml.bin --tiktoken Qwen-7B-Chat/qwen.tiktoken -i\n```\nIn interactive mode, your chat history will serve as the context for the next-round conversation.\n\nRun `./build/bin/main -h` to explore more options!\n\n## Using BLAS\n\n**OpenBLAS**\n\nOpenBLAS provides acceleration on CPU. Add the CMake flag `-DGGML_OPENBLAS=ON` to enable it.\n```sh\ncmake -B build -DGGML_OPENBLAS=ON && cmake --build build -j\n```\n\n**cuBLAS**\n\ncuBLAS uses NVIDIA GPU to accelerate BLAS. Add the CMake flag `-DGGML_CUBLAS=ON` to enable it.\n```sh\ncmake -B build -DGGML_CUBLAS=ON && cmake --build build -j\n```\n\n**Metal**\n\nMPS (Metal Performance Shaders) allows computation to run on powerful Apple Silicon GPU. 
Add the CMake flag `-DGGML_METAL=ON` to enable it.\n```sh\ncmake -B build -DGGML_METAL=ON && cmake --build build -j\n```\n\n## Python Binding\n\nThe Python binding provides high-level `chat` and `stream_chat` interface similar to the original Hugging Face Qwen-7B.\n\n**Installation**\n\nInstall from PyPI (recommended): will trigger compilation on your platform.\n```sh\npip install -U qwen-cpp\n```\n\nYou may also install from source.\n```sh\n# install from the latest source hosted on GitHub\npip install git+https://github.com/QwenLM/qwen.cpp.git@master\n# or install from your local source after git cloning the repo\npip install .\n```\n\n## tiktoken.cpp\n\nWe provide pure C++ tiktoken implementation. After installation, the usage is the same as openai tiktoken:\n```python\nimport tiktoken_cpp as tiktoken\nenc = tiktoken.get_encoding(\"cl100k_base\")\nassert enc.decode(enc.encode(\"hello world\")) == \"hello world\"\n```\n\n**Benchmark**\n\nThe speed of tiktoken.cpp is on par with openai tiktoken:\n```python\ncd tests\nRAYON_NUM_THREADS=1 python benchmark.py\n```\n\n## Development\n\n**Unit Test**\n\nTo perform unit tests, add this CMake flag `-DQWEN_ENABLE_TESTING=ON` to enable testing. Recompile and run the unit test (including benchmark).\n```sh\nmkdir -p build && cd build\ncmake .. -DQWEN_ENABLE_TESTING=ON && make -j\n./bin/qwen_test\n```\n\n**Lint**\n\nTo format the code, run `make lint` inside the `build` folder. You should have `clang-format`, `black` and `isort` pre-installed.\n\n## Acknowledgements\n\n* This project is greatly inspired by [llama.cpp](https://github.com/ggerganov/llama.cpp), [chatglm.cpp](https://github.com/li-plus/chatglm.cpp), [ggml](https://github.com/ggerganov/ggml), [tiktoken](https://github.com/openai/tiktoken), [tokenizer](https://github.com/sewenew/tokenizer), [cpp-base64](https://github.com/ReneNyffenegger/cpp-base64), [re2](https://github.com/google/re2) and [unordered_dense](https://github.com/martinus/unordered_dense).\n",
"bugtrack_url": null,
"license": "MIT License",
"summary": "C++ implementation of qwen & tiktoken",
"version": "0.1.3",
"project_urls": {
"BugTracker": "https://github.com/QwenLM/qwen.cpp/issues",
"Homepage": "https://github.com/QwenLM/qwen.cpp",
"Repository": "https://github.com/QwenLM/qwen.cpp.git"
},
"split_keywords": [
"qwen",
"tiktoken"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "a47661e947717636072018ce25a5929af05b0e47538a564cc0c3298935b38a49",
"md5": "a8a22dced8cb75837885d2bcb56bc151",
"sha256": "e4770afc32b3f5e30e973a52bc8ff1b3f0a89097efe0130cd3fb87722fff160a"
},
"downloads": -1,
"filename": "qwen-cpp-0.1.3.tar.gz",
"has_sig": false,
"md5_digest": "a8a22dced8cb75837885d2bcb56bc151",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.7",
"size": 2933470,
"upload_time": "2023-11-17T10:17:49",
"upload_time_iso_8601": "2023-11-17T10:17:49.543678Z",
"url": "https://files.pythonhosted.org/packages/a4/76/61e947717636072018ce25a5929af05b0e47538a564cc0c3298935b38a49/qwen-cpp-0.1.3.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2023-11-17 10:17:49",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "QwenLM",
"github_project": "qwen.cpp",
"travis_ci": false,
"coveralls": false,
"github_actions": false,
"lcname": "qwen-cpp"
}