intel-extension-for-transformers

Name	intel-extension-for-transformers JSON
Version	1.4.2 JSON
	download
home_page	https://github.com/intel/intel-extension-for-transformers
Summary	Repository of Intel® Intel Extension for Transformers
upload_time	2024-05-24 09:22:06
maintainer	None
docs_url	None
author	Intel AIA/AIPC Team
requires_python	>=3.7.0
license	Apache 2.0
keywords	quantization auto-tuning post-training static quantization post-training dynamic quantization quantization-aware training tuning strategy
VCS
bugtrack_url
requirements	py-cpuinfo setuptools setuptools_scm
Travis-CI	No Travis.
coveralls test coverage	No coveralls.

            <div align="center">
  
Intel® Extension for Transformers
===========================
<h3>An Innovative Transformer-based Toolkit to Accelerate GenAI/LLM Everywhere</h3>

[![](https://dcbadge.vercel.app/api/server/Wxk3J3ZJkU?compact=true&style=flat-square)](https://discord.gg/Wxk3J3ZJkU)
[![Release Notes](https://img.shields.io/github/v/release/intel/intel-extension-for-transformers)](https://github.com/intel/intel-extension-for-transformers/releases)

[🏭Architecture](./docs/architecture.md)&nbsp;&nbsp;&nbsp;|&nbsp;&nbsp;&nbsp;[💬NeuralChat](./intel_extension_for_transformers/neural_chat)&nbsp;&nbsp;&nbsp;|&nbsp;&nbsp;&nbsp;[😃Inference on CPU](https://github.com/intel/neural-speed/tree/main)&nbsp;&nbsp;&nbsp;|&nbsp;&nbsp;&nbsp;[😃Inference  on GPU](https://github.com/intel/intel-extension-for-transformers/blob/main/docs/weightonlyquant.md#examples-for-gpu)&nbsp;&nbsp;&nbsp;|&nbsp;&nbsp;&nbsp;[💻Examples](./docs/examples.md)&nbsp;&nbsp;&nbsp;|&nbsp;&nbsp;&nbsp;[📖Documentations](https://intel.github.io/intel-extension-for-transformers/latest/docs/Welcome.html)
</div>

## 🚀Latest News
* [2024/04] Support the launch of **[Meta Llama 3](https://llama.meta.com/llama3/)**, the next generation of Llama models. Check out [Accelerate Meta* Llama 3 with Intel AI Solutions](https://www.intel.com/content/www/us/en/developer/articles/technical/accelerate-meta-llama3-with-intel-ai-solutions.html).
* [2024/04] Demonstrated the chatbot in 4th, 5th, and 6th Gen Xeon Scalable Processors in [**Intel Vision Pat's Keynote**](https://youtu.be/QB7FoIpx8os?t=2280).
* [2024/04] Supported **INT4 inference on Intel Meteor Lake**.
* [2024/04] Achieved a 1.8x performance improvement in GPT-J inference on the 5th Gen Xeon MLPerf v4.0 submission compared to v3.1. [News](https://www.intel.com/content/www/us/en/newsroom/news/new-gaudi-2-xeon-performance-ai-inference.html#gs.71ti1m), [Results](https://mlcommons.org/2024/03/mlperf-inference-v4/).
* [2024/01] Supported **INT4 inference on Intel GPUs** including Intel Data Center GPU Max Series (e.g., PVC) and Intel Arc A-Series (e.g., ARC). Check out the [examples](https://github.com/intel/intel-extension-for-transformers/blob/main/docs/weightonlyquant.md#examples-for-gpu) and [scripts](https://github.com/intel/intel-extension-for-transformers/blob/main/examples/huggingface/pytorch/text-generation/quantization/run_generation_gpu_woq.py).
* [2024/01] Demonstrated **Intel Hybrid Copilot** in **CES 2024 Great Minds** Session "[Bringing the Limitless Potential of AI Everywhere](https://youtu.be/70J3uO3eLZA?t=1348)".
* [2023/12] Supported **QLoRA on CPUs** to make fine-tuning on client CPU possible. Check out the [blog](https://medium.com/@NeuralCompressor/creating-your-own-llms-on-your-laptop-a08cc4f7c91b) and [readme](https://github.com/intel/intel-extension-for-transformers/blob/main/docs/qloracpu.md) for more details.
* [2023/11] Released **top-1 7B-sized LLM** [**NeuralChat-v3-1**](https://huggingface.co/Intel/neural-chat-7b-v3-1) and [DPO dataset](https://huggingface.co/datasets/Intel/orca_dpo_pairs). Check out the [nice video](https://www.youtube.com/watch?v=bWhZ1u_1rlc) published by [WorldofAI](https://www.youtube.com/@intheworldofai).
* [2023/11] Published a **4-bit chatbot demo** (based on NeuralChat) available on [Intel Hugging Face Space](https://huggingface.co/spaces/Intel/NeuralChat-ICX-INT4). Welcome to have a try! To setup the demo locally, please follow the [instructions](https://github.com/intel/intel-extension-for-transformers/blob/main/intel_extension_for_transformers/neural_chat/docs/notebooks/setup_text_chatbot_service_on_spr.ipynb).

---
<div align="left">

## 🏃Installation
### Quick Install from Pypi
```bash
pip install intel-extension-for-transformers
```
> For system requirements and other installation tips, please refer to [Installation Guide](./docs/installation.md)

## 🌟Introduction
Intel® Extension for Transformers is an innovative toolkit designed to accelerate GenAI/LLM everywhere with the optimal performance of Transformer-based models on various Intel platforms, including Intel Gaudi2, Intel CPU, and Intel GPU. The toolkit provides the below key features and examples:

*  Seamless user experience of model compressions on Transformer-based models by extending [Hugging Face transformers](https://github.com/huggingface/transformers) APIs and leveraging [Intel® Neural Compressor](https://github.com/intel/neural-compressor)

*  Advanced software optimizations and unique compression-aware runtime (released with NeurIPS 2022's paper [Fast Distilbert on CPUs](https://arxiv.org/abs/2211.07715) and [QuaLA-MiniLM: a Quantized Length Adaptive MiniLM](https://arxiv.org/abs/2210.17114), and NeurIPS 2021's paper [Prune Once for All: Sparse Pre-Trained Language Models](https://arxiv.org/abs/2111.05754))

*  Optimized Transformer-based model packages such as [Stable Diffusion](examples/huggingface/pytorch/text-to-image/deployment/stable_diffusion), [GPT-J-6B](examples/huggingface/pytorch/text-generation/deployment), [GPT-NEOX](examples/huggingface/pytorch/language-modeling/quantization#2-validated-model-list), [BLOOM-176B](examples/huggingface/pytorch/language-modeling/inference#BLOOM-176B), [T5](examples/huggingface/pytorch/summarization/quantization#2-validated-model-list), [Flan-T5](examples/huggingface/pytorch/summarization/quantization#2-validated-model-list), and end-to-end workflows such as [SetFit-based text classification](docs/tutorials/pytorch/text-classification/SetFit_model_compression_AGNews.ipynb) and [document level sentiment analysis (DLSA)](workflows/dlsa) 

*  [NeuralChat](intel_extension_for_transformers/neural_chat), a customizable chatbot framework to create your own chatbot within minutes by leveraging a rich set of [plugins](https://github.com/intel/intel-extension-for-transformers/blob/main/intel_extension_for_transformers/neural_chat/docs/advanced_features.md) such as [Knowledge Retrieval](./intel_extension_for_transformers/neural_chat/pipeline/plugins/retrieval/README.md), [Speech Interaction](./intel_extension_for_transformers/neural_chat/pipeline/plugins/audio/README.md), [Query Caching](./intel_extension_for_transformers/neural_chat/pipeline/plugins/caching/README.md), and [Security Guardrail](./intel_extension_for_transformers/neural_chat/pipeline/plugins/security/README.md). This framework supports Intel Gaudi2/CPU/GPU.

*  [Inference](https://github.com/intel/neural-speed/tree/main) of Large Language Model (LLM) in pure C/C++ with weight-only quantization kernels for Intel CPU and Intel GPU (TBD), supporting [GPT-NEOX](https://github.com/intel/neural-speed/tree/main/neural_speed/models/gptneox), [LLAMA](https://github.com/intel/neural-speed/tree/main/neural_speed/models/llama), [MPT](https://github.com/intel/neural-speed/tree/main/neural_speed/models/mpt), [FALCON](https://github.com/intel/neural-speed/tree/main/neural_speed/models/falcon), [BLOOM-7B](https://github.com/intel/neural-speed/tree/main/neural_speed/models/bloom), [OPT](https://github.com/intel/neural-speed/tree/main/neural_speed/models/opt), [ChatGLM2-6B](https://github.com/intel/neural-speed/tree/main/neural_speed/models/chatglm), [GPT-J-6B](https://github.com/intel/neural-speed/tree/main/neural_speed/models/gptj), and [Dolly-v2-3B](https://github.com/intel/neural-speed/tree/main/neural_speed/models/gptneox). Support AMX, VNNI, AVX512F and AVX2 instruction set. We've boosted the performance of Intel CPUs, with a particular focus on the 4th generation Intel Xeon Scalable processor, codenamed [Sapphire Rapids](https://www.intel.com/content/www/us/en/products/docs/processors/xeon-accelerated/4th-gen-xeon-scalable-processors.html).

## 🔓Validated Hardware
<table>
	<tbody>
		<tr>
			<td rowspan="2">Hardware</td>
			<td colspan="2">Fine-Tuning</td>
			<td colspan="2">Inference</td>
		</tr>
		<tr>
			<td>Full</td>
			<td>PEFT</td>
			<td>8-bit</td>
			<td>4-bit</td>
		</tr>
		<tr>
			<td>Intel Gaudi2</td>
			<td>✔</td>
			<td>✔</td>
			<td>WIP (FP8)</td>
			<td>-</td>
		</tr>
		<tr>
			<td>Intel Xeon Scalable Processors</td>
			<td>✔</td>
			<td>✔</td>
			<td>✔ (INT8, FP8)</td>
			<td>✔ (INT4, FP4, NF4)</td>
		</tr>
		<tr>
			<td>Intel Xeon CPU Max Series</td>
			<td>✔</td>
			<td>✔</td>
			<td>✔ (INT8, FP8)</td>
			<td>✔ (INT4, FP4, NF4)</td>
		</tr>
		<tr>
			<td>Intel Data Center GPU Max Series</td>
			<td>WIP </td>
			<td>WIP </td>
			<td>WIP (INT8)</td>
			<td>✔ (INT4)</td>
		</tr>
		<tr>
			<td>Intel Arc A-Series</td>
			<td>-</td>
			<td>-</td>
			<td>WIP (INT8)</td>
			<td>✔ (INT4)</td>
		</tr>
		<tr>
			<td>Intel Core Processors</td>
			<td>-</td>
			<td>✔</td>
			<td>✔ (INT8, FP8)</td>
			<td>✔ (INT4, FP4, NF4)</td>
		</tr>
	</tbody>
</table>


> In the table above, "-" means not applicable or not started yet.

## 🔓Validated Software
<table>
	<tbody>
		<tr>
			<td rowspan="2">Software</td>
			<td colspan="2">Fine-Tuning</td>
			<td colspan="2">Inference</td>
		</tr>
		<tr>
			<td>Full</td>
			<td>PEFT</td>
			<td>8-bit</td>
			<td>4-bit</td>
		</tr>
		<tr>
			<td>PyTorch</td>
			<td>2.0.1+cpu,</br> 2.0.1a0 (gpu)</td>
			<td>2.0.1+cpu,</br> 2.0.1a0 (gpu)</td>
			<td>2.1.0+cpu,</br> 2.0.1a0 (gpu)</td>
			<td>2.1.0+cpu,</br> 2.0.1a0 (gpu)</td>
		</tr>
		<tr>
			<td>Intel® Extension for PyTorch</td>
			<td>2.1.0+cpu,</br> 2.0.110+xpu</td>
			<td>2.1.0+cpu,</br> 2.0.110+xpu</td>
			<td>2.1.0+cpu,</br> 2.0.110+xpu</td>
			<td>2.1.0+cpu,</br> 2.0.110+xpu</td>
		</tr>
		<tr>
			<td>Transformers</td>
			<td>4.35.2(CPU),</br> 4.31.0 (Intel GPU)</td>
			<td>4.35.2(CPU),</br> 4.31.0 (Intel GPU)</td>
			<td>4.35.2(CPU),</br> 4.31.0 (Intel GPU)</td>
			<td>4.35.2(CPU),</br> 4.31.0 (Intel GPU)</td>
		</tr>
		<tr>
			<td>Synapse AI</td>
			<td>1.13.0</td>
			<td>1.13.0</td>
			<td>1.13.0</td>
			<td>1.13.0</td>
		</tr>
		<tr>
			<td>Gaudi2 driver</td>
			<td>1.13.0-ee32e42</td>
			<td>1.13.0-ee32e42</td>
			<td>1.13.0-ee32e42</td>
			<td>1.13.0-ee32e42</td>
		</tr>
                <tr>
                        <td>intel-level-zero-gpu</td>
                        <td>1.3.26918.50-736~22.04 </td>
                        <td>1.3.26918.50-736~22.04 </td>
                        <td>1.3.26918.50-736~22.04 </td>
                        <td>1.3.26918.50-736~22.04 </td>
                </tr>
	</tbody>
</table>

> Please refer to the detailed requirements in [CPU](intel_extension_for_transformers/neural_chat/requirements_cpu.txt), [Gaudi2](intel_extension_for_transformers/neural_chat/requirements_hpu.txt), [Intel GPU](intel_extension_for_transformers/neural_chat/requirements_xpu.txt).

## 🔓Validated OS
Ubuntu 20.04/22.04, Centos 8.

## 🌱Getting Started

### Chatbot
Below is the sample code to create your chatbot. See more [examples](intel_extension_for_transformers/neural_chat/docs/full_notebooks.md).

#### Serving (OpenAI-compatible RESTful APIs)
NeuralChat provides OpenAI-compatible RESTful APIs for chat, so you can use NeuralChat as a drop-in replacement for OpenAI APIs.
You can start NeuralChat server either using the Shell command or Python code.

```shell
# Shell Command
neuralchat_server start --config_file ./server/config/neuralchat.yaml
```

```python
# Python Code
from intel_extension_for_transformers.neural_chat import NeuralChatServerExecutor
server_executor = NeuralChatServerExecutor()
server_executor(config_file="./server/config/neuralchat.yaml", log_file="./neuralchat.log")
```

NeuralChat service can be accessible through [OpenAI client library](https://github.com/openai/openai-python), `curl` commands, and `requests` library. See more in [NeuralChat](intel_extension_for_transformers/neural_chat/README.md).

#### Offline

```python
from intel_extension_for_transformers.neural_chat import build_chatbot
chatbot = build_chatbot()
response = chatbot.predict("Tell me about Intel Xeon Scalable Processors.")
```

### Transformers-based extension APIs
Below is the sample code to use the extended Transformers APIs. See more [examples](https://github.com/intel/neural-speed/tree/main).

#### INT4 Inference (CPU)
We encourage you to install [NeuralSpeed](https://github.com/intel/neural-speed) to get the latest features (e.g., GGUF support) of LLM low-bit inference on CPUs. You may also want to use v1.3 without NeuralSpeed by following the [document](https://github.com/intel/intel-extension-for-transformers/tree/v1.3/intel_extension_for_transformers/llm/runtime/graph/README.md)

```python
from transformers import AutoTokenizer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM
model_name = "Intel/neural-chat-7b-v3-1"     
prompt = "Once upon a time, there existed a little girl,"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
inputs = tokenizer(prompt, return_tensors="pt").input_ids

model = AutoModelForCausalLM.from_pretrained(model_name, load_in_4bit=True)
outputs = model.generate(inputs)
```
You can also load GGUF format model from Huggingface, we only support Q4_0 gguf format for now.
```python
from transformers import AutoTokenizer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM

# Specify the GGUF repo on the Hugginface
model_name = "TheBloke/Llama-2-7B-Chat-GGUF"
# Download the the specific gguf model file from the above repo
gguf_file = "llama-2-7b-chat.Q4_0.gguf"
# make sure you are granted to access this model on the Huggingface.
tokenizer_name = "meta-llama/Llama-2-7b-chat-hf"
prompt = "Once upon a time, there existed a little girl,"
tokenizer = AutoTokenizer.from_pretrained(tokenizer_name, trust_remote_code=True)
inputs = tokenizer(prompt, return_tensors="pt").input_ids

model = AutoModelForCausalLM.from_pretrained(model_name, gguf_file = gguf_file)
outputs = model.generate(inputs)
```


You can also load PyTorch Model from Modelscope
>**Note**:require modelscope
```python
from transformers import TextStreamer
from modelscope import AutoTokenizer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM
model_name = "qwen/Qwen-7B"     # Modelscope model_id or local model
prompt = "Once upon a time, there existed a little girl,"

model = AutoModelForCausalLM.from_pretrained(model_name, load_in_4bit=True, model_hub="modelscope")
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
inputs = tokenizer(prompt, return_tensors="pt").input_ids
streamer = TextStreamer(tokenizer)
outputs = model.generate(inputs, streamer=streamer, max_new_tokens=300)
```

You can also load the low-bit model quantized by GPTQ/AWQ/RTN/AutoRound algorithm.
```python
from transformers import AutoTokenizer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM, GPTQConfig

# Hugging Face GPTQ/AWQ model or use local quantize model
model_name = "MODEL_NAME_OR_PATH"
prompt = "Once upon a time, a little girl"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
inputs = tokenizer(prompt, return_tensors="pt").input_ids
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)
outputs = model.generate(inputs)
```

#### INT4 Inference (GPU)
```python
import intel_extension_for_pytorch as ipex
from intel_extension_for_transformers.transformers.modeling import AutoModelForCausalLM
from transformers import AutoTokenizer
import torch

device_map = "xpu"
model_name ="Qwen/Qwen-7B"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
prompt = "Once upon a time, there existed a little girl,"
inputs = tokenizer(prompt, return_tensors="pt").input_ids.to(device_map)

model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True,
                                              device_map=device_map, load_in_4bit=True)

model = ipex.optimize_transformers(model, inplace=True, dtype=torch.float16, quantization_config=True, device=device_map)

output = model.generate(inputs)
```
> Note: Please refer to the [example](https://github.com/intel/intel-extension-for-transformers/blob/main/docs/weightonlyquant.md#examples-for-gpu) and [script](https://github.com/intel/intel-extension-for-transformers/blob/main/examples/huggingface/pytorch/text-generation/quantization/run_generation_gpu_woq.py) for more details.

### Langchain-based extension APIs
Below is the sample code to use the extended Langchain APIs. See more [examples](intel_extension_for_transformers/neural_chat/pipeline/plugins/retrieval/README.md).

```python
from langchain_community.llms.huggingface_pipeline import HuggingFacePipeline
from langchain.chains import RetrievalQA
from langchain_core.vectorstores import VectorStoreRetriever
from intel_extension_for_transformers.langchain.vectorstores import Chroma
retriever = VectorStoreRetriever(vectorstore=Chroma(...))
retrievalQA = RetrievalQA.from_llm(llm=HuggingFacePipeline(...), retriever=retriever)
```

## 🎯Validated  Models
You can access the validated models, accuracy and performance from [Release data](./docs/release_data.md) or [Medium blog](https://medium.com/@NeuralCompressor/llm-performance-of-intel-extension-for-transformers-f7d061556176).

## 📖Documentation
<table>
<thead>
  <tr>
    <th colspan="8" align="center">OVERVIEW</th>
  </tr>
</thead>
<tbody>
  <tr>
    <td colspan="4" align="center"><a href="intel_extension_for_transformers/neural_chat">NeuralChat</a></td>
    <td colspan="4" align="center"><a href="https://github.com/intel/neural-speed/tree/main">Neural Speed</a></td>
  </tr>
  <tr>
    <th colspan="8" align="center">NEURALCHAT</th>
  </tr>
  <tr>
    <td colspan="2" align="center"><a href="intel_extension_for_transformers/neural_chat/docs/notebooks/deploy_chatbot_on_spr.ipynb">Chatbot on Intel CPU</a></td>
    <td colspan="3" align="center"><a href="intel_extension_for_transformers/neural_chat/docs/notebooks/deploy_chatbot_on_xpu.ipynb">Chatbot on Intel GPU</a></td>
    <td colspan="3" align="center"><a href="intel_extension_for_transformers/neural_chat/docs/notebooks/deploy_chatbot_on_habana_gaudi.ipynb">Chatbot on Gaudi</a></td>
  </tr>
  <tr>
    <td colspan="4" align="center"><a href="intel_extension_for_transformers/neural_chat/examples/deployment/talkingbot/pc/build_talkingbot_on_pc.ipynb">Chatbot on Client</a></td>
    <td colspan="4" align="center"><a href="intel_extension_for_transformers/neural_chat/docs/full_notebooks.md">More Notebooks</a></td>
  </tr>
  <tr>
    <th colspan="8" align="center">NEURAL SPEED</th>
  </tr>
 <tr>
    <td colspan="2" align="center"><a href="https://github.com/intel/neural-speed/tree/main/README.md">Neural Speed</a></td>
    <td colspan="2" align="center"><a href="https://github.com/intel/neural-speed/tree/main/README.md#2-neural-speed-straight-forward">Streaming LLM</a></td>
    <td colspan="2" align="center"><a href="https://github.com/intel/neural-speed/tree/main/neural_speed/core#support-matrix">Low Precision Kernels</a></td>
    <td colspan="2" align="center"><a href="https://github.com/intel/neural-speed/tree/main/docs/tensor_parallelism.md">Tensor Parallelism</a></td>
  </tr>
  <tr>
    <th colspan="8" align="center">LLM COMPRESSION</th>
  </tr>
  <tr>
    <td colspan="2" align="center"><a href="docs/smoothquant.md">SmoothQuant (INT8)</a></td>
    <td colspan="3" align="center"><a href="docs/weightonlyquant.md">Weight-only Quantization (INT4/FP4/NF4/INT8)</a></td>
    <td colspan="3" align="center"><a href="docs/qloracpu.md">QLoRA on CPU</a></td>
  </tr>
  <tr>
    <th colspan="8" align="center">GENERAL COMPRESSION</th>
  <tr>
  <tr>
    <td colspan="2" align="center"><a href="docs/quantization.md">Quantization</a></td>
    <td colspan="2" align="center"><a href="docs/pruning.md">Pruning</a></td>
    <td colspan="2" align="center"><a href="docs/distillation.md">Distillation</a></td>
    <td align="center" colspan="2"><a href="examples/huggingface/pytorch/text-classification/orchestrate_optimizations/README.md">Orchestration</a></td>
  </tr>
  <tr>
    <td align="center" colspan="2"><a href="examples/huggingface/pytorch/language-modeling/nas/README.md">Neural Architecture Search</a></td>
    <td align="center" colspan="2"><a href="docs/export.md">Export</a></td>
    <td align="center" colspan="2"><a href="docs/metrics.md">Metrics</a></td>
    <td align="center" colspan="2"><a href="docs/objectives.md">Objectives</a></td>
  </tr>
  <tr>
    <td align="center" colspan="2"><a href="docs/pipeline.md">Pipeline</a></td>
    <td align="center" colspan="2"><a href="examples/huggingface/pytorch/question-answering/dynamic/README.md">Length Adaptive</a></td>
    <td align="center" colspan="2"><a href="docs/examples.md#early-exit">Early Exit</a></td>
    <td align="center" colspan="2"><a href="docs/data_augmentation.md">Data Augmentation</a></td>    
  </tr>
  <tr>
    <th colspan="8" align="center">TUTORIALS & RESULTS</a></th>
  </tr>
  <tr>
    <td colspan="2" align="center"><a href="docs/tutorials/README.md">Tutorials</a></td>
    <td colspan="2" align="center"><a href="https://github.com/intel/neural-speed/blob/main/docs/supported_models.md">LLM List</a></td>
    <td colspan="2" align="center"><a href="docs/examples.md">General Model List</a></td>
    <td colspan="2" align="center"><a href="intel_extension_for_transformers/transformers/runtime/docs/validated_model.md">Model Performance</a></td>
  </tr>
</tbody>
</table>

## 🙌Demo

* LLM Infinite Inference (up to 4M tokens)

https://github.com/intel/intel-extension-for-transformers/assets/109187816/1698dcda-c9ec-4f44-b159-f4e9d67ab15b

* LLM QLoRA on Client CPU

https://github.com/intel/intel-extension-for-transformers/assets/88082706/9d9bdb7e-65db-47bb-bbed-d23b151e8b31

## 📃Selected Publications/Events
* Blog published on Huggingface: [Building Cost-Efficient Enterprise RAG applications with Intel Gaudi 2 and Intel Xeon](https://huggingface.co/blog/cost-efficient-rag-applications-with-intel) (May 2024)
* Blog published on Intel Developer News: [Efficient Natural Language Embedding Models with Intel® Extension for Transformers](https://www.intel.com/content/www/us/en/developer/articles/technical/efficient-natural-language-embedding-models.html) (May 2024)
* Blog published on Techcrunch: [Intel and others commit to building open generative AI tools for the enterprise](https://techcrunch.com/2024/04/16/intel-and-others-commit-to-building-open-generative-ai-tools-for-the-enterprise) (Apr 2024)
* Video on YouTube: [Intel Vision Keynotes 2024](https://www.youtube.com/watch?v=QB7FoIpx8os&t=2280s) (Apr 2024)
* Blog published on Vectara: [Do Smaller Models Hallucinate More?](https://vectara.com/blog/do-smaller-models-hallucinate-more) (Apr 2024)
* Blog of Intel Developer News: [Use the neural-chat-7b Model for Advanced Fraud Detection: An AI-Driven Approach in Cybersecurity](https://www.intel.com/content/www/us/en/developer/articles/technical/bilics-approach-cybersecurity-using-neuralchat-7b.html) (March 2024)
* CES 2024: [CES 2024 Great Minds Keynote: Bringing the Limitless Potential of AI Everywhere: Intel Hybrid Copilot demo](https://youtu.be/70J3uO3eLZA?t=1348) (Jan 2024)
* Blog published on Medium: [Connect an AI agent with your API: Intel Neural-Chat 7b LLM can replace Open AI Function Calling](https://medium.com/11tensors/connect-an-ai-agent-with-your-api-intel-neural-chat-7b-llm-can-replace-open-ai-function-calling-242d771e7c79) (Dec 2023)
* NeurIPS'2023 on Efficient Natural Language and Speech Processing: [Efficient LLM Inference on CPUs](https://arxiv.org/abs/2311.00502) (Nov 2023)
* Blog published on Hugging Face: [Intel Neural-Chat 7b: Fine-Tuning on Gaudi2 for Top LLM Performance](https://huggingface.co/blog/Andyrasika/neural-chat-intel) (Nov 2023)
* Blog published on VMware: [AI without GPUs: A Technical Brief for VMware Private AI with Intel](https://core.vmware.com/resource/ai-without-gpus-technical-brief-vmware-private-ai-intel#section6) (Nov 2023)
  
> View [Full Publication List](./docs/publication.md)

## Additional Content

* [Release Information](./docs/release.md)
* [Contribution Guidelines](./docs/contributions.md)
* [Legal Information](./docs/legal.md)
* [Security Policy](SECURITY.md)
* [Apache License](./LICENSE)


## Acknowledgements
* Excellent open-source projects: [bitsandbytes](https://github.com/TimDettmers/bitsandbytes), [FastChat](https://github.com/lm-sys/FastChat), [fastRAG](https://github.com/IntelLabs/fastRAG), [ggml](https://github.com/ggerganov/ggml), [gptq](https://github.com/IST-DASLab/gptq), [llama.cpp](https://github.com/ggerganov/llama.cpp), [lm-evauation-harness](https://github.com/EleutherAI/lm-evaluation-harness), [peft](https://github.com/huggingface/peft), [trl](https://github.com/huggingface/trl), [streamingllm](https://github.com/mit-han-lab/streaming-llm) and many others.

* Thanks to all the [contributors](./docs/contributors.md).

## 💁Collaborations

Welcome to raise any interesting ideas on model compression techniques and LLM-based chatbot development! Feel free to reach [us](mailto:itrex.maintainers@intel.com), and we look forward to our collaborations on Intel Extension for Transformers!

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/intel/intel-extension-for-transformers",
    "name": "intel-extension-for-transformers",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.7.0",
    "maintainer_email": null,
    "keywords": "quantization, auto-tuning, post-training static quantization, post-training dynamic quantization, quantization-aware training, tuning strategy",
    "author": "Intel AIA/AIPC Team",
    "author_email": "feng.tian@intel.com, haihao.shen@intel.com,hanwen.chang@intel.com, penghui.cheng@intel.com",
    "download_url": "https://files.pythonhosted.org/packages/09/1d/dd28044cc9a4fb7d152aef0bbb3d78d631504609f1bfde512557daae54ba/intel_extension_for_transformers-1.4.2.tar.gz",
    "platform": null,
    "description": "<div align=\"center\">\n  \nIntel\u00ae Extension for Transformers\n===========================\n<h3>An Innovative Transformer-based Toolkit to Accelerate GenAI/LLM Everywhere</h3>\n\n[![](https://dcbadge.vercel.app/api/server/Wxk3J3ZJkU?compact=true&style=flat-square)](https://discord.gg/Wxk3J3ZJkU)\n[![Release Notes](https://img.shields.io/github/v/release/intel/intel-extension-for-transformers)](https://github.com/intel/intel-extension-for-transformers/releases)\n\n[\ud83c\udfedArchitecture](./docs/architecture.md)&nbsp;&nbsp;&nbsp;|&nbsp;&nbsp;&nbsp;[\ud83d\udcacNeuralChat](./intel_extension_for_transformers/neural_chat)&nbsp;&nbsp;&nbsp;|&nbsp;&nbsp;&nbsp;[\ud83d\ude03Inference on CPU](https://github.com/intel/neural-speed/tree/main)&nbsp;&nbsp;&nbsp;|&nbsp;&nbsp;&nbsp;[\ud83d\ude03Inference  on GPU](https://github.com/intel/intel-extension-for-transformers/blob/main/docs/weightonlyquant.md#examples-for-gpu)&nbsp;&nbsp;&nbsp;|&nbsp;&nbsp;&nbsp;[\ud83d\udcbbExamples](./docs/examples.md)&nbsp;&nbsp;&nbsp;|&nbsp;&nbsp;&nbsp;[\ud83d\udcd6Documentations](https://intel.github.io/intel-extension-for-transformers/latest/docs/Welcome.html)\n</div>\n\n## \ud83d\ude80Latest News\n* [2024/04] Support the launch of **[Meta Llama 3](https://llama.meta.com/llama3/)**, the next generation of Llama models. Check out [Accelerate Meta* Llama 3 with Intel AI Solutions](https://www.intel.com/content/www/us/en/developer/articles/technical/accelerate-meta-llama3-with-intel-ai-solutions.html).\n* [2024/04] Demonstrated the chatbot in 4th, 5th, and 6th Gen Xeon Scalable Processors in [**Intel Vision Pat's Keynote**](https://youtu.be/QB7FoIpx8os?t=2280).\n* [2024/04] Supported **INT4 inference on Intel Meteor Lake**.\n* [2024/04] Achieved a 1.8x performance improvement in GPT-J inference on the 5th Gen Xeon MLPerf v4.0 submission compared to v3.1. [News](https://www.intel.com/content/www/us/en/newsroom/news/new-gaudi-2-xeon-performance-ai-inference.html#gs.71ti1m), [Results](https://mlcommons.org/2024/03/mlperf-inference-v4/).\n* [2024/01] Supported **INT4 inference on Intel GPUs** including Intel Data Center GPU Max Series (e.g., PVC) and Intel Arc A-Series (e.g., ARC). Check out the [examples](https://github.com/intel/intel-extension-for-transformers/blob/main/docs/weightonlyquant.md#examples-for-gpu) and [scripts](https://github.com/intel/intel-extension-for-transformers/blob/main/examples/huggingface/pytorch/text-generation/quantization/run_generation_gpu_woq.py).\n* [2024/01] Demonstrated **Intel Hybrid Copilot** in **CES 2024 Great Minds** Session \"[Bringing the Limitless Potential of AI Everywhere](https://youtu.be/70J3uO3eLZA?t=1348)\".\n* [2023/12] Supported **QLoRA on CPUs** to make fine-tuning on client CPU possible. Check out the [blog](https://medium.com/@NeuralCompressor/creating-your-own-llms-on-your-laptop-a08cc4f7c91b) and [readme](https://github.com/intel/intel-extension-for-transformers/blob/main/docs/qloracpu.md) for more details.\n* [2023/11] Released **top-1 7B-sized LLM** [**NeuralChat-v3-1**](https://huggingface.co/Intel/neural-chat-7b-v3-1) and [DPO dataset](https://huggingface.co/datasets/Intel/orca_dpo_pairs). Check out the [nice video](https://www.youtube.com/watch?v=bWhZ1u_1rlc) published by [WorldofAI](https://www.youtube.com/@intheworldofai).\n* [2023/11] Published a **4-bit chatbot demo** (based on NeuralChat) available on [Intel Hugging Face Space](https://huggingface.co/spaces/Intel/NeuralChat-ICX-INT4). Welcome to have a try! To setup the demo locally, please follow the [instructions](https://github.com/intel/intel-extension-for-transformers/blob/main/intel_extension_for_transformers/neural_chat/docs/notebooks/setup_text_chatbot_service_on_spr.ipynb).\n\n---\n<div align=\"left\">\n\n## \ud83c\udfc3Installation\n### Quick Install from Pypi\n```bash\npip install intel-extension-for-transformers\n```\n> For system requirements and other installation tips, please refer to [Installation Guide](./docs/installation.md)\n\n## \ud83c\udf1fIntroduction\nIntel\u00ae Extension for Transformers is an innovative toolkit designed to accelerate GenAI/LLM everywhere with the optimal performance of Transformer-based models on various Intel platforms, including Intel Gaudi2, Intel CPU, and Intel GPU. The toolkit provides the below key features and examples:\n\n*  Seamless user experience of model compressions on Transformer-based models by extending [Hugging Face transformers](https://github.com/huggingface/transformers)\u00a0APIs and leveraging [Intel\u00ae Neural Compressor](https://github.com/intel/neural-compressor)\n\n*  Advanced software optimizations and unique compression-aware runtime (released with NeurIPS 2022's paper [Fast Distilbert on CPUs](https://arxiv.org/abs/2211.07715) and [QuaLA-MiniLM: a Quantized Length Adaptive MiniLM](https://arxiv.org/abs/2210.17114), and NeurIPS 2021's paper [Prune Once for All: Sparse Pre-Trained Language Models](https://arxiv.org/abs/2111.05754))\n\n*  Optimized Transformer-based model packages such as [Stable Diffusion](examples/huggingface/pytorch/text-to-image/deployment/stable_diffusion), [GPT-J-6B](examples/huggingface/pytorch/text-generation/deployment), [GPT-NEOX](examples/huggingface/pytorch/language-modeling/quantization#2-validated-model-list), [BLOOM-176B](examples/huggingface/pytorch/language-modeling/inference#BLOOM-176B), [T5](examples/huggingface/pytorch/summarization/quantization#2-validated-model-list), [Flan-T5](examples/huggingface/pytorch/summarization/quantization#2-validated-model-list), and end-to-end workflows such as [SetFit-based text classification](docs/tutorials/pytorch/text-classification/SetFit_model_compression_AGNews.ipynb) and [document level sentiment analysis (DLSA)](workflows/dlsa) \n\n*  [NeuralChat](intel_extension_for_transformers/neural_chat), a customizable chatbot framework to create your own chatbot within minutes by leveraging a rich set of [plugins](https://github.com/intel/intel-extension-for-transformers/blob/main/intel_extension_for_transformers/neural_chat/docs/advanced_features.md) such as [Knowledge Retrieval](./intel_extension_for_transformers/neural_chat/pipeline/plugins/retrieval/README.md), [Speech Interaction](./intel_extension_for_transformers/neural_chat/pipeline/plugins/audio/README.md), [Query Caching](./intel_extension_for_transformers/neural_chat/pipeline/plugins/caching/README.md), and [Security Guardrail](./intel_extension_for_transformers/neural_chat/pipeline/plugins/security/README.md). This framework supports Intel Gaudi2/CPU/GPU.\n\n*  [Inference](https://github.com/intel/neural-speed/tree/main) of Large Language Model (LLM) in pure C/C++ with weight-only quantization kernels for Intel CPU and Intel GPU (TBD), supporting [GPT-NEOX](https://github.com/intel/neural-speed/tree/main/neural_speed/models/gptneox), [LLAMA](https://github.com/intel/neural-speed/tree/main/neural_speed/models/llama), [MPT](https://github.com/intel/neural-speed/tree/main/neural_speed/models/mpt), [FALCON](https://github.com/intel/neural-speed/tree/main/neural_speed/models/falcon), [BLOOM-7B](https://github.com/intel/neural-speed/tree/main/neural_speed/models/bloom), [OPT](https://github.com/intel/neural-speed/tree/main/neural_speed/models/opt), [ChatGLM2-6B](https://github.com/intel/neural-speed/tree/main/neural_speed/models/chatglm), [GPT-J-6B](https://github.com/intel/neural-speed/tree/main/neural_speed/models/gptj), and [Dolly-v2-3B](https://github.com/intel/neural-speed/tree/main/neural_speed/models/gptneox). Support AMX, VNNI, AVX512F and AVX2 instruction set. We've boosted the performance of Intel CPUs, with a particular focus on the 4th generation Intel Xeon Scalable processor, codenamed [Sapphire Rapids](https://www.intel.com/content/www/us/en/products/docs/processors/xeon-accelerated/4th-gen-xeon-scalable-processors.html).\n\n## \ud83d\udd13Validated Hardware\n<table>\n\t<tbody>\n\t\t<tr>\n\t\t\t<td rowspan=\"2\">Hardware</td>\n\t\t\t<td colspan=\"2\">Fine-Tuning</td>\n\t\t\t<td colspan=\"2\">Inference</td>\n\t\t</tr>\n\t\t<tr>\n\t\t\t<td>Full</td>\n\t\t\t<td>PEFT</td>\n\t\t\t<td>8-bit</td>\n\t\t\t<td>4-bit</td>\n\t\t</tr>\n\t\t<tr>\n\t\t\t<td>Intel Gaudi2</td>\n\t\t\t<td>\u2714</td>\n\t\t\t<td>\u2714</td>\n\t\t\t<td>WIP (FP8)</td>\n\t\t\t<td>-</td>\n\t\t</tr>\n\t\t<tr>\n\t\t\t<td>Intel Xeon Scalable Processors</td>\n\t\t\t<td>\u2714</td>\n\t\t\t<td>\u2714</td>\n\t\t\t<td>\u2714 (INT8, FP8)</td>\n\t\t\t<td>\u2714 (INT4, FP4, NF4)</td>\n\t\t</tr>\n\t\t<tr>\n\t\t\t<td>Intel Xeon CPU Max Series</td>\n\t\t\t<td>\u2714</td>\n\t\t\t<td>\u2714</td>\n\t\t\t<td>\u2714 (INT8, FP8)</td>\n\t\t\t<td>\u2714 (INT4, FP4, NF4)</td>\n\t\t</tr>\n\t\t<tr>\n\t\t\t<td>Intel Data Center GPU Max Series</td>\n\t\t\t<td>WIP </td>\n\t\t\t<td>WIP </td>\n\t\t\t<td>WIP (INT8)</td>\n\t\t\t<td>\u2714 (INT4)</td>\n\t\t</tr>\n\t\t<tr>\n\t\t\t<td>Intel Arc A-Series</td>\n\t\t\t<td>-</td>\n\t\t\t<td>-</td>\n\t\t\t<td>WIP (INT8)</td>\n\t\t\t<td>\u2714 (INT4)</td>\n\t\t</tr>\n\t\t<tr>\n\t\t\t<td>Intel Core Processors</td>\n\t\t\t<td>-</td>\n\t\t\t<td>\u2714</td>\n\t\t\t<td>\u2714 (INT8, FP8)</td>\n\t\t\t<td>\u2714 (INT4, FP4, NF4)</td>\n\t\t</tr>\n\t</tbody>\n</table>\n\n\n> In the table above, \"-\" means not applicable or not started yet.\n\n## \ud83d\udd13Validated Software\n<table>\n\t<tbody>\n\t\t<tr>\n\t\t\t<td rowspan=\"2\">Software</td>\n\t\t\t<td colspan=\"2\">Fine-Tuning</td>\n\t\t\t<td colspan=\"2\">Inference</td>\n\t\t</tr>\n\t\t<tr>\n\t\t\t<td>Full</td>\n\t\t\t<td>PEFT</td>\n\t\t\t<td>8-bit</td>\n\t\t\t<td>4-bit</td>\n\t\t</tr>\n\t\t<tr>\n\t\t\t<td>PyTorch</td>\n\t\t\t<td>2.0.1+cpu,</br> 2.0.1a0 (gpu)</td>\n\t\t\t<td>2.0.1+cpu,</br> 2.0.1a0 (gpu)</td>\n\t\t\t<td>2.1.0+cpu,</br> 2.0.1a0 (gpu)</td>\n\t\t\t<td>2.1.0+cpu,</br> 2.0.1a0 (gpu)</td>\n\t\t</tr>\n\t\t<tr>\n\t\t\t<td>Intel\u00ae Extension for PyTorch</td>\n\t\t\t<td>2.1.0+cpu,</br> 2.0.110+xpu</td>\n\t\t\t<td>2.1.0+cpu,</br> 2.0.110+xpu</td>\n\t\t\t<td>2.1.0+cpu,</br> 2.0.110+xpu</td>\n\t\t\t<td>2.1.0+cpu,</br> 2.0.110+xpu</td>\n\t\t</tr>\n\t\t<tr>\n\t\t\t<td>Transformers</td>\n\t\t\t<td>4.35.2(CPU),</br> 4.31.0 (Intel GPU)</td>\n\t\t\t<td>4.35.2(CPU),</br> 4.31.0 (Intel GPU)</td>\n\t\t\t<td>4.35.2(CPU),</br> 4.31.0 (Intel GPU)</td>\n\t\t\t<td>4.35.2(CPU),</br> 4.31.0 (Intel GPU)</td>\n\t\t</tr>\n\t\t<tr>\n\t\t\t<td>Synapse AI</td>\n\t\t\t<td>1.13.0</td>\n\t\t\t<td>1.13.0</td>\n\t\t\t<td>1.13.0</td>\n\t\t\t<td>1.13.0</td>\n\t\t</tr>\n\t\t<tr>\n\t\t\t<td>Gaudi2 driver</td>\n\t\t\t<td>1.13.0-ee32e42</td>\n\t\t\t<td>1.13.0-ee32e42</td>\n\t\t\t<td>1.13.0-ee32e42</td>\n\t\t\t<td>1.13.0-ee32e42</td>\n\t\t</tr>\n                <tr>\n                        <td>intel-level-zero-gpu</td>\n                        <td>1.3.26918.50-736~22.04 </td>\n                        <td>1.3.26918.50-736~22.04 </td>\n                        <td>1.3.26918.50-736~22.04 </td>\n                        <td>1.3.26918.50-736~22.04 </td>\n                </tr>\n\t</tbody>\n</table>\n\n> Please refer to the detailed requirements in [CPU](intel_extension_for_transformers/neural_chat/requirements_cpu.txt), [Gaudi2](intel_extension_for_transformers/neural_chat/requirements_hpu.txt), [Intel GPU](intel_extension_for_transformers/neural_chat/requirements_xpu.txt).\n\n## \ud83d\udd13Validated OS\nUbuntu 20.04/22.04, Centos 8.\n\n## \ud83c\udf31Getting Started\n\n### Chatbot\nBelow is the sample code to create your chatbot. See more [examples](intel_extension_for_transformers/neural_chat/docs/full_notebooks.md).\n\n#### Serving (OpenAI-compatible RESTful APIs)\nNeuralChat provides OpenAI-compatible RESTful APIs for chat, so you can use NeuralChat as a drop-in replacement for OpenAI APIs.\nYou can start NeuralChat server either using the Shell command or Python code.\n\n```shell\n# Shell Command\nneuralchat_server start --config_file ./server/config/neuralchat.yaml\n```\n\n```python\n# Python Code\nfrom intel_extension_for_transformers.neural_chat import NeuralChatServerExecutor\nserver_executor = NeuralChatServerExecutor()\nserver_executor(config_file=\"./server/config/neuralchat.yaml\", log_file=\"./neuralchat.log\")\n```\n\nNeuralChat service can be accessible through [OpenAI client library](https://github.com/openai/openai-python), `curl` commands, and `requests` library. See more in [NeuralChat](intel_extension_for_transformers/neural_chat/README.md).\n\n#### Offline\n\n```python\nfrom intel_extension_for_transformers.neural_chat import build_chatbot\nchatbot = build_chatbot()\nresponse = chatbot.predict(\"Tell me about Intel Xeon Scalable Processors.\")\n```\n\n### Transformers-based extension APIs\nBelow is the sample code to use the extended Transformers APIs. See more [examples](https://github.com/intel/neural-speed/tree/main).\n\n#### INT4 Inference (CPU)\nWe encourage you to install [NeuralSpeed](https://github.com/intel/neural-speed) to get the latest features (e.g., GGUF support) of LLM low-bit inference on CPUs. You may also want to use v1.3 without NeuralSpeed by following the [document](https://github.com/intel/intel-extension-for-transformers/tree/v1.3/intel_extension_for_transformers/llm/runtime/graph/README.md)\n\n```python\nfrom transformers import AutoTokenizer\nfrom intel_extension_for_transformers.transformers import AutoModelForCausalLM\nmodel_name = \"Intel/neural-chat-7b-v3-1\"     \nprompt = \"Once upon a time, there existed a little girl,\"\n\ntokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)\ninputs = tokenizer(prompt, return_tensors=\"pt\").input_ids\n\nmodel = AutoModelForCausalLM.from_pretrained(model_name, load_in_4bit=True)\noutputs = model.generate(inputs)\n```\nYou can also load GGUF format model from Huggingface, we only support Q4_0 gguf format for now.\n```python\nfrom transformers import AutoTokenizer\nfrom intel_extension_for_transformers.transformers import AutoModelForCausalLM\n\n# Specify the GGUF repo on the Hugginface\nmodel_name = \"TheBloke/Llama-2-7B-Chat-GGUF\"\n# Download the the specific gguf model file from the above repo\ngguf_file = \"llama-2-7b-chat.Q4_0.gguf\"\n# make sure you are granted to access this model on the Huggingface.\ntokenizer_name = \"meta-llama/Llama-2-7b-chat-hf\"\nprompt = \"Once upon a time, there existed a little girl,\"\ntokenizer = AutoTokenizer.from_pretrained(tokenizer_name, trust_remote_code=True)\ninputs = tokenizer(prompt, return_tensors=\"pt\").input_ids\n\nmodel = AutoModelForCausalLM.from_pretrained(model_name, gguf_file = gguf_file)\noutputs = model.generate(inputs)\n```\n\n\nYou can also load PyTorch Model from Modelscope\n>**Note**:require modelscope\n```python\nfrom transformers import TextStreamer\nfrom modelscope import AutoTokenizer\nfrom intel_extension_for_transformers.transformers import AutoModelForCausalLM\nmodel_name = \"qwen/Qwen-7B\"     # Modelscope model_id or local model\nprompt = \"Once upon a time, there existed a little girl,\"\n\nmodel = AutoModelForCausalLM.from_pretrained(model_name, load_in_4bit=True, model_hub=\"modelscope\")\ntokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)\ninputs = tokenizer(prompt, return_tensors=\"pt\").input_ids\nstreamer = TextStreamer(tokenizer)\noutputs = model.generate(inputs, streamer=streamer, max_new_tokens=300)\n```\n\nYou can also load the low-bit model quantized by GPTQ/AWQ/RTN/AutoRound algorithm.\n```python\nfrom transformers import AutoTokenizer\nfrom intel_extension_for_transformers.transformers import AutoModelForCausalLM, GPTQConfig\n\n# Hugging Face GPTQ/AWQ model or use local quantize model\nmodel_name = \"MODEL_NAME_OR_PATH\"\nprompt = \"Once upon a time, a little girl\"\n\ntokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)\ninputs = tokenizer(prompt, return_tensors=\"pt\").input_ids\nmodel = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)\noutputs = model.generate(inputs)\n```\n\n#### INT4 Inference (GPU)\n```python\nimport intel_extension_for_pytorch as ipex\nfrom intel_extension_for_transformers.transformers.modeling import AutoModelForCausalLM\nfrom transformers import AutoTokenizer\nimport torch\n\ndevice_map = \"xpu\"\nmodel_name =\"Qwen/Qwen-7B\"\ntokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)\nprompt = \"Once upon a time, there existed a little girl,\"\ninputs = tokenizer(prompt, return_tensors=\"pt\").input_ids.to(device_map)\n\nmodel = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True,\n                                              device_map=device_map, load_in_4bit=True)\n\nmodel = ipex.optimize_transformers(model, inplace=True, dtype=torch.float16, quantization_config=True, device=device_map)\n\noutput = model.generate(inputs)\n```\n> Note: Please refer to the [example](https://github.com/intel/intel-extension-for-transformers/blob/main/docs/weightonlyquant.md#examples-for-gpu) and [script](https://github.com/intel/intel-extension-for-transformers/blob/main/examples/huggingface/pytorch/text-generation/quantization/run_generation_gpu_woq.py) for more details.\n\n### Langchain-based extension APIs\nBelow is the sample code to use the extended Langchain APIs. See more [examples](intel_extension_for_transformers/neural_chat/pipeline/plugins/retrieval/README.md).\n\n```python\nfrom langchain_community.llms.huggingface_pipeline import HuggingFacePipeline\nfrom langchain.chains import RetrievalQA\nfrom langchain_core.vectorstores import VectorStoreRetriever\nfrom intel_extension_for_transformers.langchain.vectorstores import Chroma\nretriever = VectorStoreRetriever(vectorstore=Chroma(...))\nretrievalQA = RetrievalQA.from_llm(llm=HuggingFacePipeline(...), retriever=retriever)\n```\n\n## \ud83c\udfafValidated  Models\nYou can access the validated models, accuracy and performance from [Release data](./docs/release_data.md) or [Medium blog](https://medium.com/@NeuralCompressor/llm-performance-of-intel-extension-for-transformers-f7d061556176).\n\n## \ud83d\udcd6Documentation\n<table>\n<thead>\n  <tr>\n    <th colspan=\"8\" align=\"center\">OVERVIEW</th>\n  </tr>\n</thead>\n<tbody>\n  <tr>\n    <td colspan=\"4\" align=\"center\"><a href=\"intel_extension_for_transformers/neural_chat\">NeuralChat</a></td>\n    <td colspan=\"4\" align=\"center\"><a href=\"https://github.com/intel/neural-speed/tree/main\">Neural Speed</a></td>\n  </tr>\n  <tr>\n    <th colspan=\"8\" align=\"center\">NEURALCHAT</th>\n  </tr>\n  <tr>\n    <td colspan=\"2\" align=\"center\"><a href=\"intel_extension_for_transformers/neural_chat/docs/notebooks/deploy_chatbot_on_spr.ipynb\">Chatbot on Intel CPU</a></td>\n    <td colspan=\"3\" align=\"center\"><a href=\"intel_extension_for_transformers/neural_chat/docs/notebooks/deploy_chatbot_on_xpu.ipynb\">Chatbot on Intel GPU</a></td>\n    <td colspan=\"3\" align=\"center\"><a href=\"intel_extension_for_transformers/neural_chat/docs/notebooks/deploy_chatbot_on_habana_gaudi.ipynb\">Chatbot on Gaudi</a></td>\n  </tr>\n  <tr>\n    <td colspan=\"4\" align=\"center\"><a href=\"intel_extension_for_transformers/neural_chat/examples/deployment/talkingbot/pc/build_talkingbot_on_pc.ipynb\">Chatbot on Client</a></td>\n    <td colspan=\"4\" align=\"center\"><a href=\"intel_extension_for_transformers/neural_chat/docs/full_notebooks.md\">More Notebooks</a></td>\n  </tr>\n  <tr>\n    <th colspan=\"8\" align=\"center\">NEURAL SPEED</th>\n  </tr>\n <tr>\n    <td colspan=\"2\" align=\"center\"><a href=\"https://github.com/intel/neural-speed/tree/main/README.md\">Neural Speed</a></td>\n    <td colspan=\"2\" align=\"center\"><a href=\"https://github.com/intel/neural-speed/tree/main/README.md#2-neural-speed-straight-forward\">Streaming LLM</a></td>\n    <td colspan=\"2\" align=\"center\"><a href=\"https://github.com/intel/neural-speed/tree/main/neural_speed/core#support-matrix\">Low Precision Kernels</a></td>\n    <td colspan=\"2\" align=\"center\"><a href=\"https://github.com/intel/neural-speed/tree/main/docs/tensor_parallelism.md\">Tensor Parallelism</a></td>\n  </tr>\n  <tr>\n    <th colspan=\"8\" align=\"center\">LLM COMPRESSION</th>\n  </tr>\n  <tr>\n    <td colspan=\"2\" align=\"center\"><a href=\"docs/smoothquant.md\">SmoothQuant (INT8)</a></td>\n    <td colspan=\"3\" align=\"center\"><a href=\"docs/weightonlyquant.md\">Weight-only Quantization (INT4/FP4/NF4/INT8)</a></td>\n    <td colspan=\"3\" align=\"center\"><a href=\"docs/qloracpu.md\">QLoRA on CPU</a></td>\n  </tr>\n  <tr>\n    <th colspan=\"8\" align=\"center\">GENERAL COMPRESSION</th>\n  <tr>\n  <tr>\n    <td colspan=\"2\" align=\"center\"><a href=\"docs/quantization.md\">Quantization</a></td>\n    <td colspan=\"2\" align=\"center\"><a href=\"docs/pruning.md\">Pruning</a></td>\n    <td colspan=\"2\" align=\"center\"><a href=\"docs/distillation.md\">Distillation</a></td>\n    <td align=\"center\" colspan=\"2\"><a href=\"examples/huggingface/pytorch/text-classification/orchestrate_optimizations/README.md\">Orchestration</a></td>\n  </tr>\n  <tr>\n    <td align=\"center\" colspan=\"2\"><a href=\"examples/huggingface/pytorch/language-modeling/nas/README.md\">Neural Architecture Search</a></td>\n    <td align=\"center\" colspan=\"2\"><a href=\"docs/export.md\">Export</a></td>\n    <td align=\"center\" colspan=\"2\"><a href=\"docs/metrics.md\">Metrics</a></td>\n    <td align=\"center\" colspan=\"2\"><a href=\"docs/objectives.md\">Objectives</a></td>\n  </tr>\n  <tr>\n    <td align=\"center\" colspan=\"2\"><a href=\"docs/pipeline.md\">Pipeline</a></td>\n    <td align=\"center\" colspan=\"2\"><a href=\"examples/huggingface/pytorch/question-answering/dynamic/README.md\">Length Adaptive</a></td>\n    <td align=\"center\" colspan=\"2\"><a href=\"docs/examples.md#early-exit\">Early Exit</a></td>\n    <td align=\"center\" colspan=\"2\"><a href=\"docs/data_augmentation.md\">Data Augmentation</a></td>    \n  </tr>\n  <tr>\n    <th colspan=\"8\" align=\"center\">TUTORIALS & RESULTS</a></th>\n  </tr>\n  <tr>\n    <td colspan=\"2\" align=\"center\"><a href=\"docs/tutorials/README.md\">Tutorials</a></td>\n    <td colspan=\"2\" align=\"center\"><a href=\"https://github.com/intel/neural-speed/blob/main/docs/supported_models.md\">LLM List</a></td>\n    <td colspan=\"2\" align=\"center\"><a href=\"docs/examples.md\">General Model List</a></td>\n    <td colspan=\"2\" align=\"center\"><a href=\"intel_extension_for_transformers/transformers/runtime/docs/validated_model.md\">Model Performance</a></td>\n  </tr>\n</tbody>\n</table>\n\n## \ud83d\ude4cDemo\n\n* LLM Infinite Inference (up to 4M tokens)\n\nhttps://github.com/intel/intel-extension-for-transformers/assets/109187816/1698dcda-c9ec-4f44-b159-f4e9d67ab15b\n\n* LLM QLoRA on Client CPU\n\nhttps://github.com/intel/intel-extension-for-transformers/assets/88082706/9d9bdb7e-65db-47bb-bbed-d23b151e8b31\n\n## \ud83d\udcc3Selected Publications/Events\n* Blog published on Huggingface: [Building Cost-Efficient Enterprise RAG applications with Intel Gaudi 2 and Intel Xeon](https://huggingface.co/blog/cost-efficient-rag-applications-with-intel) (May 2024)\n* Blog published on Intel Developer News: [Efficient Natural Language Embedding Models with Intel\u00ae Extension for Transformers](https://www.intel.com/content/www/us/en/developer/articles/technical/efficient-natural-language-embedding-models.html) (May 2024)\n* Blog published on Techcrunch: [Intel and others commit to building open generative AI tools for the enterprise](https://techcrunch.com/2024/04/16/intel-and-others-commit-to-building-open-generative-ai-tools-for-the-enterprise) (Apr 2024)\n* Video on YouTube: [Intel Vision Keynotes 2024](https://www.youtube.com/watch?v=QB7FoIpx8os&t=2280s) (Apr 2024)\n* Blog published on Vectara: [Do Smaller Models Hallucinate More?](https://vectara.com/blog/do-smaller-models-hallucinate-more) (Apr 2024)\n* Blog of Intel Developer News: [Use the neural-chat-7b Model for Advanced Fraud Detection: An AI-Driven Approach in Cybersecurity](https://www.intel.com/content/www/us/en/developer/articles/technical/bilics-approach-cybersecurity-using-neuralchat-7b.html) (March 2024)\n* CES 2024: [CES 2024 Great Minds Keynote: Bringing the Limitless Potential of AI Everywhere: Intel Hybrid Copilot demo](https://youtu.be/70J3uO3eLZA?t=1348) (Jan 2024)\n* Blog published on Medium: [Connect an AI agent with your API: Intel Neural-Chat 7b LLM can replace Open AI Function Calling](https://medium.com/11tensors/connect-an-ai-agent-with-your-api-intel-neural-chat-7b-llm-can-replace-open-ai-function-calling-242d771e7c79) (Dec 2023)\n* NeurIPS'2023 on Efficient Natural Language and Speech Processing: [Efficient LLM Inference on CPUs](https://arxiv.org/abs/2311.00502) (Nov 2023)\n* Blog published on Hugging Face: [Intel Neural-Chat 7b: Fine-Tuning on Gaudi2 for Top LLM Performance](https://huggingface.co/blog/Andyrasika/neural-chat-intel) (Nov 2023)\n* Blog published on VMware: [AI without GPUs: A Technical Brief for VMware Private AI with Intel](https://core.vmware.com/resource/ai-without-gpus-technical-brief-vmware-private-ai-intel#section6) (Nov 2023)\n  \n> View [Full Publication List](./docs/publication.md)\n\n## Additional Content\n\n* [Release Information](./docs/release.md)\n* [Contribution Guidelines](./docs/contributions.md)\n* [Legal Information](./docs/legal.md)\n* [Security Policy](SECURITY.md)\n* [Apache License](./LICENSE)\n\n\n## Acknowledgements\n* Excellent open-source projects: [bitsandbytes](https://github.com/TimDettmers/bitsandbytes), [FastChat](https://github.com/lm-sys/FastChat), [fastRAG](https://github.com/IntelLabs/fastRAG), [ggml](https://github.com/ggerganov/ggml), [gptq](https://github.com/IST-DASLab/gptq), [llama.cpp](https://github.com/ggerganov/llama.cpp), [lm-evauation-harness](https://github.com/EleutherAI/lm-evaluation-harness), [peft](https://github.com/huggingface/peft), [trl](https://github.com/huggingface/trl), [streamingllm](https://github.com/mit-han-lab/streaming-llm) and many others.\n\n* Thanks to all the [contributors](./docs/contributors.md).\n\n## \ud83d\udc81Collaborations\n\nWelcome to raise any interesting ideas on model compression techniques and LLM-based chatbot development! Feel free to reach [us](mailto:itrex.maintainers@intel.com), and we look forward to our collaborations on Intel Extension for Transformers!\n",
    "bugtrack_url": null,
    "license": "Apache 2.0",
    "summary": "Repository of Intel\u00ae Intel Extension for Transformers",
    "version": "1.4.2",
    "project_urls": {
        "Homepage": "https://github.com/intel/intel-extension-for-transformers"
    },
    "split_keywords": [
        "quantization",
        " auto-tuning",
        " post-training static quantization",
        " post-training dynamic quantization",
        " quantization-aware training",
        " tuning strategy"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "78dc1b571b4cf41070708e7aa0b2c9e3054c4c3b480c2f63517a6e9fda42ee57",
                "md5": "9641f896b26ca628aa319136bc413bd3",
                "sha256": "ef87d2d47be3316aae96a479ad73b3397ae63f38225c728050ac01fab482abc2"
            },
            "downloads": -1,
            "filename": "intel_extension_for_transformers-1.4.2-cp310-cp310-manylinux_2_28_x86_64.whl",
            "has_sig": false,
            "md5_digest": "9641f896b26ca628aa319136bc413bd3",
            "packagetype": "bdist_wheel",
            "python_version": "cp310",
            "requires_python": ">=3.7.0",
            "size": 45292941,
            "upload_time": "2024-05-24T09:21:31",
            "upload_time_iso_8601": "2024-05-24T09:21:31.887717Z",
            "url": "https://files.pythonhosted.org/packages/78/dc/1b571b4cf41070708e7aa0b2c9e3054c4c3b480c2f63517a6e9fda42ee57/intel_extension_for_transformers-1.4.2-cp310-cp310-manylinux_2_28_x86_64.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "61d023d785db0d59da3c676d16d70f5b97235cdc6d6caac0dbf5efd1ede5baba",
                "md5": "7691bd8131cedb3e2a8a3f71557b3406",
                "sha256": "1bf320fd1bc2c1642a19268dcbc4b2517292cfc89285403cf066dfe2a23d64d0"
            },
            "downloads": -1,
            "filename": "intel_extension_for_transformers-1.4.2-cp310-cp310-win_amd64.whl",
            "has_sig": false,
            "md5_digest": "7691bd8131cedb3e2a8a3f71557b3406",
            "packagetype": "bdist_wheel",
            "python_version": "cp310",
            "requires_python": ">=3.7.0",
            "size": 11038724,
            "upload_time": "2024-05-24T09:21:37",
            "upload_time_iso_8601": "2024-05-24T09:21:37.370393Z",
            "url": "https://files.pythonhosted.org/packages/61/d0/23d785db0d59da3c676d16d70f5b97235cdc6d6caac0dbf5efd1ede5baba/intel_extension_for_transformers-1.4.2-cp310-cp310-win_amd64.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "ce738ab583a1dec951684e42b71fd0058c1c9bfc7ae59c42f741d6e698bcf978",
                "md5": "80a905bf9b29b5c39b702e52dda4baba",
                "sha256": "f9f5d6f1a24a817244a2625c3486f67b181cc6279b14ab9a6a3cfe22d663bc02"
            },
            "downloads": -1,
            "filename": "intel_extension_for_transformers-1.4.2-cp311-cp311-manylinux_2_28_x86_64.whl",
            "has_sig": false,
            "md5_digest": "80a905bf9b29b5c39b702e52dda4baba",
            "packagetype": "bdist_wheel",
            "python_version": "cp311",
            "requires_python": ">=3.7.0",
            "size": 45295970,
            "upload_time": "2024-05-24T09:21:41",
            "upload_time_iso_8601": "2024-05-24T09:21:41.215295Z",
            "url": "https://files.pythonhosted.org/packages/ce/73/8ab583a1dec951684e42b71fd0058c1c9bfc7ae59c42f741d6e698bcf978/intel_extension_for_transformers-1.4.2-cp311-cp311-manylinux_2_28_x86_64.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "bf245d0289f3ed91a135af1da1548ca03caadb0b58edf254edf975eb92facc83",
                "md5": "279c18c5466b7f2464b6a3e3216d9bdc",
                "sha256": "ee72ff99be4528c6e2e600c0b5c770c3986d1d07525d9f1adf25cbd2246b3acd"
            },
            "downloads": -1,
            "filename": "intel_extension_for_transformers-1.4.2-cp311-cp311-win_amd64.whl",
            "has_sig": false,
            "md5_digest": "279c18c5466b7f2464b6a3e3216d9bdc",
            "packagetype": "bdist_wheel",
            "python_version": "cp311",
            "requires_python": ">=3.7.0",
            "size": 11041115,
            "upload_time": "2024-05-24T09:21:45",
            "upload_time_iso_8601": "2024-05-24T09:21:45.075110Z",
            "url": "https://files.pythonhosted.org/packages/bf/24/5d0289f3ed91a135af1da1548ca03caadb0b58edf254edf975eb92facc83/intel_extension_for_transformers-1.4.2-cp311-cp311-win_amd64.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "9d811053e18663de1eca0d3e185524ef197f9a6a91aabc7e0e04383a0910694c",
                "md5": "d73076d7015edacd1fd3ea75bbf4fbf6",
                "sha256": "0b56d2a3081acfa65bfba51f65eabd3d8b0c0ba8919363c58200402211903e36"
            },
            "downloads": -1,
            "filename": "intel_extension_for_transformers-1.4.2-cp38-cp38-manylinux_2_28_x86_64.whl",
            "has_sig": false,
            "md5_digest": "d73076d7015edacd1fd3ea75bbf4fbf6",
            "packagetype": "bdist_wheel",
            "python_version": "cp38",
            "requires_python": ">=3.7.0",
            "size": 45292864,
            "upload_time": "2024-05-24T09:21:48",
            "upload_time_iso_8601": "2024-05-24T09:21:48.627468Z",
            "url": "https://files.pythonhosted.org/packages/9d/81/1053e18663de1eca0d3e185524ef197f9a6a91aabc7e0e04383a0910694c/intel_extension_for_transformers-1.4.2-cp38-cp38-manylinux_2_28_x86_64.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "00f6401e202aa9c6df1bce66ef9732f397be6b14f1ff079c48cb64b99b284914",
                "md5": "5debe603790f179e1515403b5ce88f9e",
                "sha256": "165c9b4ba577ebc02d7d860f4d071c29e7f4982be53c80ff2fd5ba95ac711aaa"
            },
            "downloads": -1,
            "filename": "intel_extension_for_transformers-1.4.2-cp38-cp38-win_amd64.whl",
            "has_sig": false,
            "md5_digest": "5debe603790f179e1515403b5ce88f9e",
            "packagetype": "bdist_wheel",
            "python_version": "cp38",
            "requires_python": ">=3.7.0",
            "size": 11038703,
            "upload_time": "2024-05-24T09:21:52",
            "upload_time_iso_8601": "2024-05-24T09:21:52.680392Z",
            "url": "https://files.pythonhosted.org/packages/00/f6/401e202aa9c6df1bce66ef9732f397be6b14f1ff079c48cb64b99b284914/intel_extension_for_transformers-1.4.2-cp38-cp38-win_amd64.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "4c06492809535ee03c8e213c7510fefd8ce75931cd1f0a411cb318c251310d0e",
                "md5": "4988e63df0c6df212c3ab1116e511dab",
                "sha256": "f91eb6848d6fe6ba6cf0e1232c55022e3fbcc3656b42f490cbdaa6a5791ea2f9"
            },
            "downloads": -1,
            "filename": "intel_extension_for_transformers-1.4.2-cp39-cp39-manylinux_2_28_x86_64.whl",
            "has_sig": false,
            "md5_digest": "4988e63df0c6df212c3ab1116e511dab",
            "packagetype": "bdist_wheel",
            "python_version": "cp39",
            "requires_python": ">=3.7.0",
            "size": 45293100,
            "upload_time": "2024-05-24T09:21:57",
            "upload_time_iso_8601": "2024-05-24T09:21:57.224065Z",
            "url": "https://files.pythonhosted.org/packages/4c/06/492809535ee03c8e213c7510fefd8ce75931cd1f0a411cb318c251310d0e/intel_extension_for_transformers-1.4.2-cp39-cp39-manylinux_2_28_x86_64.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "b937d65570275174553d0a7a238d3b8f08e9ce26272d534136c762bcc02d4270",
                "md5": "2f69656f9a157a3fb67df260d5adf074",
                "sha256": "0b2437d6d7afb5c46c587410e8dcd31391ffee56aae377c7ad8dd962d4094d3a"
            },
            "downloads": -1,
            "filename": "intel_extension_for_transformers-1.4.2-cp39-cp39-win_amd64.whl",
            "has_sig": false,
            "md5_digest": "2f69656f9a157a3fb67df260d5adf074",
            "packagetype": "bdist_wheel",
            "python_version": "cp39",
            "requires_python": ">=3.7.0",
            "size": 11038859,
            "upload_time": "2024-05-24T09:22:00",
            "upload_time_iso_8601": "2024-05-24T09:22:00.582708Z",
            "url": "https://files.pythonhosted.org/packages/b9/37/d65570275174553d0a7a238d3b8f08e9ce26272d534136c762bcc02d4270/intel_extension_for_transformers-1.4.2-cp39-cp39-win_amd64.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "091ddd28044cc9a4fb7d152aef0bbb3d78d631504609f1bfde512557daae54ba",
                "md5": "20fbd4689ec2c7472697f3cf3d6fe470",
                "sha256": "946d74edec0dc55be1aa248f0f64d86aac558f782b5b33b4de47313681b48e0c"
            },
            "downloads": -1,
            "filename": "intel_extension_for_transformers-1.4.2.tar.gz",
            "has_sig": false,
            "md5_digest": "20fbd4689ec2c7472697f3cf3d6fe470",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.7.0",
            "size": 106456089,
            "upload_time": "2024-05-24T09:22:06",
            "upload_time_iso_8601": "2024-05-24T09:22:06.112649Z",
            "url": "https://files.pythonhosted.org/packages/09/1d/dd28044cc9a4fb7d152aef0bbb3d78d631504609f1bfde512557daae54ba/intel_extension_for_transformers-1.4.2.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-05-24 09:22:06",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "intel",
    "github_project": "intel-extension-for-transformers",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "requirements": [
        {
            "name": "py-cpuinfo",
            "specs": []
        },
        {
            "name": "setuptools",
            "specs": [
                [
                    "==",
                    "69.5.1"
                ]
            ]
        },
        {
            "name": "setuptools_scm",
            "specs": [
                [
                    ">=",
                    "6.2"
                ]
            ]
        }
    ],
    "lcname": "intel-extension-for-transformers"
}

Intel AIA/AIPC Team