# FMS Acceleration for Fused Operations and Kernels
This library contains fused operations and custom kernels and will be expanded over time. It currently contains the following:
1. Fused operations and kernels extracted from [unsloth](#extracted-code-from-unsloth).
- Low-Rank Adapter Fused Operations (a reference sketch follows this list)
- Fast RoPE Triton Kernels
- Fast RMS LayerNorm Triton Kernels
- Fast Cross Entropy Triton Kernels
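Conceptually, the LoRA fused operations combine the base-weight matmul (including dequantization, when the base weight is quantized) with the low-rank update in combined Triton kernels, instead of launching separate ops. The plain-PyTorch sketch below shows only the reference math being fused; the names are illustrative and this is not the Triton implementation.

```python
import torch
import torch.nn.functional as F

def lora_forward_reference(x, w_base, lora_a, lora_b, scaling, dropout_p=0.1):
    """Reference math for a LoRA forward: y = x W^T + dropout(x) A^T B^T * s.

    The fused ops compute this in fewer kernel launches and memory
    round-trips; dropout is supported (see the extraction notes below).
    """
    base = x @ w_base.t()                           # frozen base projection
    x_d = F.dropout(x, p=dropout_p, training=True)  # dropout on the adapter path
    return base + (x_d @ lora_a.t() @ lora_b.t()) * scaling
```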
## Plugins
Plugin | Description | Depends | Loading | Augmentation | Callbacks
--|--|--|--|--|--
[fast_quantized_peft](./src/fms_acceleration_foak/framework_plugin_fast_quantized_peft.py) | LoRA fused ops, fast cross-entropy, fast RMS, fast RoPE (**Disabled**) | Contains extracted code | | ✅ |
[fast_kernels](./src/fms_acceleration_foak/framework_plugin_fast_kernels.py) | Enhanced version of `fast_quantized_peft`; also works for full-FT and non-quantized PEFT | Contains extracted code | | ✅ |
### Supported DataType Settings
**Compatibility Matrix with Mixed Precision**
torch_dtype | Mixed Precision | Full-FT-FOAK | PEFT-FOAK | QPEFT-FOAK
-- | -- | -- | -- | --
FLOAT16 | - | **Compatible** | **Compatible** | ✗
FLOAT16 | FP16 | ValueError: <br>Attempting to <br>unscale FP16 gradients. <br>[See here](https://github.com/huggingface/peft/blob/main/docs/source/developer_guides/troubleshooting.md) | **Compatible** | **Compatible**
BFLOAT16 | - | **Compatible** | **Compatible** | ✗
BFLOAT16 | BF16 | **Compatible** | **Compatible** | [Less Performant](https://github.com/foundation-model-stack/fms-acceleration/issues/84)
NOTE: this table is also a good reference for supported dtype settings, even in the non-FOAK case.
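To make the FLOAT16 rows concrete, the minimal sketch below (the model name is illustrative) sets up the failing Full-FT combination: weights loaded in `torch.float16` together with `fp16` mixed precision, which makes the `Trainer`'s gradient scaler fail with `ValueError: Attempting to unscale FP16 gradients.` once training starts.

```python
import torch
from transformers import AutoModelForCausalLM, TrainingArguments

# torch_dtype column: load the weights in FLOAT16
model = AutoModelForCausalLM.from_pretrained(
    "ibm-granite/granite-3.0-2b-base",  # illustrative model only
    torch_dtype=torch.float16,
)

# Mixed Precision column: FP16 grad scaling on top of fp16 weights.
# Full fine-tuning with this combination fails during training with
# "ValueError: Attempting to unscale FP16 gradients."
args = TrainingArguments(output_dir="out", fp16=True)
```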
### Code Extracted from Unsloth
Notes on the extraction of code from [unsloth](https://github.com/unslothai/unsloth):
- While unsloth is [released under Apache 2.0](https://github.com/unslothai/unsloth/blob/main/LICENSE), comments indicating licensing exceptions are strewn throughout the code base; see [an example here](https://github.com/unslothai/unsloth/blob/ec19e61c854dcf9104386fa63fc6c4f2944d4f35/unsloth/models/llama.py#L1140-L1143).
```
it would require a commercial license if used to run on more than 4 GPUs ...
```
- These exceptions appear to be located around the trainer improvements, see [another example here](https://github.com/unslothai/unsloth/blob/ec19e61c854dcf9104386fa63fc6c4f2944d4f35/unsloth/models/llama.py#L1177-L1183).
- These exceptions first appear around the [Feb 2024 Release](https://github.com/unslothai/unsloth/commit/3e4c5a323c16bbda2c92212b790073c4e99c2a55); any code that appears in a file where such exceptions occur **is not extracted**.
- In its place we adopt a different approach: model patching, as opposed to unsloth's approach of rewriting the model. Our patching approach is novel and **completely rewritten from scratch**.
- We have also enabled dropout in the LoRA fused operations.
- All extracted code predates the Feb 2024 Release.
- In the table below we record what was extracted, and the exact commit from which it was taken.
Path | Description | Extracted From | Modifications | Date
--|--|--|--|--
[fused_ops/unsloth_lora](./src/fms_acceleration_foak/fused_ops/unsloth_lora) | QLoRA fast dequant, activation kernels | `unsloth/main` @ [1ecc0185](https://github.com/unslothai/unsloth/commit/1ecc0185a5759c7a0c95dfc96aceea5023cebdfc) | | 28 Jan 2024
[fused_ops/unsloth_lora/bnb](./src/fms_acceleration_foak/fused_ops/unsloth_lora/bnb) | BNB fast lora | `unsloth/main` @ [1ecc0185](https://github.com/unslothai/unsloth/commit/1ecc0185a5759c7a0c95dfc96aceea5023cebdfc) | `fast_lora.py` | 28 Jan 2024
[fused_ops/unsloth_lora/gptq](./src/fms_acceleration_foak/fused_ops/unsloth_lora/gptq) | GPTQ fast dequant (triton_v2) | `jeromeku/main` @ [2839d39](https://github.com/jeromeku/unsloth/commit/2839d390ef3bb318904289bfb9a7751a782c4e44) | `fast_lora.py`<br>`triton/layers.py` | 6 Feb 2024
[kernels/unsloth](./src/fms_acceleration_foak/kernels/unsloth) | Fast RMS, RoPE, CrossEnt kernels | `unsloth/main` @ [1ecc0185](https://github.com/unslothai/unsloth/commit/1ecc0185a5759c7a0c95dfc96aceea5023cebdfc) | `cross_entropy_loss.py`<br>`rms_layernorm.py` | 28 Jan 2024
### Supported Models
Model | norm | pos emb | cross-ent | fused_lora
--|--|--|--|--
`LlamaForCausalLM` | ✅ | ✅ | ✅ | ✅
`MistralForCausalLM` | ✅ | ✅ | ✅ | ✅
`MixtralForCausalLM` | ✅ | ✅ | ✅ | ✅
`GPTBigCodeForCausalLM` | ❌ | ❌ | ✅ | ❌
`GraniteForCausalLM` | ✅ | ✅ | ✅ | ✅
#### Adding Support For A New Model
Adding support for a new model is relatively easy by following an existing template; in what follows we use [GraniteForCausalLM](./src/fms_acceleration_foak/models/granite.py) as an example.
- implement a `get_mp_rules` for the new model, which returns a list of `ModelPatcherRule` (see the sketch after this list).
- the main logic to change is the set of module classes that the rules trigger on. Import the relevant module classes like so:
```python
from transformers.models.granite.modeling_granite import (
GraniteAttention,
GraniteMLP,
GraniteRMSNorm,
)
```
- replace the classes appropriately in the various `ModelPatcherRule` entries, in particular in their `ModelPatcherTrigger` portions, and name each `rule_id` appropriately.
```python
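# the trigger fires on each module that is an instance of GraniteRMSNorm,
# and the rule swaps that module's forward for the fast Triton kernel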
ModelPatcherRule(
rule_id="granite-rms",
trigger=ModelPatcherTrigger(check=GraniteRMSNorm),
forward=fast_rms_layernorm,
)
```
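Putting the steps together, a minimal sketch of the model file follows. The import paths and the `get_mp_rules` signature here are assumptions based on the layout above, and only the RMS norm rule is shown; the attention (RoPE), cross-entropy, and fused-LoRA rules follow the same pattern.

```python
# hypothetical sketch of src/fms_acceleration_foak/models/granite.py;
# import paths are assumptions, not verified against the repository
from transformers.models.granite.modeling_granite import GraniteRMSNorm

from fms_acceleration.model_patcher import (  # assumed framework import
    ModelPatcherRule,
    ModelPatcherTrigger,
)

from ..kernels.unsloth.rms_layernorm import fast_rms_layernorm  # assumed path


def get_mp_rules(base_type: str):
    # return one ModelPatcherRule per module class to be patched
    return [
        ModelPatcherRule(
            rule_id="granite-rms",
            trigger=ModelPatcherTrigger(check=GraniteRMSNorm),
            forward=fast_rms_layernorm,
        ),
    ]
```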
## Known Issues
- MixedPrecision `--fp16` or `--bf16` should be used with `fast_lora`.
- `fast_lora` has issues with FSDP v1 when used with the `peft` style of FSDP wrapping.
* This is because the adapter's forward functions are bypassed in the fused ops.
* For AutoGPTQ/QLoRA this is addressed by distributing the adapters using DDP so they will be unsharded in time for the fused ops.
- `fast_rope_embeddings` does not support `position_ids`; currently `position_ids` are ignored, which could give wrong results.