# FMS Acceleration for Fused Operations and Kernels
This library contains fused operations and custom kernels, to be expanded over time. Currently it contains the following:
1. Fused operations and kernels extracted from [unsloth](#code-extracted-from-unsloth).
- Low-Rank Adapter Fused Operations
- Fast RoPE Triton Kernels
- Fast RMS LayerNorm Triton Kernels
- Fast Cross Entropy Triton Kernels
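To make the last item concrete: a fused cross-entropy kernel computes the max, the log-sum-exp, and the loss for the target class in a single pass over the vocabulary, instead of materializing a full softmax tensor first. Below is a minimal pure-Python sketch of the math such a kernel computes (not the Triton implementation itself):

```python
import math

def cross_entropy(logits, target):
    """Numerically stable cross-entropy for one row of logits.

    loss = logsumexp(logits) - logits[target] = -log softmax(logits)[target]
    """
    m = max(logits)  # subtract the max so exp() cannot overflow
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return lse - logits[target]

loss = cross_entropy([2.0, 1.0, 0.1], target=0)
```

The fused kernel performs the same reduction per row on-GPU, saving the memory traffic of writing out the intermediate softmax.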
## Plugins
Plugin | Description | Depends | Loading | Augmentation | Callbacks
--|--|--|--|--|--
[fast_quantized_peft](./src/fms_accelerate_foak/framework_plugin_fast_quantized_peft.py) | LoRA fused ops, fast cross-entropy, fast RMS norm, fast RoPE | Contains extracted code | | ✅ |
[fast_kernels](./src/fms_accelerate_foak/framework_plugin_fast_kernels.py) | Enhanced version of `fast_quantized_peft` that also works for full fine-tuning and non-quantized PEFT | Contains extracted code | | ✅ |
### Supported DataType Settings
**Compatibility Matrix with Mixed Precision**
torch_dtype | Mixed Precision | Full-FT-FOAK | PEFT-FOAK | QPEFT-FOAK
-- | -- | -- | -- | --
FLOAT16 | - | ✗ Not Allowed | ✗| ✗
FLOAT16 | FP16 | ValueError: <br>Attempting to <br>unscale FP16 gradients. <br>[See here](https://github.com/huggingface/peft/blob/main/docs/source/developer_guides/troubleshooting.md) | **Compatible** | **Compatible**
BFLOAT16 | - | ✗ | ✗ | ✗
BFLOAT16 | BF16 | **Compatible** | **Compatible** | [Less Performant](https://github.com/foundation-model-stack/fms-acceleration/issues/84)
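For example, the "Compatible" BFLOAT16 row corresponds to loading weights in bf16 and training with BF16 mixed precision. The flag names below assume an HF `TrainingArguments`-style CLI such as fms-hf-tuning's `sft_trainer` entrypoint; adjust for your launcher:

```shell
# BF16 weights + BF16 mixed precision: the "Compatible" row above
python -m tuning.sft_trainer \
    --torch_dtype bfloat16 \
    --bf16 True \
    --model_name_or_path <model>
# plus the usual model/data/peft arguments
```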
### Code Extracted from Unsloth
Notes on the extraction of code from [unsloth](https://github.com/unslothai/unsloth):
- While unsloth is [released under Apache 2.0](https://github.com/unslothai/unsloth/blob/main/LICENSE), there are comments indicating some exceptions strewn throughout the code base, see [an example here](https://github.com/unslothai/unsloth/blob/ec19e61c854dcf9104386fa63fc6c4f2944d4f35/unsloth/models/llama.py#L1140-L1143).
```
it would require a commercial license if used to run on more than 4 GPUs ...
```
- These exceptions appear to be located around the trainer improvements, see [another example here](https://github.com/unslothai/unsloth/blob/ec19e61c854dcf9104386fa63fc6c4f2944d4f35/unsloth/models/llama.py#L1177-L1183).
- These exceptions appear around [Feb 2024 Release](https://github.com/unslothai/unsloth/commit/3e4c5a323c16bbda2c92212b790073c4e99c2a55); any code that appears in any file where such exceptions occur **is not extracted**.
- Instead, we adopt a model-patching approach, as opposed to unsloth's approach of rewriting the model. Our approach is novel and **completely rewritten from scratch**.
- We have also enabled dropout on the LoRA fused operations.
- All extracted code appears before the Feb 2024 Release.
- In the table below we record what was extracted, and the exact commit from which it was taken.
Path | Description | Extracted From | Modifications | Date
--|--|--|--|--
[fused_ops/unsloth_lora](./src/fms_acceleration_foak/fused_ops/unsloth_lora) | QLoRA fast dequant, activation kernels | `unsloth/main` @ [1ecc0185](https://github.com/unslothai/unsloth/commit/1ecc0185a5759c7a0c95dfc96aceea5023cebdfc) | | 28 Jan 2024
[fused_ops/unsloth_lora/bnb](./src/fms_acceleration_foak/fused_ops/unsloth_lora/bnb) | BNB fast lora | `unsloth/main` @ [1ecc0185](https://github.com/unslothai/unsloth/commit/1ecc0185a5759c7a0c95dfc96aceea5023cebdfc) | `fast_lora.py` | 28 Jan 2024
[fused_ops/unsloth_lora/gptq](./src/fms_acceleration_foak/fused_ops/unsloth_lora/gptq) | GPTQ fast dequant (triton_v2) | `jeromeku/main` @ [2839d39](https://github.com/jeromeku/unsloth/commit/2839d390ef3bb318904289bfb9a7751a782c4e44) | `fast_lora.py`<br>`triton/layers.py` | 6 Feb 2024
[kernels/unsloth](./src/fms_acceleration_foak/kernels/unsloth) | Fast RMS, RoPE, CrossEnt kernels | `unsloth/main` @ [1ecc0185](https://github.com/unslothai/unsloth/commit/1ecc0185a5759c7a0c95dfc96aceea5023cebdfc) | `cross_entropy_loss.py`<br>`rms_layernorm.py` | 28 Jan 2024
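The LoRA dropout support mentioned above can be sketched in a few lines. This is an illustrative pure-Python model of the computation, not the actual fused Triton/autograd implementation; `matvec`, `lora_forward`, and the argument names are ours:

```python
import random

def matvec(M, x):
    """Plain matrix-vector product over nested lists."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]

def lora_forward(x, W, A, B, scaling, p=0.1, training=True):
    """y = W x + scaling * B (A dropout(x)).

    Dropout is applied only on the adapter path; the base path W x
    always sees the original input.
    """
    base = matvec(W, x)
    if training and p > 0:
        keep = 1.0 - p
        # inverted dropout: zero with prob p, rescale survivors by 1/keep
        x = [v / keep if random.random() < keep else 0.0 for v in x]
    adapter = matvec(B, matvec(A, x))  # A: r x d_in, B: d_out x r
    return [b + scaling * a for b, a in zip(base, adapter)]
```

With dropout disabled (`training=False`) this reduces to the standard LoRA forward `W x + scaling * B A x`.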
### Supported Models
Model | norm | pos emb | cross-ent | fused_lora
--|--|--|--|--
`LlamaForCausalLM` | ✅ | ✅ | ✅ | ✅
`MistralForCausalLM` | ✅ | ✅ | ✅ | ✅
`MixtralForCausalLM` | ✅ | ✅ | ✅ | ✅
`GPTBigCodeForCausalLM` | ❌ | ❌ | ✅ | ❌
<!-- `GraniteForCausalLM` | ✅ | ✅ | ✅ | ✅ -->
## Known Issues
- Mixed precision (`--fp16` or `--bf16`) should be used with `fast_lora`.
- `fast_lora` has issues with FSDP v1 when using the `peft` style of FSDP wrapping.
    * This is because the adapter's forward functions are bypassed in the fused ops.
    * For AutoGPTQ/QLoRA this is addressed by distributing the adapters using DDP, so they are unsharded in time for the fused ops.
- `fast_rope_embeddings` does not support `position_ids`; they are currently ignored, which can produce wrong results.
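Why ignoring `position_ids` matters: RoPE rotates each (even, odd) feature pair by an angle proportional to the token's position, so a kernel that assumes positions `0..n-1` produces different rotations than the correct ones whenever positions are non-contiguous (e.g. packed or left-padded batches). A toy single-pair sketch of our own making:

```python
import math

def rope_pair(x0, x1, pos, inv_freq=1.0):
    """Rotate one (even, odd) feature pair by pos * inv_freq radians."""
    c, s = math.cos(pos * inv_freq), math.sin(pos * inv_freq)
    return (x0 * c - x1 * s, x0 * s + x1 * c)

seq = [(1.0, 0.0), (1.0, 0.0), (1.0, 0.0)]

# A kernel that ignores position_ids assumes positions 0, 1, 2:
default = [rope_pair(x0, x1, i) for i, (x0, x1) in enumerate(seq)]

# With non-contiguous position_ids the correct rotations differ:
position_ids = [5, 6, 7]
correct = [rope_pair(x0, x1, p) for (x0, x1), p in zip(seq, position_ids)]
```

Here `default` and `correct` disagree at every token, which is exactly the silent error mode described above.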