# FMS Acceleration for Fused Operations and Kernels
This library contains fused operations and custom kernels and will be expanded over time. It currently contains the following:
1. Fused operations and kernels extracted from [unsloth](#extracted-code-from-unsloth).
- Low-Rank Adapter Fused Operations (a reference sketch follows this list)
- Fast RoPE Triton Kernels
- Fast RMS LayerNorm Triton Kernels
- Fast Cross Entropy Triton Kernels
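Conceptually, the LoRA fused operations combine the base-weight matmul (including dequantization, when the base weight is quantized) with the low-rank update in combined Triton kernels, instead of launching separate ops. The plain-PyTorch sketch below shows only the reference math being fused; the names are illustrative and this is not the Triton implementation.

```python
import torch
import torch.nn.functional as F

def lora_forward_reference(x, w_base, lora_a, lora_b, scaling, dropout_p=0.1):
    """Reference math for a LoRA forward: y = x W^T + dropout(x) A^T B^T * s.

    The fused ops compute this in fewer kernel launches and memory
    round-trips; dropout is supported (see the extraction notes below).
    """
    base = x @ w_base.t()                           # frozen base projection
    x_d = F.dropout(x, p=dropout_p, training=True)  # dropout on the adapter path
    return base + (x_d @ lora_a.t() @ lora_b.t()) * scaling
```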
## Plugins
Plugin | Description | Depends | Loading | Augmentation | Callbacks
--|--|--|--|--|--
[fast_quantized_peft](./src/fms_acceleration_foak/framework_plugin_fast_quantized_peft.py) | LoRA fused ops, fast cross-entropy, fast RMS, fast RoPE (**Disabled**) | Contains extracted code | | ✅ |
[fast_kernels](./src/fms_acceleration_foak/framework_plugin_fast_kernels.py) | Enhanced version of `fast_quantized_peft`; also works for full-FT and non-quantized PEFT | Contains extracted code | | ✅ |
### Supported DataType Settings
**Compatibility Matrix with Mixed Precision**
torch_dtype | Mixed Precision | Full-FT-FOAK | PEFT-FOAK | QPEFT-FOAK
-- | -- | -- | -- | --
FLOAT16 | - | **Compatible** | **Compatible** | ✗
FLOAT16 | FP16 | ValueError: <br>Attempting to <br>unscale FP16 gradients. <br>[See here](https://github.com/huggingface/peft/blob/main/docs/source/developer_guides/troubleshooting.md) | **Compatible** | **Compatible**
BFLOAT16 | - | **Compatible** | **Compatible** | ✗
BFLOAT16 | BF16 | **Compatible** | **Compatible** | [Less Performant](https://github.com/foundation-model-stack/fms-acceleration/issues/84)
NOTE: this table is also a good reference for supported dtype settings, even in the non-FOAK case.
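To make the FLOAT16 rows concrete, the minimal sketch below (the model name is illustrative) sets up the failing Full-FT combination: weights loaded in `torch.float16` together with `fp16` mixed precision, which makes the `Trainer`'s gradient scaler fail with `ValueError: Attempting to unscale FP16 gradients.` once training starts.

```python
import torch
from transformers import AutoModelForCausalLM, TrainingArguments

# torch_dtype column: load the weights in FLOAT16
model = AutoModelForCausalLM.from_pretrained(
    "ibm-granite/granite-3.0-2b-base",  # illustrative model only
    torch_dtype=torch.float16,
)

# Mixed Precision column: FP16 grad scaling on top of fp16 weights.
# Full fine-tuning with this combination fails during training with
# "ValueError: Attempting to unscale FP16 gradients."
args = TrainingArguments(output_dir="out", fp16=True)
```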
### Code Extracted from Unsloth
Notes on the extraction of code from [unsloth](https://github.com/unslothai/unsloth):
- While unsloth is [released under Apache 2.0](https://github.com/unslothai/unsloth/blob/main/LICENSE), comments indicating licensing exceptions are strewn throughout the code base; see [an example here](https://github.com/unslothai/unsloth/blob/ec19e61c854dcf9104386fa63fc6c4f2944d4f35/unsloth/models/llama.py#L1140-L1143).
```
it would require a commercial license if used to run on more than 4 GPUs ...
```
- These exceptions appear to be located around the trainer improvements, see [another example here](https://github.com/unslothai/unsloth/blob/ec19e61c854dcf9104386fa63fc6c4f2944d4f35/unsloth/models/llama.py#L1177-L1183).
- These exceptions first appear around the [Feb 2024 Release](https://github.com/unslothai/unsloth/commit/3e4c5a323c16bbda2c92212b790073c4e99c2a55); any code that appears in a file where such exceptions occur **is not extracted**.
- In its place we adopt a different approach: model patching, as opposed to unsloth's approach of rewriting the model. Our patching approach is novel and **completely rewritten from scratch**.
- We have also enabled dropout in the LoRA fused operations.
- All extracted code predates the Feb 2024 Release.
- In the table below we record what was extracted, and the exact commit from which it was taken.
Path | Description | Extracted From | Modifications | Date
--|--|--|--|--
[fused_ops/unsloth_lora](./src/fms_acceleration_foak/fused_ops/unsloth_lora) | QLoRA fast dequant, activation kernels | `unsloth/main` @ [1ecc0185](https://github.com/unslothai/unsloth/commit/1ecc0185a5759c7a0c95dfc96aceea5023cebdfc) | | 28 Jan 2024
[fused_ops/unsloth_lora/bnb](./src/fms_acceleration_foak/fused_ops/unsloth_lora/bnb) | BNB fast lora | `unsloth/main` @ [1ecc0185](https://github.com/unslothai/unsloth/commit/1ecc0185a5759c7a0c95dfc96aceea5023cebdfc) | `fast_lora.py` | 28 Jan 2024
[fused_ops/unsloth_lora/gptq](./src/fms_acceleration_foak/fused_ops/unsloth_lora/gptq) | GPTQ fast dequant (triton_v2) | `jeromeku/main` @ [2839d39](https://github.com/jeromeku/unsloth/commit/2839d390ef3bb318904289bfb9a7751a782c4e44) | `fast_lora.py`<br>`triton/layers.py` | 6 Feb 2024
[kernels/unsloth](./src/fms_acceleration_foak/kernels/unsloth) | Fast RMS, RoPE, CrossEnt kernels | `unsloth/main` @ [1ecc0185](https://github.com/unslothai/unsloth/commit/1ecc0185a5759c7a0c95dfc96aceea5023cebdfc) | `cross_entropy_loss.py`<br>`rms_layernorm.py` | 28 Jan 2024
### Supported Models
Model | norm | pos emb | cross-ent | fused_lora
--|--|--|--|--
`LlamaForCausalLM` | ✅ | ✅ | ✅ | ✅
`MistralForCausalLM` | ✅ | ✅ | ✅ | ✅
`MixtralForCausalLM` | ✅ | ✅ | ✅ | ✅
`GPTBigCodeForCausalLM` | ❌ | ❌ | ✅ | ❌
`GraniteForCausalLM` | ✅ | ✅ | ✅ | ✅
#### Adding Support For A New Model
Adding support for a new model is relatively easy by following an existing template; in what follows we use [GraniteForCausalLM](./src/fms_acceleration_foak/models/granite.py) as an example.
- implement a `get_mp_rules` for the new model, which returns a list of `ModelPatcherRule` (see the sketch after this list).
- the main logic to change is the set of module classes that the rules trigger on. Import the relevant module classes like so:
```python
from transformers.models.granite.modeling_granite import (
GraniteAttention,
GraniteMLP,
GraniteRMSNorm,
)
```
- replace the classes appropriately in the various `ModelPatcherRule` entries, in particular in their `ModelPatcherTrigger` portions, and name each `rule_id` appropriately.
```python
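# the trigger fires on each module that is an instance of GraniteRMSNorm,
# and the rule swaps that module's forward for the fast Triton kernel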
ModelPatcherRule(
rule_id="granite-rms",
trigger=ModelPatcherTrigger(check=GraniteRMSNorm),
forward=fast_rms_layernorm,
)
```
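Putting the steps together, a minimal sketch of the model file follows. The import paths and the `get_mp_rules` signature here are assumptions based on the layout above, and only the RMS norm rule is shown; the attention (RoPE), cross-entropy, and fused-LoRA rules follow the same pattern.

```python
# hypothetical sketch of src/fms_acceleration_foak/models/granite.py;
# import paths are assumptions, not verified against the repository
from transformers.models.granite.modeling_granite import GraniteRMSNorm

from fms_acceleration.model_patcher import (  # assumed framework import
    ModelPatcherRule,
    ModelPatcherTrigger,
)

from ..kernels.unsloth.rms_layernorm import fast_rms_layernorm  # assumed path


def get_mp_rules(base_type: str):
    # return one ModelPatcherRule per module class to be patched
    return [
        ModelPatcherRule(
            rule_id="granite-rms",
            trigger=ModelPatcherTrigger(check=GraniteRMSNorm),
            forward=fast_rms_layernorm,
        ),
    ]
```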
## Known Issues
- MixedPrecision `--fp16` or `--bf16` should be used with `fast_lora`.
- `fast_lora` has issues with FSDP v1 when used with the `peft` style of FSDP wrapping.
* This is because the adapter's forward functions are bypassed in the fused ops.
* For AutoGPTQ/QLoRA this is addressed by distributing the adapters using DDP so they will be unsharded in time for the fused ops.
- `fast_rope_embeddings` does not support `position_ids`; currently `position_ids` are ignored, which could give wrong results.