pipegoose

- Name: pipegoose
- Version: 0.2.0
- Uploaded: 2023-10-24 04:35:35
- Author: xrsrke
- Requires Python: >=3.9,<4.0
- License: MIT
# 🚧 PipeGoose: Training any 🤗 `transformers` in Megatron-LM 3D parallelism and ZeRO-1 out of the box

[<img src="https://img.shields.io/badge/license-MIT-blue">](https://github.com/xrsrke/pipegoose) [![tests](https://github.com/xrsrke/pipegoose/actions/workflows/tests.yaml/badge.svg)](https://github.com/xrsrke/pipegoose/actions/workflows/tests.yaml) [<img src="https://img.shields.io/discord/767863440248143916?label=discord">](https://discord.gg/s9ZS9VXZ3p) [![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black) [<img alt="Codecov" src="https://img.shields.io/codecov/c/github/xrsrke/pipegoose">](https://app.codecov.io/gh/xrsrke/pipegoose) [![Imports: isort](https://img.shields.io/badge/%20imports-isort-%231674b1?style=flat&labelColor=ef8336)](https://pycqa.github.io/isort/) [![Twitter](https://img.shields.io/twitter/url/https/twitter.com/cloudposse.svg?style=social&label=Follow%20%40xariusrke)](https://twitter.com/xariusrke)

![pipeline](3d-parallelism.png)

<!-- [![docs](https://img.shields.io/github/deployments/Production?label=docs&logo=vercel)](https://docs.dev/) -->


Honk honk honk! This project is actively under development. Check out my learning progress [here](https://twitter.com/xariusrke/status/1667999818554413057).

⚠️ **The project is actively under development and not ready for use.**

⚠️ **The APIs are still a work in progress and could change at any time. None of the public APIs are set in stone until we hit version 0.6.9.**

⚠️ **Support for hybrid 3D parallelism and a distributed optimizer for 🤗 `transformers` will be available in the upcoming weeks (the core implementation is done, but it doesn't support 🤗 `transformers` yet).**

⚠️ **This library currently underperforms Megatron-LM and DeepSpeed (it does not yet reach reasonable throughput). We're actively pushing it to reach 180 TFLOPs and go beyond Megatron-LM.**


```diff
from torch.utils.data import DataLoader
+ from torch.utils.data.distributed import DistributedSampler
from torch.optim import SGD
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset

+ from pipegoose.distributed import ParallelContext, ParallelMode
+ from pipegoose.nn import DataParallel, TensorParallel

model = AutoModelForCausalLM.from_pretrained("bigscience/bloom-560m")
tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-560m")
tokenizer.pad_token = tokenizer.eos_token

BATCH_SIZE = 4
+ DATA_PARALLEL_SIZE = 2
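+ # tensor (2) x data (2) x pipeline (1) = 4 processes,
+ # matching torchrun --nproc-per-node=4 below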
+ parallel_context = ParallelContext.from_torch(
+    tensor_parallel_size=2,
+    data_parallel_size=2,
+    pipeline_parallel_size=1
+ )
+ model = TensorParallel(model, parallel_context).parallelize()
+ model = DataParallel(model, parallel_context).parallelize()
model.to("cuda")
+ device = next(model.parameters()).device

optim = SGD(model.parameters(), lr=1e-3)

dataset = load_dataset("imdb", split="train")
+ dp_rank = parallel_context.get_local_rank(ParallelMode.DATA)
+ sampler = DistributedSampler(dataset, num_replicas=DATA_PARALLEL_SIZE, rank=dp_rank, seed=42)
+ dataloader = DataLoader(dataset, batch_size=BATCH_SIZE // DATA_PARALLEL_SIZE, shuffle=False, sampler=sampler)

for epoch in range(100):
+    sampler.set_epoch(epoch)

    for batch in dataloader:
        inputs = tokenizer(batch["text"], padding=True, truncation=True, max_length=1024, return_tensors="pt")
        inputs = {name: tensor.to(device) for name, tensor in inputs.items()}
        labels = inputs["input_ids"]

        outputs = model(**inputs, labels=labels)

        optim.zero_grad()
        outputs.loss.backward()
        optim.step()
```

**Installation and try it out**

You can install the package with the following command:

```bash
pip install pipegoose
```

Then try out the hybrid tensor and data parallelism training script. The quickstart above uses `tensor_parallel_size=2` and `data_parallel_size=2`, so it needs four processes:

```bash
cd pipegoose/examples
torchrun --standalone --nnodes=1 --nproc-per-node=4 hybrid_parallelism.py
```

We did a small-scale correctness test by comparing the training losses of a parallelized `transformers` model against a default, unparallelized one, starting from an identical checkpoint and identical training data (a minimal sketch of this kind of check follows the list below). We will conduct rigorous large-scale convergence and weak-scaling benchmarks against Megatron-LM and DeepSpeed in the near future.
- Data Parallelism [link]
- Tensor Parallelism
- Hybrid 2D Parallelism
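
For illustration, here is a minimal sketch of what such a loss-comparison check can look like. The helper name, batch format, and tolerances are ours for illustration and are not part of pipegoose:

```python
import torch

def compare_losses(baseline_model, parallel_model, batches, rtol=1e-2, atol=1e-3):
    """Feed identical batches to both models and check that the losses match.

    Both models are assumed to start from the same checkpoint and see the
    batches in the same order; the tolerances are illustrative.
    """
    for step, batch in enumerate(batches):
        baseline_loss = baseline_model(**batch, labels=batch["input_ids"]).loss
        parallel_loss = parallel_model(**batch, labels=batch["input_ids"]).loss
        # With tensor parallelism every rank computes the same loss, so a
        # single-rank comparison against the baseline is meaningful.
        torch.testing.assert_close(parallel_loss, baseline_loss, rtol=rtol, atol=atol)
        print(f"step {step}: baseline={baseline_loss.item():.4f}, parallel={parallel_loss.item():.4f}")
```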

**Features**
- Megatron-style 3D parallelism
- ZeRO-1: Distributed BF16 Optimizer (a toy sketch of the idea follows this list)
- Highly optimized CUDA kernels ported from Megatron-LM and DeepSpeed
- ...
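
For intuition only, here is a toy sketch of the ZeRO-1 idea in plain `torch.distributed`: optimizer states are sharded across data-parallel ranks, each rank updates only the parameters it owns, and the updated values are then broadcast to everyone. The round-robin sharding and helper names are our assumptions; this is not pipegoose's actual implementation:

```python
import torch
import torch.distributed as dist

def shard_optimizer(params, rank, world_size, lr=1e-3):
    """Build an optimizer over only the parameters this rank owns
    (toy round-robin sharding), so optimizer state is split world_size ways."""
    owned = [p for i, p in enumerate(params) if i % world_size == rank]
    return torch.optim.SGD(owned, lr=lr)

def zero1_step(params, optim, world_size):
    """One ZeRO-1-style update. Assumes gradients were already all-reduced
    across data-parallel ranks (e.g., by a DDP-style hook), and that global
    ranks coincide with data-parallel ranks in this toy setup."""
    optim.step()        # update only the locally owned shard
    optim.zero_grad()
    # "all-gather" the update: each owner broadcasts its parameters
    for i, p in enumerate(params):
        dist.broadcast(p.data, src=i % world_size)
```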

**Implementation Details**

- Supports training `transformers` models with Megatron-style 3D parallelism and ZeRO-1, written from scratch.
- Implements parallel compute and data transfer using separate CUDA streams (a toy sketch of the pattern follows this list).
- Gradient checkpointing will be implemented by enforcing a virtual dependency in the backpropagation graph, ensuring that the activations of each (micro-batch, partition) pair are recomputed just in time.
- Custom algorithms for model partitioning, with two default partitioning policies based on elapsed time and GPU memory consumption per layer.
- Potential support includes:
    - Callbacks within the pipeline: `Callback(function, microbatch_idx, partition_idx)` for before and after the forward, backward, and recompute steps (for gradient checkpointing).
    - Mixed precision training.
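
To make the separate-streams idea concrete, here is a toy compute/transfer overlap sketch in plain PyTorch. `model` and `batches` are placeholders, and pipegoose's internal scheduling is more involved than this:

```python
import torch

def train_with_prefetch(model, batches):
    """Overlap host-to-device copies with compute using a side CUDA stream.

    `batches` is assumed to be an iterable of dicts of pinned CPU tensors.
    """
    copy_stream = torch.cuda.Stream()

    def prefetch(cpu_batch):
        # issue the copy on a side stream so it can overlap with compute
        with torch.cuda.stream(copy_stream):
            return {k: v.to("cuda", non_blocking=True) for k, v in cpu_batch.items()}

    it = iter(batches)
    next_batch = prefetch(next(it))
    for cpu_batch in it:
        # make sure the copy of the batch we are about to use has finished
        torch.cuda.current_stream().wait_stream(copy_stream)
        gpu_batch, next_batch = next_batch, prefetch(cpu_batch)
        loss = model(**gpu_batch).loss   # compute overlaps with the next copy
        loss.backward()
    # drain the last prefetched batch
    torch.cuda.current_stream().wait_stream(copy_stream)
    model(**next_batch).loss.backward()
```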

**Appreciation**

- Big thanks to 🤗 [Hugging Face](https://huggingface.co/) for sponsoring this project with 8x A100 GPUs for testing, and to [Zach Schrier](https://twitter.com/zach_schrier) for monthly Twitch donations.

- The library's APIs are inspired by [OSLO](https://github.com/EleutherAI/oslo)'s and [ColossalAI](https://github.com/hpcaitech/ColossalAI)'s APIs.

            
