ditty

Name: ditty
Version: 0.5.0
Home page: https://github.com/iantbutler01/ditty
Summary: None
Upload time: 2024-05-02 15:42:47
Maintainer: None
Docs URL: None
Author: Ian T Butler (KinglyCrow)
Requires Python: None
License: Apache V2
Keywords: finetuning, llm, nlp, machine learning
Requirements: No requirements were recorded.
Travis-CI: No Travis.
Coveralls test coverage: No coveralls.

# Ditty

A simple fine-tune.

## What
A very simple library for finetuning Hugging Face pretrained `AutoModelForCausalLM` models such as GPTNeoX, leveraging Hugging Face Accelerate, Transformers, Datasets and PEFT.

Ditty supports LoRA, 8-bit, and fp32 CPU offloading out of the box. By default it assumes you are running on a single GPU or distributed across multiple GPUs.

Checkpointing is supported. There is currently a bug with pushing to the HF model hub, so all checkpoints are local.

FP16 and BFLOAT16 are now supported.

QLoRA 4-bit is supported as an experimental feature and requires installing the development branches of accelerate, peft and transformers, plus the latest bitsandbytes.


## What Not
- Ditty does not support ASICs like TPU or Trainium.
- Ditty does not handle SageMaker.
- Ditty does not run on the CPU by default.
- Ditty does not handle evaluation sets or benchmarking. This may or may not change.

## Soon
- Ditty may handle distributed cluster finetuning
- Ditty will support DeepSpeed

## Classes

### Pipeline

Pipeline is responsible for running the entire show. Subclass `Pipeline` and implement the `dataset` method for your custom data; it must return a `torch.utils.data.DataLoader`.

Instantiate with your chosen config and then simply call `run`.

### Trainer

Trainer does what its name implies, which is to train the model. You may never need to touch this unless you want to customize the training loop.

### Data

Data wraps an HF Dataset and can configure length-grouped sampling and random sampling. It also handles collation, batching, seeds, removing unused columns and a few other things.
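
To make "length-grouped sampling" concrete, here is a minimal pure-Python sketch of the general idea: shuffle the indices, sort each large chunk by length, then cut it into batches so that each batch holds examples of similar length and needs less padding. This is not ditty's implementation; the function name `length_grouped_batches` and the `mega_batch_mult` parameter are assumptions made for illustration, similar in spirit to the length-grouped sampler in Hugging Face Transformers.

```python
import random
from typing import List, Sequence


def length_grouped_batches(
    lengths: Sequence[int],
    batch_size: int,
    seed: int = 0,
    mega_batch_mult: int = 50,  # hypothetical knob: chunk size in units of batches
) -> List[List[int]]:
    """Return batches of dataset indices whose examples have similar lengths."""
    rng = random.Random(seed)
    indices = list(range(len(lengths)))
    rng.shuffle(indices)  # the seed controls which examples share a chunk

    mega_size = batch_size * mega_batch_mult
    batches: List[List[int]] = []
    for start in range(0, len(indices), mega_size):
        # Sort each shuffled chunk longest-first so neighbours have similar lengths.
        chunk = sorted(
            indices[start:start + mega_size],
            key=lambda i: lengths[i],
            reverse=True,
        )
        batches.extend(
            chunk[i:i + batch_size] for i in range(0, len(chunk), batch_size)
        )
    return batches


# Token counts for eight toy examples.
lengths = [5, 120, 7, 118, 6, 121, 64, 60]
batches = length_grouped_batches(lengths, batch_size=2, seed=42)
print(batches)  # [[5, 1], [3, 6], [7, 2], [4, 0]]
```

With only eight examples everything falls into one chunk, so the long examples (indices 5, 1, 3) are batched together and the short ones (2, 4, 0) end up at the tail.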

The primary way to use this class is through the `prepare` method, which takes a list of operations to perform against the dataset. These are normal operations like `map` and `filter`.

Example:
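```python
# Fragment: it uses `self`, so it is assumed to live inside a method,
# for example the `dataset` method of a `Pipeline` subclass.
data = Data(
    load_kwargs={"path": self.dataset_name, "name": self.dataset_language},
    tokenizer=self.tokenizer,
    seed=self.seed,
    batch_size=self.batch_size,
    grad_accum=self.grad_accum,
)

# ... (remaining setup elided)

# Each entry is a tuple of (operation name, function, keyword arguments).
# `filter_longer`, `do_something`, `truncate` and `columns` are user-defined.
dataloader = data.prepare(
    [
        ("filter", filter_longer, {}),
        ("map", do_something, dict(batched=True, remove_columns=columns)),
        ("map", truncate, {}),
    ]
)
```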

This can be used to great effect when overriding the `dataset` method in a subclass of `Pipeline`.
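
The operation list above can be read as a small dispatch table: each tuple names a dataset method, the function to pass to it, and extra keyword arguments. The sketch below shows one way such a list could be interpreted. It is an assumption about the pattern, not ditty's actual code; `TinyDataset`, `apply_operations`, `short_enough` and `add_length` are hypothetical names, and `TinyDataset` is only a stand-in so the snippet runs without the `datasets` library.

```python
from typing import Any, Callable, Dict, List, Tuple

# (method name, function, keyword arguments for that method)
Operation = Tuple[str, Callable[..., Any], Dict[str, Any]]


class TinyDataset:
    """Stand-in with `map` and `filter`, loosely mirroring an HF Dataset."""

    def __init__(self, rows: List[Dict[str, Any]]):
        self.rows = rows

    def map(self, fn: Callable, **options: Any) -> "TinyDataset":
        # Options such as `batched` are accepted but ignored in this stand-in.
        return TinyDataset([fn(row) for row in self.rows])

    def filter(self, fn: Callable, **options: Any) -> "TinyDataset":
        return TinyDataset([row for row in self.rows if fn(row)])


def apply_operations(dataset: Any, operations: List[Operation]) -> Any:
    """Apply each (name, fn, kwargs) entry in order, threading the dataset through."""
    for name, fn, kwargs in operations:
        dataset = getattr(dataset, name)(fn, **kwargs)
    return dataset


def short_enough(row: Dict[str, Any]) -> bool:
    return len(row["text"]) <= 12


def add_length(row: Dict[str, Any]) -> Dict[str, Any]:
    return {**row, "length": len(row["text"])}


ds = TinyDataset(
    [{"text": "hello"}, {"text": "a much longer example"}, {"text": "ditty"}]
)
result = apply_operations(
    ds,
    [
        ("filter", short_enough, {}),
        ("map", add_length, {}),
    ],
)
print(result.rows)
```

Order matters here: filtering first means the later `map` calls only touch rows that survive, which is why the README example filters long samples before mapping.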

## Setup

```
pip install ditty
```


## Tips

https://github.com/google/python-fire is a tool for autogenerating CLIs from Python functions, dicts and objects.

It can be combined with Pipeline to make a very quick CLI for launching your process.
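
A minimal sketch of that combination, assuming python-fire is installed. `launch`, its parameters and the returned settings dict are hypothetical and not part of ditty; a real script would hand these values to its own `Pipeline` subclass and call `run`.

```python
from typing import Any, Dict


def launch(
    model_name: str,
    batch_size: int = 8,
    epochs: int = 1,
    seed: int = 42,
) -> Dict[str, Any]:
    """Collect run settings from the command line (all names are illustrative)."""
    settings = {
        "model_name": model_name,
        "batch_size": batch_size,
        "epochs": epochs,
        "seed": seed,
    }
    # In a real script: build your Pipeline subclass from `settings`
    # (however its constructor expects them) and call its `run` method.
    return settings


def cli() -> None:
    # Imported lazily so this module still imports without python-fire installed.
    import fire

    # Fire turns the function's parameters into command-line flags.
    fire.Fire(launch)


# In a real script, finish with:
#     if __name__ == "__main__":
#         cli()
```

Saved as a script with the entry point enabled, this could be started with something like `python train.py --model_name=my-model --batch_size=4`, where `train.py` and the flag values are placeholders.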

## Attribution / Statement of Changes

Portions of this library look to Hugging Face's transformers Trainer class as a reference and in some cases re-implement functions from Trainer, simplified to account only for GPU-based work and an overall narrower supported scope.

This statement is here both to fulfill the obligations of the Apache V2 license and because those folks do super cool work. I appreciate all they've done for the community, and it's just right to call this out.

## License

Apache V2; see the LICENSE file for the full text.

            

Raw data

{
    "_id": null,
    "home_page": "https://github.com/iantbutler01/ditty",
    "name": "ditty",
    "maintainer": null,
    "docs_url": null,
    "requires_python": null,
    "maintainer_email": null,
    "keywords": "finetuning, llm, nlp, machine learning",
    "author": "Ian T Butler (KinglyCrow)",
    "author_email": "iantbutler01@gmail.com",
    "download_url": "https://files.pythonhosted.org/packages/da/60/60f8d713984af59aafe3013e554969c66a123c128aff581996b2c18f9265/ditty-0.5.0.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "Apache V2",
    "summary": null,
    "version": "0.5.0",
    "project_urls": {
        "Homepage": "https://github.com/iantbutler01/ditty"
    },
    "split_keywords": [
        "finetuning",
        " llm",
        " nlp",
        " machine learning"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "99aafe084269cd8ffe1c306a4f018905bf755079ff9d76a13339e5219553b83b",
                "md5": "48f4c48409d5c73d592af6a2fccc6f26",
                "sha256": "e5750ae60ca17d36097f64732382f6e4dd52596345957bd2ae55c993ccf7754b"
            },
            "downloads": -1,
            "filename": "ditty-0.5.0-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "48f4c48409d5c73d592af6a2fccc6f26",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": null,
            "size": 14030,
            "upload_time": "2024-05-02T15:42:46",
            "upload_time_iso_8601": "2024-05-02T15:42:46.202460Z",
            "url": "https://files.pythonhosted.org/packages/99/aa/fe084269cd8ffe1c306a4f018905bf755079ff9d76a13339e5219553b83b/ditty-0.5.0-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "da6060f8d713984af59aafe3013e554969c66a123c128aff581996b2c18f9265",
                "md5": "51ccdf095d23d51b70b603d137ea87f1",
                "sha256": "afc739c14fa2b7123ad54e6367e5eee83b0aa09f7d9dbaf24423aeadebfbe4db"
            },
            "downloads": -1,
            "filename": "ditty-0.5.0.tar.gz",
            "has_sig": false,
            "md5_digest": "51ccdf095d23d51b70b603d137ea87f1",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": null,
            "size": 14308,
            "upload_time": "2024-05-02T15:42:47",
            "upload_time_iso_8601": "2024-05-02T15:42:47.845294Z",
            "url": "https://files.pythonhosted.org/packages/da/60/60f8d713984af59aafe3013e554969c66a123c128aff581996b2c18f9265/ditty-0.5.0.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-05-02 15:42:47",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "iantbutler01",
    "github_project": "ditty",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "requirements": [],
    "lcname": "ditty"
}
        