paella

Name	paella
Version	0.0.1.dev1
home_page
Summary	paella - Fast Text-Conditional Discrete Denoising on Vector-Quantized Latent Spaces
upload_time	2023-04-16 16:31:48
maintainer
docs_url	None
author	Juncong Moo
requires_python
license
keywords	paella
VCS
bugtrack_url
requirements	No requirements were recorded.

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1geY_Z8m8dyjrky6uwiMepwySTWkVYl1j?usp=sharing)
![LAION Blog Post](https://user-images.githubusercontent.com/61938694/232235929-94dacf4a-b3f6-4359-901b-500781f55c12.png)

# Paella
Conditional text-to-image generation has seen countless recent improvements in terms of quality, diversity and fidelity. Nevertheless, most state-of-the-art models require numerous inference steps to produce faithful generations, resulting in performance bottlenecks for end-user applications. In this paper we introduce Paella, a novel text-to-image model that requires fewer than 10 steps to sample high-fidelity images. Its speed-optimized architecture can sample a single image in less than 500 ms while having 573M parameters. The model operates on a compressed and quantized latent space, is conditioned on CLIP embeddings, and uses an improved sampling function over previous works. Aside from text-conditional image generation, our model is able to do latent space interpolation and image manipulations such as inpainting, outpainting, and structural editing.
<br>
<br>
![collage](https://user-images.githubusercontent.com/61938694/231021615-38df0a0a-d97e-4f7a-99d9-99952357b4b1.png)

## Update 12.04
Since the paper's release, we have worked intensively to bring Paella to a level similar to other 
state-of-the-art models. With this release we are coming a step closer to that goal. However, our main intention is not
to make the greatest text-to-image model out there (at least for now); it is to bring text-to-image models closer
to people outside the field on a technical level. For example, a lot of models have codebases with many thousands of lines 
of code, which makes it very hard for people to dive into the code and easily understand it. Making that easier is the contribution
we are proudest of with Paella. The training and sampling code for Paella is minimalistic and can be understood in 
a few minutes, which makes further extensions, quick tests, and trying out ideas extremely fast. For instance, the entire
sampling code can be written in just **12 lines** of code.


Please find all details about the model and how it was trained in our [preprint on arXiv](https://arxiv.org/pdf/2211.07292.pdf).
<hr>

## Code
We especially want to highlight how little code is necessary to run and train Paella. 
The training and sampling code fits in under 140 lines. We hope to make the field of generative AI, and especially 
text-to-image, more accessible to more people this way. To understand just the basic logic, take a look 
at the [main folder](https://github.com/dome272/Paella/tree/main/src). For a more advanced training script, 
including mixed precision, distributed training, better logging and all conditioning models, take a look at the 
[distributed folder](https://github.com/dome272/Paella/tree/main/src_distributed).

## Models
| Model           | Download                                             | Parameters      | Conditioning                       |
|-----------------|------------------------------------------------------|-----------------|------------------------------------|
| Paella v3       | [Huggingface](https://huggingface.co/dome272/Paella) | 1B (+1B prior)  | ByT5-XL, CLIP-H-Text, CLIP-H-Image |

## Sampling
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1geY_Z8m8dyjrky6uwiMepwySTWkVYl1j?usp=sharing)

For sampling, take a look at the [paella_inference.ipynb](https://github.com/dome272/Paella/blob/main/paella_inference.ipynb) notebook. :sunglasses: <br>
**Note**: Since we condition on ByT5-XL, CLIP-H-Text and CLIP-H-Image, sampling with the model unfortunately takes at least 30GB of RAM.
We are hoping to use smaller conditioning models in the future.
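
To make the idea of sampling in a handful of steps more concrete, here is a toy sketch of iterative denoising over discrete tokens. It is not the code from the notebook: `sample_tokens`, `toy_predict`, the linear re-noising schedule and the four-entry codebook are all assumptions made for illustration, and text conditioning is left out entirely. A real run would replace `toy_predict` with the trained model and decode the final tokens back to pixels.

```python
import random

CODEBOOK_SIZE = 4          # hypothetical; a real VQ codebook is much larger
TARGET = [2, 0, 3, 1]      # tokens the toy predictor always votes for


def toy_predict(tokens):
    """Hypothetical stand-in for the model: one weight list per position."""
    return [[1.0 if k == t else 0.0 for k in range(CODEBOOK_SIZE)] for t in TARGET]


def sample_tokens(predict, num_tokens, codebook_size, steps=8, seed=0):
    rng = random.Random(seed)
    # Start from pure noise: every position holds a random codebook index.
    tokens = [rng.randrange(codebook_size) for _ in range(num_tokens)]
    for step in range(steps):
        # Predict a distribution for every position at once and sample from it.
        weights = predict(tokens)
        tokens = [rng.choices(range(codebook_size), weights=w)[0] for w in weights]
        # Assumed linear schedule: re-noise fewer positions each step, none at the end.
        noise_ratio = 1.0 - (step + 1) / steps
        for i in range(num_tokens):
            if rng.random() < noise_ratio:
                tokens[i] = rng.randrange(codebook_size)
    return tokens


result = sample_tokens(toy_predict, len(TARGET), CODEBOOK_SIZE)
print(result)
```

Because the toy predictor is one-hot and the last step adds no noise, `result` ends up equal to `TARGET`. With a real model, each step would instead refine a noisy guess.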

## Train your own Paella
Depending on how you want to train Paella, we provide code for running it on a 
[single GPU](https://github.com/dome272/Paella/tree/main/src) or for 
[multi-GPU / multi-node training](https://github.com/dome272/Paella/tree/main/src_distributed).
The main file for training is [train.py](https://github.com/dome272/Paella/blob/main/src/train.py). You can adjust all 
[hyperparameters](https://github.com/dome272/Paella/blob/main/src/train.py#L10) to your own needs. 
The distributed training code includes a [webdataset](https://github.com/webdataset/webdataset/) dataloader,
whereas in the single-GPU code you have to set your [own dataloader](https://github.com/dome272/Paella/blob/main/src/utils.py#L19).
Make sure it returns a tuple of ```(images, captions)``` where ```images``` is a ```torch.Tensor``` of shape 
```batch_size x channels x height x width``` and ```captions``` is a ```List``` of length ```batch_size```. To start the
training, run ```python3 train.py``` for the single-GPU case. For the multi-GPU case, we provide a
[slurm](https://slurm.schedmd.com/documentation.html) script for launching the training, which you can find 
[here](https://github.com/dome272/Paella/blob/main/src_distributed/run/run.sh).
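
The dataloader contract described above can be sketched as follows. `ToyCaptionDataset` and `collate` are hypothetical names, and the random images and dummy captions are placeholders for your own data; only the returned `(images, captions)` structure is taken from the description above. The sketch assumes PyTorch is installed.

```python
import torch
from torch.utils.data import DataLoader, Dataset


class ToyCaptionDataset(Dataset):
    """Hypothetical stand-in dataset: random images paired with dummy captions."""

    def __init__(self, num_items=8, channels=3, size=64):
        self.images = torch.rand(num_items, channels, size, size)
        self.captions = [f"caption {i}" for i in range(num_items)]

    def __len__(self):
        return len(self.captions)

    def __getitem__(self, idx):
        return self.images[idx], self.captions[idx]


def collate(batch):
    # Stack per-item images into batch_size x channels x height x width
    # and keep the captions as a plain Python list.
    images = torch.stack([image for image, _ in batch])
    captions = [caption for _, caption in batch]
    return images, captions


dataloader = DataLoader(ToyCaptionDataset(), batch_size=4, collate_fn=collate)
images, captions = next(iter(dataloader))
print(images.shape, len(captions))
```

A custom `collate_fn` is used here so that each batch is an explicit tuple with the captions kept as a list of strings.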


### License
The model code and weights are released under the [MIT license](https://github.com/dome272/Paella/blob/main/LICENSE).

Raw data

{
    "_id": null,
    "home_page": "",
    "name": "paella",
    "maintainer": "",
    "docs_url": null,
    "requires_python": "",
    "maintainer_email": "",
    "keywords": "paella",
    "author": "Juncong Moo",
    "author_email": "",
    "download_url": "",
    "platform": null,
    "bugtrack_url": null,
    "license": "",
    "summary": "paella - Fast Text-Conditional Discrete Denoising on Vector-Quantized Latent Spaces",
    "version": "0.0.1.dev1",
    "split_keywords": [
        "paella"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "a81985e827438e53bebdd71fec6465ebff335ccf1c9cc06c2c2e9a20de3dc46d",
                "md5": "e09b21abebce5e4ade3840e23f8c72e4",
                "sha256": "fb3a61199ddf8527b320ce17a0db2ce5d1c3ce88434533460c070a1327c600ff"
            },
            "downloads": -1,
            "filename": "paella-0.0.1.dev1-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "e09b21abebce5e4ade3840e23f8c72e4",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": null,
            "size": 26128,
            "upload_time": "2023-04-16T16:31:48",
            "upload_time_iso_8601": "2023-04-16T16:31:48.473184Z",
            "url": "https://files.pythonhosted.org/packages/a8/19/85e827438e53bebdd71fec6465ebff335ccf1c9cc06c2c2e9a20de3dc46d/paella-0.0.1.dev1-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2023-04-16 16:31:48",
    "github": false,
    "gitlab": false,
    "bitbucket": false,
    "lcname": "paella"
}
