pali3

Name: pali3
Version: 0.0.7
Home page: https://github.com/kyegomez/pali3
Summary: pali3 - Pytorch
Upload time: 2023-10-25 16:56:36
Author: Kye Gomez
Requires Python: >=3.6,<4.0
License: MIT
Keywords: artificial intelligence, deep learning, optimizers, prompt engineering
Requirements: No requirements were recorded.
            [![Multi-Modality](agorabanner.png)](https://discord.gg/qUtxnK2NMf)

# Pali3
![pali](pali.png)

"Figure 1: Overview of the PaLI-3 (5B) model: images are encoded into visual tokens individually
by the contrastively pretrained 2B SigLIP vision model. Along with a query, these visual tokens
are passed to an 3B encoder-decoder UL2 Transformer which produces the desired answer."


ViT trained with SigLIP loss -> image embeddings -> UL2 encoder-decoder -> text tokens

text -> tokenizer -> text embeddings -> UL2 encoder-decoder -> text tokens

[ARXIV PAPER LINK](https://arxiv.org/pdf/2310.09199v1.pdf)

--------

## Installation

`pip install pali3`

-------

## Usage:

```python
import torch
from pali3.main import Pali3

model = Pali3()

img = torch.randn(1, 3, 256, 256)               # dummy image: (batch, channels, height, width)
prompt = torch.randint(0, 256, (1, 1024))       # prompt token ids
mask = torch.ones(1, 1024).bool()               # attention mask over the prompt tokens
output_text = torch.randint(0, 256, (1, 1024))  # target output token ids

result = model.process(img, prompt, output_text, mask)
print(result)
```


-------

## Architecture

Here is the ASCII representation of the model architecture and the stages of training:

```
Model Architecture:

Image Input
    |
    V
Contrastive Vision Encoder (ViT-G/14)
    |
    V
Transformer Encoder
    |
    V
Transformer Decoder
    |
    V
Text Output

Stages of Training:

Stage 0: Unimodal pretraining
    |
    V
Stage 1: Multimodal training
    |
    V
Stage 2: Resolution increase
    |
    V
Task specialization (transfer)

```


# Model Training Phases
The model architecture consists of a contrastive vision encoder (ViT-G/14) that encodes the image into tokens. These tokens are passed to a transformer encoder and then to a transformer decoder that generates a text output.

The training procedure consists of multiple stages:

-   Stage 0: Unimodal pretraining. The image encoder is pretrained contrastively on image-text pairs from the web, following the SigLIP training protocol. The text encoder-decoder is a 3B UL2 model trained following the mixture of denoisers procedure.

-   Stage 1: Multimodal training. The image encoder is combined with the text encoder-decoder and trained on a multimodal task and data mixture, keeping the image encoder frozen and using its native resolution.

-   Stage 2: Resolution increase. The resolution of the model is increased by fine-tuning the whole model with a short curriculum of increasing resolutions.

-   Task specialization (transfer). Finally, for each individual task, the model is fine-tuned with a frozen ViT image encoder on the task's training data.

Please note that this is a high-level representation and the actual implementation might involve more details and complexities.
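To make the frozen-encoder stages concrete, here is a minimal PyTorch sketch of freezing the image encoder while the rest of the model keeps training. The `vit` and `ul2` attribute names are hypothetical placeholders rather than the package's actual API, and the commented training loop assumes `process` returns a scalar loss.

```python
import torch

# Minimal sketch, assuming the model exposes hypothetical `vit` (image encoder)
# and `ul2` (text encoder-decoder) submodules; adapt the names to the real code.

def freeze_image_encoder(model):
    """Stop gradient updates for the contrastive vision encoder."""
    for param in model.vit.parameters():
        param.requires_grad = False

def build_optimizer(model, lr=1e-4):
    """Optimize only the parameters that are still trainable."""
    trainable = [p for p in model.parameters() if p.requires_grad]
    return torch.optim.AdamW(trainable, lr=lr)

# Usage (assuming `process` returns a scalar loss):
# freeze_image_encoder(model)
# optimizer = build_optimizer(model)
# loss = model.process(img, prompt, output_text, mask)
# loss.backward()
# optimizer.step()
```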



------

# ViT Architecture
Here is the ASCII diagram for the ViT (Vision Transformer):

```
ViT (Vision Transformer):

Image Input
    |
    V
Patch Extraction
    |
    V
Linear Embedding
    |
    V
Positional Encoding
    |
    V
Transformer Encoder Blocks (Multiple Layers)
    |
    V
Classification Head (Optional)
    |
    V
Output (Image Embeddings)

```

The ViT starts with patch extraction from the input image. These patches are then linearly embedded and positional encodings are added. The resulting sequence of patch embeddings is passed through multiple layers of transformer encoders. Optionally, a classification head can be added at the end to get class probabilities for image classification tasks. The output of the ViT is the image embeddings.
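To make the patch-extraction and linear-embedding steps concrete, here is a minimal, self-contained PyTorch sketch of a ViT-style patch embedding with learned positional encodings. It illustrates the general technique only; the sizes and module layout are assumptions, not the SigLIP ViT used in this repo.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into non-overlapping patches and linearly embed them."""

    def __init__(self, image_size=256, patch_size=16, in_channels=3, dim=768):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2
        # A strided convolution is equivalent to patch extraction + a shared linear layer
        self.proj = nn.Conv2d(in_channels, dim, kernel_size=patch_size, stride=patch_size)
        self.pos_embedding = nn.Parameter(torch.randn(1, num_patches, dim) * 0.02)

    def forward(self, x):
        x = self.proj(x)                  # (B, dim, H/ps, W/ps)
        x = x.flatten(2).transpose(1, 2)  # (B, num_patches, dim)
        return x + self.pos_embedding     # add learned positional encodings

tokens = PatchEmbedding()(torch.randn(1, 3, 256, 256))
print(tokens.shape)  # torch.Size([1, 256, 768])
```

These patch tokens are what the transformer encoder blocks then process.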

-------

# UL2 Encoder/Decoder Transformer
```
Encoder-Decoder Architecture:

Input (Image + Text Tokens)
    |
    V
Transformer Encoder
    |
    V
Encoder Output (Context for Decoder)
    |
    V
Transformer Decoder
    |
    V
Output (Generated Text)

```

The encoder-decoder architecture starts with the input, which is a combination of image and text tokens in this case. The input is passed through a transformer encoder, which generates a context for the decoder. The transformer decoder then uses this context to generate the output text.
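To illustrate how visual tokens can be prepended to the text embeddings before the encoder-decoder runs (one of the open items in the Todo list), here is a self-contained sketch using PyTorch's generic `nn.Transformer`. The dimensions, vocabulary size, and module layout are illustrative assumptions, not the repo's actual UL2 implementation.

```python
import torch
import torch.nn as nn

dim, vocab_size = 512, 256

text_embed = nn.Embedding(vocab_size, dim)
visual_proj = nn.Linear(768, dim)          # map ViT token width (assumed 768) to the text width
transformer = nn.Transformer(d_model=dim, batch_first=True)
to_logits = nn.Linear(dim, vocab_size)

visual_tokens = torch.randn(1, 256, 768)   # e.g. the output of a ViT patch encoder
prompt_ids = torch.randint(0, vocab_size, (1, 64))
target_ids = torch.randint(0, vocab_size, (1, 32))

# Prepend the projected visual tokens to the embedded prompt tokens,
# then encode the combined sequence and decode the target text from it.
encoder_input = torch.cat([visual_proj(visual_tokens), text_embed(prompt_ids)], dim=1)
decoder_input = text_embed(target_ids)
causal_mask = transformer.generate_square_subsequent_mask(target_ids.size(1))

hidden = transformer(encoder_input, decoder_input, tgt_mask=causal_mask)
logits = to_logits(hidden)                 # (1, 32, vocab_size)
print(logits.shape)
```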


# Dataset Strategy
Here is a table summarizing the key datasets mentioned in the paper along with their metadata and source links:

- Note: this table was generated with Claude, so the links may be inaccurate.

| Dataset | Type | Size | Tasks | Source |
|-|-|-|-|-|
| ImageNet-22k | Image Classification | 14M images, 21,841 classes | Pretraining | https://github.com/google-research-datasets/ImageNet-21k-P |
| MS COCO | Image Captioning, VQA | 330K images, 80 object categories | Evaluation | https://cocodataset.org | 
| Flickr30k | Image Captioning | 31K images | Evaluation | https://www.kaggle.com/dataset/flickr30k |
| VQAv2 | Visual QA | 204K images, 1.1M questions | Evaluation | https://visualqa.org/download.html |  
| GQA | Visual QA | 22M graph-based questions | Evaluation | https://cs.stanford.edu/people/dorarad/gqa/download.html |
| RefCOCO/RefCOCO+ | Referring Expression | 19,994/19,992 images | Evaluation | https://github.com/lichengunc/refer |
| TextCaps | Image Captioning | 31,014 images | Evaluation | https://textvqa.org/textcaps |
| TextVQA | Visual QA | 28,408 images | Evaluation | https://textvqa.org/index.html |
| STVQA | Visual QA | 249,991 QA pairs | Evaluation | https://tvqa.cs.unc.edu/ |
| OCR-VQA | Visual QA | 45,336 images | Evaluation | https://ocrvqa.cloudcv.org/ |
| DocVQA | Visual QA | 5,000 document images | Evaluation | https://github.com/doc-vqa/docvqa |
| InfographicVQA | Visual QA | 10,047 infographic images | Evaluation | https://github.com/doc-vqa/InfoVQA |
| WebLI | Image-Text Pairs | 72M image-text pairs in 100+ languages | Pretraining | https://laion.ai/blogs/webli/ |
| JFT-300M | Image Classification | 303M images, 18,291 classes | Pretraining | https://github.com/google-research-datasets/jft300m |
| CrossModal-3600 | Image-Text Retrieval | 31K images, 3600 lang-image pairs | Evaluation | https://laion.ai/crossmodal-3600/ |

-----

# License
MIT

# Todo

- [x] Implement sig_lip vit model with training recipe
- [x] Implement the text tokenizer, maybe use token monster 
- [x] Implement the UL2 Transformer Encoder and Decoder
- [ ] Implement the pooling layer after the ViT, followed by a linear projection (see the sketch below)
- [ ] Implement prepending the visual token embeddings to the text embeddings
- [ ] Implement training scripts for the full pali3 model
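
For the pooling item above, here is a minimal sketch of a pooling head: the ViT token sequence is reduced to a single vector and passed through a linear projection. Mean pooling is used for simplicity (the SigLIP recipe uses an attention/MAP pooling head instead), and all names and dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class PoolingHead(nn.Module):
    """Pool ViT output tokens into one vector, then project it linearly."""

    def __init__(self, dim=768, out_dim=512):
        super().__init__()
        self.proj = nn.Linear(dim, out_dim)

    def forward(self, tokens):          # tokens: (batch, num_tokens, dim)
        pooled = tokens.mean(dim=1)     # simple mean pooling over the token axis
        return self.proj(pooled)        # (batch, out_dim)

embedding = PoolingHead()(torch.randn(1, 256, 768))
print(embedding.shape)  # torch.Size([1, 512])
```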


# Citation

```bibtex
@misc{2310.09199,
Author = {Xi Chen and Xiao Wang and Lucas Beyer and Alexander Kolesnikov and Jialin Wu and Paul Voigtlaender and Basil Mustafa and Sebastian Goodman and Ibrahim Alabdulmohsin and Piotr Padlewski and Daniel Salz and Xi Xiong and Daniel Vlasic and Filip Pavetic and Keran Rong and Tianli Yu and Daniel Keysers and Xiaohua Zhai and Radu Soricut},
Title = {PaLI-3 Vision Language Models: Smaller, Faster, Stronger},
Year = {2023},
Eprint = {arXiv:2310.09199},
}
```




            
