simple-clip

Name	simple-clip
Version	0.2.0
home_page	https://github.com/filipbasara0/simple-clip
Summary	A minimal, but effective implementation of CLIP (Contrastive Language-Image Pretraining) in PyTorch
upload_time	2024-01-24 19:40:32
author	Filip Basara
license	MIT
keywords	machine learning, pytorch, self-supervised learning, representation learning, contrastive learning

# simple-clip
Simple implementation of [CLIP](https://arxiv.org/abs/2103.00020) (Contrastive Language-Image Pretraining) in PyTorch.

<img width="1099" alt="3d5d1009-6e3d-4570-8fd9-ee8f588003e7" src="https://github.com/filipbasara0/simple-clip/assets/29043871/27e708ac-0ced-4382-bcc4-e0db5fc2d115">

# CLIP
[CLIP](https://arxiv.org/abs/2103.00020) (Contrastive Language-Image Pretraining) by OpenAI is a model that unifies text and image understanding through a contrastive learning approach. It employs two neural networks, one for image processing and another for text processing, which are jointly trained on a large dataset of images and their corresponding textual descriptions. This training enables the model to understand and link visual content with natural language. CLIP's distinctive feature is its zero-shot learning capability, allowing it to generalize across various visual tasks without task-specific training, solely based on textual prompts. This makes it highly adaptable for diverse applications in AI, from image classification to complex visual reasoning tasks.
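
As a rough illustration of the contrastive objective described above, the sketch below computes a symmetric cross-entropy loss over a batch of paired embeddings. It is not taken from this repository: the function name `clip_contrastive_loss`, the fixed `temperature` of 0.07, and the random tensors standing in for encoder outputs are all assumptions made for the example.

```python
import torch
import torch.nn.functional as F


def clip_contrastive_loss(image_emb: torch.Tensor,
                          text_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Symmetric cross-entropy over image-text similarities (illustrative only)."""
    # L2-normalize so that the dot product is a cosine similarity.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # logits[i, j] = similarity of image i and text j, scaled by the temperature.
    logits = image_emb @ text_emb.t() / temperature

    # Matching pairs sit on the diagonal, so the target class for row i is i.
    targets = torch.arange(logits.size(0), device=logits.device)

    loss_i2t = F.cross_entropy(logits, targets)      # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)  # text -> image direction
    return (loss_i2t + loss_t2i) / 2


if __name__ == "__main__":
    torch.manual_seed(0)
    # Random stand-ins for encoder outputs: 8 pairs of 128-dimensional embeddings.
    images = torch.randn(8, 128)
    texts = torch.randn(8, 128)
    print(clip_contrastive_loss(images, texts).item())
```

In a real training loop the two inputs would come from the image and text encoders, and the temperature is usually a learnable parameter rather than a constant.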

This implementation also supports the sigmoid pairwise loss from the [SigLIP](https://arxiv.org/abs/2303.15343) paper. With this loss, the model seems to converge more slowly, but it eventually reaches results similar to those of the contrastive loss. To use the SigLIP loss, specify `--use_siglip` when running the `train_clip` command.
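
For comparison, a minimal sketch of a sigmoid pairwise loss in the style of the SigLIP paper is shown below. It treats every image-text pair in the batch as an independent binary classification problem. The function name `siglip_loss` is hypothetical, and `temperature=10.0` and `bias=-10.0` are fixed here for simplicity (the paper uses these as initial values of learnable parameters); this is not necessarily how the repository implements it.

```python
import torch
import torch.nn.functional as F


def siglip_loss(image_emb: torch.Tensor,
                text_emb: torch.Tensor,
                temperature: float = 10.0,
                bias: float = -10.0) -> torch.Tensor:
    """Sigmoid pairwise loss over all image-text pairs (illustrative only)."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Scaled and shifted cosine similarities for every image-text combination.
    logits = image_emb @ text_emb.t() * temperature + bias

    n = logits.size(0)
    # +1 on the diagonal (matching pairs), -1 elsewhere (non-matching pairs).
    labels = 2 * torch.eye(n, device=logits.device) - 1

    # Sum the per-pair binary log-likelihoods and average over the batch size.
    return -F.logsigmoid(labels * logits).sum() / n


if __name__ == "__main__":
    torch.manual_seed(0)
    images = torch.randn(8, 128)
    texts = torch.randn(8, 128)
    print(siglip_loss(images, texts).item())
```

Unlike the softmax-based contrastive loss, this formulation needs no normalization across the batch, since each pair is scored on its own.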

# Results

All experiments used ResNet50 as the image encoder and DistilBERT as the text encoder. Models were first trained on smaller datasets, such as COCO, to validate the approach. Later on, they were trained on combined COCO and sbucaptions data and a yfcc7m subset.

Models were evaluated in zero-shot fashion, where text queries were constructed as "a photo of {label_name}". For ImageNet, we used the 50k validation dataset.
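
The zero-shot protocol can be sketched as follows: build one prompt per class, embed the prompts and the images, and rank classes by cosine similarity. The helper names `build_prompts` and `zero_shot_topk_accuracy` are hypothetical, and the embeddings are passed in as plain tensors because the package's encoder API is not shown here; the linked notebooks contain the actual evaluation code.

```python
import torch
import torch.nn.functional as F


def build_prompts(label_names):
    # Prompt template taken from the evaluation description above.
    return [f"a photo of {name}" for name in label_names]


def zero_shot_topk_accuracy(image_emb: torch.Tensor,
                            class_text_emb: torch.Tensor,
                            labels: torch.Tensor,
                            ks=(1, 5)) -> dict:
    """Top-k accuracy (in %) when classes are ranked by cosine similarity."""
    image_emb = F.normalize(image_emb, dim=-1)
    class_text_emb = F.normalize(class_text_emb, dim=-1)

    # sims[i, c] = similarity between image i and the prompt for class c.
    sims = image_emb @ class_text_emb.t()

    # Indices of the max(ks) most similar classes for each image.
    topk = sims.topk(max(ks), dim=-1).indices
    hits = topk == labels.unsqueeze(-1)

    # An image counts as correct at k if its label is among the first k classes.
    return {k: hits[:, :k].any(dim=-1).float().mean().item() * 100 for k in ks}


if __name__ == "__main__":
    print(build_prompts(["airplane", "bird"]))
    torch.manual_seed(0)
    # Random stand-ins: 4 images, 10 classes, 64-dimensional embeddings.
    acc = zero_shot_topk_accuracy(torch.randn(4, 64), torch.randn(10, 64),
                                  torch.tensor([0, 3, 7, 9]))
    print(acc)
```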

ImageNet results surpassed the [zero-shot scaling trend](https://github.com/mlfoundations/open_clip/blob/main/docs/LOW_ACC.md) by a few points, signalling the potential of smaller but more diverse and information-dense datasets. This is in line with https://arxiv.org/abs/2205.01397, where the authors determined that the main contributing factor to model quality and robustness for the CLIP objective is a more diverse training distribution. In other words, data quality and diversity >> data quantity.

| Training Datasets           | Training steps  | Text Encoder            | Image Encoder | Eval dataset | Top1 % | Top5 % | Top10 % |
|-----------------------------|-----------------|-------------------------|---------------|--------------|--------|--------|---------|
| yfcc7m + coco + sbucaptions | 57,800          | distilbert-base-uncased | ResNet-50     | STL-10       | 93.75  | -      | -       |
| yfcc7m + coco + sbucaptions | 57,800          | distilbert-base-uncased | ResNet-50     | ImageNet     | 37.10  | 63.04  | 71.70   |

The trained CLIP model can be found [here](https://drive.google.com/file/d/1UnakTzwVYE0x2A6rPNaK2OhypVBOM1zI/view?usp=sharing).

The `yfcc7m + coco + sbucaptions` dataset has around 8M samples in total, of which 7M come from `yfcc7m`, 810k from `sbucaptions` and 110k from `coco`.

Links to notebooks with [ImageNet](https://github.com/filipbasara0/simple-clip/blob/main/notebooks/zero-shot-imagenet1k.ipynb) and [STL](https://github.com/filipbasara0/simple-clip/blob/main/notebooks/zero-shot-stl.ipynb) results.

# Usage

### Installation
```bash
$ pip install simple-clip
```

The code currently supports ResNet18, ResNet50 and an experimental version of the EfficientNet model as image encoders. ResNet50 was used as the image encoder in all experiments.
DistilBERT (`distilbert-base-uncased`) was used as the text encoder in all experiments.

Supported datasets are textcap, coco, sbucaptions and yfcc7m.

### Examples
`yfcc7m` CLIP was trained with this command (around 7M samples):

`train_clip --dataset_name yfcc7m --fp16_precision --batch_size 256  --log_every_n_steps 50 --image_size 224 --learning_rate 1e-4 --imagenet_eval`

Combined `coco + textcaptions + sbucaptions` CLIP was trained using (around 1M samples):

`train_clip --dataset_name combined --fp16_precision --batch_size 256  --log_every_n_steps 50 --image_size 224 --learning_rate 1e-4 --imagenet_eval`


### Detailed options
Once the code is set up, run the following command with the options listed below:
`train_clip [args...]`

```
options:
  -h, --help            show this help message and exit
  --dataset_path DATASET_PATH
                        Path where datasets will be saved
  --dataset_name {textcap,coco,sbucaptions,combined,yfcc7m}
                        Dataset name
  --image_encoder_name {resnet18,resnet50,efficientnet}
                        image model architecture: resnet18, resnet50 or efficientnet (default: resnet50)
  --text_encoder_name {distilbert-base-uncased}
                        text model architecture: distilbert-base-uncased (default: distilbert-base-uncased)
  -save_model_dir SAVE_MODEL_DIR
                        Path where models
  --num_epochs NUM_EPOCHS
                        Number of epochs for training
  --image_size IMAGE_SIZE
                        Image size
  -b BATCH_SIZE, --batch_size BATCH_SIZE
                        Batch size
  -lr LEARNING_RATE, --learning_rate LEARNING_RATE
  -wd WEIGHT_DECAY, --weight_decay WEIGHT_DECAY
  --fp16_precision      Whether to use 16-bit precision for GPU training
  --imagenet_eval       Whether to evaluate on imagenet validation dataset. Required huggingface imagenet-1k dataset.
  --imagenet_eval_steps IMAGENET_EVAL_STEPS
                        Evaluate on imagenet every N steps
  --log_every_n_steps LOG_EVERY_N_STEPS
                        Log every n steps
  --ckpt_path CKPT_PATH
                        Specify path to relic_model.pth to resume training
  --use_siglip          Whether to use the SigLIP loss
```

# Citation
```
@misc{radford2021learning,
      title={Learning Transferable Visual Models From Natural Language Supervision}, 
      author={Alec Radford and Jong Wook Kim and Chris Hallacy and Aditya Ramesh and Gabriel Goh and Sandhini Agarwal and Girish Sastry and Amanda Askell and Pamela Mishkin and Jack Clark and Gretchen Krueger and Ilya Sutskever},
      year={2021},
      eprint={2103.00020},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

@misc{zhai2023sigmoid,
      title={Sigmoid Loss for Language Image Pre-Training}, 
      author={Xiaohua Zhai and Basil Mustafa and Alexander Kolesnikov and Lucas Beyer},
      year={2023},
      eprint={2303.15343},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
```

Raw data

{
    "_id": null,
    "home_page": "https://github.com/filipbasara0/simple-clip",
    "name": "simple-clip",
    "maintainer": "",
    "docs_url": null,
    "requires_python": "",
    "maintainer_email": "",
    "keywords": "machine learning,pytorch,self-supervised learning,representation learning,contrastive learning",
    "author": "Filip Basara",
    "author_email": "basarafilip@gmail.com",
    "download_url": "https://files.pythonhosted.org/packages/62/a2/933060c2ca17637bb9f8115865bbde40660366ec48216924119ee9ba5311/simple-clip-0.2.0.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "A minimal, but effective implementation of CLIP (Contrastive Language-Image Pretraining) in PyTorch",
    "version": "0.2.0",
    "project_urls": {
        "Homepage": "https://github.com/filipbasara0/simple-clip"
    },
    "split_keywords": [
        "machine learning",
        "pytorch",
        "self-supervised learning",
        "representation learning",
        "contrastive learning"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "5105aa77dfa34cb0c3269547057e6c256b90781070a8a0e5696539b5798352a3",
                "md5": "2b9b704d7c15a70d10a6da82298fcefd",
                "sha256": "7d28fd79f248fa469a84a9aae9aa371d4afc0c161abab662ebcc7cefd02dd68d"
            },
            "downloads": -1,
            "filename": "simple_clip-0.2.0-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "2b9b704d7c15a70d10a6da82298fcefd",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": null,
            "size": 14625,
            "upload_time": "2024-01-24T19:40:30",
            "upload_time_iso_8601": "2024-01-24T19:40:30.700585Z",
            "url": "https://files.pythonhosted.org/packages/51/05/aa77dfa34cb0c3269547057e6c256b90781070a8a0e5696539b5798352a3/simple_clip-0.2.0-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "62a2933060c2ca17637bb9f8115865bbde40660366ec48216924119ee9ba5311",
                "md5": "816101117b7a9de690fe0e199777b17e",
                "sha256": "55070f3e4b2f211195e1198e82d732b135a5c5087cbb787747f84ec9aaa714ce"
            },
            "downloads": -1,
            "filename": "simple-clip-0.2.0.tar.gz",
            "has_sig": false,
            "md5_digest": "816101117b7a9de690fe0e199777b17e",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": null,
            "size": 14333,
            "upload_time": "2024-01-24T19:40:32",
            "upload_time_iso_8601": "2024-01-24T19:40:32.600997Z",
            "url": "https://files.pythonhosted.org/packages/62/a2/933060c2ca17637bb9f8115865bbde40660366ec48216924119ee9ba5311/simple-clip-0.2.0.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-01-24 19:40:32",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "filipbasara0",
    "github_project": "simple-clip",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "lcname": "simple-clip"
}
