<!-- #  DreamSim Perceptual Metric -->
<!-- # DreamSim Perceptual Metric <img src="images/figs/icon.png" align="left" width="50px"/> -->
# DreamSim: Learning New Dimensions of Human Visual Similarity using Synthetic Data
### [Project Page](https://dreamsim-nights.github.io/) | [Paper](https://arxiv.org/abs/2306.09344) | [Bibtex](#bibtex)
[Open in Colab](https://colab.research.google.com/drive/1taEOMzFE9g81D9AwH27Uhy2U82tQGAVI?usp=sharing)
[Stephanie Fu](https://stephanie-fu.github.io)\* $^{1}$, [Netanel Tamir](https://netanel-tamir.github.io)\* $^{2}$, [Shobhita Sundaram](https://ssundaram21.github.io)\* $^{1}$, [Lucy Chai](https://people.csail.mit.edu/lrchai/) $^1$, [Richard Zhang](http://richzhang.github.io) $^3$, [Tali Dekel](https://www.weizmann.ac.il/math/dekel/) $^2$, [Phillip Isola](https://web.mit.edu/phillipi/) $^1$.<br>
(*equal contribution, order decided by random seed)<br>
$^1$ MIT, $^2$ Weizmann Institute of Science, $^3$ Adobe Research.

**Summary**
Current metrics for perceptual image similarity operate at the level of pixels and patches. These metrics compare images in terms of their low-level colors and textures, but fail to capture mid-level differences in layout, pose, semantic content, etc. Models that use image-level embeddings such as DINO and CLIP capture high-level and semantic judgements, but may not be aligned with human perception of more fine-grained attributes.
DreamSim is a new metric for perceptual image similarity that bridges the gap between "low-level" metrics (e.g. LPIPS, PSNR, SSIM) and "high-level" measures (e.g. CLIP). Our model was trained by concatenating CLIP, OpenCLIP, and DINO embeddings, and then finetuning on human perceptual judgements. We gathered these judgements on a dataset of ~20k image triplets, generated by diffusion models. Our model achieves better alignment with human similarity judgements than existing metrics, and can be used for downstream applications such as image retrieval.
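For intuition, here is a minimal sketch of the kind of triplet objective this setup implies (illustrative only, not the released training code): each human judgement says which of two images is closer to a reference, and the finetuned embedding is pushed to agree. The margin value, function names, and the `vote` encoding below are assumptions.
```
# Minimal sketch of a triplet-style perceptual objective (illustrative only).
# `embed_*` stand for embeddings from the finetuned ensemble backbone.
import torch
import torch.nn.functional as F

def perceptual_distance(embed_a, embed_b):
    # Distance = 1 - cosine similarity between image embeddings.
    return 1 - F.cosine_similarity(embed_a, embed_b, dim=-1)

def triplet_hinge_loss(embed_ref, embed_0, embed_1, vote, margin=0.05):
    # vote = +1 if humans judged image 0 closer to the reference, -1 if image 1 was closer.
    d0 = perceptual_distance(embed_ref, embed_0)
    d1 = perceptual_distance(embed_ref, embed_1)
    # Encourage the human-preferred image to be closer by at least `margin`.
    return torch.clamp(margin - vote * (d1 - d0), min=0).mean()
```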
## 🚀 Newest Updates
**10/14/24:** We released 4 new variants of DreamSim! These new checkpoints are:
- DINOv2 B/14 and SynCLR B/16 as backbones
- Models trained with the original contrastive loss on both CLS and dense features.
These models (and the originals) are further evaluated in **our new NeurIPS 2024 paper, When Does Perceptual Alignment Benefit Vision Representations?**
We find that our perceptually-aligned models outperform the baseline models on a variety of standard computer vision tasks, including
semantic segmentation, depth estimation, object counting, instance retrieval, and retrieval-augmented generation. These results
point towards perceptual alignment being a useful task for learning general-purpose vision representations. See the paper and our [blog post](https://percep-align.github.io)
for more details.
Here's how they perform on NIGHTS:
| Model | NIGHTS - Val | NIGHTS - Test |
|-------------------|--------------|---------------|
| `ensemble` | 96.9% | 96.2% |
| `dino_vitb16` | 95.6% | 94.8% |
| `open_clip_vitb32` | 95.6% | 95.3% |
| `clip_vitb32`| 94.9% | 93.6% |
| `dinov2_vitb14` | 94.9% | 95.0% |
| `synclr_vitb16`| 96.0% | 95.9% |
| `dino_vitb16 (patch)` | 94.9% | 94.8% |
| `dinov2_vitb14 (patch)`| 95.5% | 95.1% |
**9/14/24:** We released new versions of the ensemble and single-branch DreamSim models compatible with `peft>=0.2.0`.
We also released the entire [100k (unfiltered) NIGHTS dataset](#new-download-the-entire-100k-pre-filtered-nights-dataset) and the [JND (Just-Noticeable Difference) votes](#new-download-the-jnd-data).
## Table of Contents
* [Requirements](#requirements)
* [Setup](#setup)
* [Usage](#usage)
* [Quickstart](#quickstart-perceptual-similarity-metric)
* [Single-branch models](#single-branch-models)
* [Feature extraction](#feature-extraction)
* [Image retrieval](#image-retrieval)
* [Perceptual loss function](#perceptual-loss-function)
* [NIGHTS Dataset](#nights-novel-image-generations-with-human-tested-similarities-dataset)
* [Experiments](#experiments)
* [Citation](#citation)
## Requirements
- Linux
- Python 3
## Setup
**Option 1:** Install using pip:
```pip install dreamsim```
This option is sufficient if you only need to import and use the DreamSim model.
**Option 2:** Clone our repo and install dependencies.
This is necessary for running our training/evaluation scripts.
```
python3 -m venv ds
source ds/bin/activate
pip install -r requirements.txt
export PYTHONPATH="$PYTHONPATH:$(realpath ./dreamsim)"
```
To install with conda:
```
conda create -n ds
conda activate ds
conda install pip # verify with the `which pip` command
pip install -r requirements.txt
export PYTHONPATH="$PYTHONPATH:$(realpath ./dreamsim)"
```
## Usage
**For walk-through examples of the below use-cases, check out our [Colab demo](https://colab.research.google.com/drive/1taEOMzFE9g81D9AwH27Uhy2U82tQGAVI?usp=sharing).**
### Quickstart: Perceptual similarity metric
The basic use case is to measure the perceptual distance between two images. **A higher score means the images are more different; a lower score means they are more similar.**
The following code snippet is all you need. The first time you run `dreamsim`, it automatically downloads the model weights. The default model settings are specified in `./dreamsim/config.py`.
```
from dreamsim import dreamsim
from PIL import Image
device = "cuda"
model, preprocess = dreamsim(pretrained=True, device=device)
img1 = preprocess(Image.open("img1_path")).to(device)
img2 = preprocess(Image.open("img2_path")).to(device)
distance = model(img1, img2) # The model takes RGB images in [0, 1] with shape (batch_size, 3, 224, 224)
```
To run on example images, run `demo.py`. The script should produce distances (0.4453, 0.2756).
### Single-branch models
By default, DreamSim uses an ensemble of CLIP, DINO, and OpenCLIP (all ViT-B/16). If you need a lighter-weight model, you can use *single-branch* versions of DreamSim where only a single backbone is finetuned. **The single-branch models provide a ~3x speedup over the ensemble.**
The available options are OpenCLIP-ViTB/32, DINO-ViTB/16, CLIP-ViTB/32, in order of performance. To load a single-branch model, use the `dreamsim_type` argument. For example:
```
dreamsim_dino_model, preprocess = dreamsim(pretrained=True, dreamsim_type="dino_vitb16")
```
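The returned model and preprocessor are then used exactly like the ensemble in the quickstart above; for example (paths are placeholders, assuming the default CUDA device):
```
img1 = preprocess(Image.open("img1_path")).to("cuda")
img2 = preprocess(Image.open("img2_path")).to("cuda")
distance = dreamsim_dino_model(img1, img2)
```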
### Feature extraction
To extract a *single image embedding* using dreamsim, use the `embed` method as shown in the following snippet:
```
img1 = preprocess(Image.open("img1_path")).to("cuda")
embedding = model.embed(img1)
```
The perceptual distance between two images is the cosine distance between their embeddings. If the embeddings are normalized (true by default), L2 distance can also be used.
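For example, the quickstart distance can be reproduced from the embeddings directly. A minimal sketch, assuming the default normalized embeddings and reusing `img1`/`img2` from the quickstart:
```
import torch.nn.functional as F

emb1 = model.embed(img1)
emb2 = model.embed(img2)

# Cosine distance between embeddings reproduces model(img1, img2).
cosine_dist = 1 - F.cosine_similarity(emb1, emb2, dim=-1)

# With unit-norm embeddings, squared L2 distance equals 2 * cosine distance,
# so ranking by L2 distance gives the same ordering.
l2_dist = (emb1 - emb2).norm(dim=-1)
```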
### Image retrieval
Our model can be used for image retrieval and plugged into existing retrieval pipelines. The code below ranks a dataset of images by their similarity to a given query image.
To speed things up, instead of directly calling `model(query, image)` for each pair, we use the `model.embed(image)` method to pre-compute single-image embeddings, and then take the cosine distance between embedding pairs.
```
import pandas as pd
from tqdm import tqdm
import torch.nn.functional as F

# let query be a sample image.
# let images be a list of images we are searching.

# Compute the query image embedding
query_embed = model.embed(preprocess(query).to("cuda"))
dists = {}

# Compute the (cosine) distance between the query and each search image
for i, im in tqdm(enumerate(images), total=len(images)):
    img_embed = model.embed(preprocess(im).to("cuda"))
    dists[i] = (1 - F.cosine_similarity(query_embed, img_embed, dim=-1)).item()

# Collect the results sorted by distance (most similar first)
df = pd.DataFrame({"ids": list(dists.keys()), "dists": list(dists.values())})
results = df.sort_values(by="dists")
```
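For larger galleries, the same idea can be batched: cache all gallery embeddings once and score them against the query in a single operation. A minimal sketch of this variant (our addition, assuming `model.embed` returns a `[1, dim]` tensor, as it is used element-wise in the loop above):
```
import torch
import torch.nn.functional as F

# Pre-compute and cache all gallery embeddings (shape: [num_images, dim]).
gallery = torch.cat([model.embed(preprocess(im).to("cuda")) for im in images], dim=0)

# Score the whole gallery against the query at once.
query_embed = model.embed(preprocess(query).to("cuda"))
dists = 1 - F.cosine_similarity(query_embed, gallery, dim=-1)

# Indices of the most similar images first.
ranking = dists.argsort()
```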
### Perceptual loss function
Our model can be used as a loss function for iterative optimization (similarly to the LPIPS metric). These are the key lines; for the full example, refer to the [Colab](https://colab.research.google.com/drive/1taEOMzFE9g81D9AwH27Uhy2U82tQGAVI?usp=sharing).
```
for i in range(n_iters):
    dist = model(predicted_image, reference_image)
    dist.backward()
    optimizer.step()
```
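For context, here is a slightly fuller sketch of such a loop with the setup lines the excerpt omits; the initialization, learning rate, and variable names are assumptions, not the Colab's exact code:
```
import torch

# Start from some initial image and optimize it toward the reference in DreamSim space.
predicted_image = initial_image.clone().requires_grad_(True)  # shape (1, 3, 224, 224), values in [0, 1]
optimizer = torch.optim.Adam([predicted_image], lr=0.01)

for i in range(n_iters):
    optimizer.zero_grad()
    dist = model(predicted_image, reference_image)  # DreamSim distance used as the loss
    dist.backward()
    optimizer.step()
```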
<a name="bibtex"></a>
## Citation
If you find our work or any of our materials useful, please cite our paper:
```
@misc{fu2023dreamsim,
title={DreamSim: Learning New Dimensions of Human Visual Similarity using Synthetic Data},
author={Stephanie Fu and Netanel Tamir and Shobhita Sundaram and Lucy Chai and Richard Zhang and Tali Dekel and Phillip Isola},
year={2023},
eprint={2306.09344},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
```
## Acknowledgements
Our code borrows from the ["Deep ViT Features as Dense Visual Descriptors"](https://dino-vit-features.github.io/) repository for ViT feature extraction, and takes inspiration from the [UniverSeg](https://github.com/JJGO/UniverSeg) repository for code structure.