<!-- #  DreamSim Perceptual Metric -->
<!-- # DreamSim Perceptual Metric <img src="images/figs/icon.png" align="left" width="50px"/> -->
# DreamSim: Learning New Dimensions of Human Visual Similarity using Synthetic Data
### [Project Page](https://dreamsim-nights.github.io/) | [Paper](https://arxiv.org/abs/2306.09344) | [Bibtex](#bibtex)
[Open in Colab](https://colab.research.google.com/drive/1taEOMzFE9g81D9AwH27Uhy2U82tQGAVI?usp=sharing)
[Stephanie Fu](https://stephanie-fu.github.io)\* $^{1}$, [Netanel Tamir](https://netanel-tamir.github.io)\* $^{2}$, [Shobhita Sundaram](https://ssundaram21.github.io)\* $^{1}$, [Lucy Chai](https://people.csail.mit.edu/lrchai/) $^1$, [Richard Zhang](http://richzhang.github.io) $^3$, [Tali Dekel](https://www.weizmann.ac.il/math/dekel/) $^2$, [Phillip Isola](https://web.mit.edu/phillipi/) $^1$.<br>
(*equal contribution, order decided by random seed)<br>
$^1$ MIT, $^2$ Weizmann Institute of Science, $^3$ Adobe Research.

**Summary**
Current metrics for perceptual image similarity operate at the level of pixels and patches. These metrics compare images in terms of their low-level colors and textures, but fail to capture mid-level differences in layout, pose, semantic content, etc. Models that use image-level embeddings such as DINO and CLIP capture high-level and semantic judgements, but may not be aligned with human perception of more fine-grained attributes.
DreamSim is a new metric for perceptual image similarity that bridges the gap between "low-level" metrics (e.g. LPIPS, PSNR, SSIM) and "high-level" measures (e.g. CLIP). Our model was trained by concatenating CLIP, OpenCLIP, and DINO embeddings, and then finetuning on human perceptual judgements. We gathered these judgements on a dataset of ~20k image triplets, generated by diffusion models. Our model achieves better alignment with human similarity judgements than existing metrics, and can be used for downstream applications such as image retrieval.
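For intuition, here is a minimal sketch of the kind of triplet objective this setup implies (illustrative only, not the released training code): each human judgement says which of two images is closer to a reference, and the finetuned embedding is pushed to agree. The margin value, function names, and the `vote` encoding below are assumptions.
```
# Minimal sketch of a triplet-style perceptual objective (illustrative only).
# `embed_*` stand for embeddings from the finetuned ensemble backbone.
import torch
import torch.nn.functional as F

def perceptual_distance(embed_a, embed_b):
    # Distance = 1 - cosine similarity between image embeddings.
    return 1 - F.cosine_similarity(embed_a, embed_b, dim=-1)

def triplet_hinge_loss(embed_ref, embed_0, embed_1, vote, margin=0.05):
    # vote = +1 if humans judged image 0 closer to the reference, -1 if image 1 was closer.
    d0 = perceptual_distance(embed_ref, embed_0)
    d1 = perceptual_distance(embed_ref, embed_1)
    # Encourage the human-preferred image to be closer by at least `margin`.
    return torch.clamp(margin - vote * (d1 - d0), min=0).mean()
```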
## 🚀 Newest Updates
**10/14/24:** We released 4 new variants of DreamSim! These new checkpoints are:
- DINOv2 B/14 and SynCLR B/16 as backbones
- Models trained with the original contrastive loss on both CLS and dense features.
These models (and the originals) are further evaluated in **our new NeurIPS 2024 paper, When Does Perceptual Alignment Benefit Vision Representations?**
We find that our perceptually-aligned models outperform the baseline models on a variety of standard computer vision tasks, including
semantic segmentation, depth estimation, object counting, instance retrieval, and retrieval-augmented generation. These results
point towards perceptual alignment being a useful task for learning general-purpose vision representations. See the paper and our [blog post](https://percep-align.github.io)
for more details.
Here's how they perform on NIGHTS:
| Model | NIGHTS - Val | NIGHTS - Test |
|-------------------|--------------|---------------|
| `ensemble` | 96.9% | 96.2% |
| `dino_vitb16` | 95.6% | 94.8% |
| `open_clip_vitb32` | 95.6% | 95.3% |
| `clip_vitb32`| 94.9% | 93.6% |
| `dinov2_vitb14` | 94.9% | 95.0% |
| `synclr_vitb16`| 96.0% | 95.9% |
| `dino_vitb16 (patch)` | 94.9% | 94.8% |
| `dinov2_vitb14 (patch)`| 95.5% | 95.1% |
**9/14/24:** We released new versions of the ensemble and single-branch DreamSim models compatible with `peft>=0.2.0`.
We also released the entire [100k (unfiltered) NIGHTS dataset](#new-download-the-entire-100k-pre-filtered-nights-dataset) and the [JND (Just-Noticeable Difference) votes](#new-download-the-jnd-data).
## Table of Contents
* [Requirements](#requirements)
* [Setup](#setup)
* [Usage](#usage)
* [Quickstart](#quickstart-perceptual-similarity-metric)
* [Single-branch models](#single-branch-models)
* [Feature extraction](#feature-extraction)
* [Image retrieval](#image-retrieval)
* [Perceptual loss function](#perceptual-loss-function)
* [NIGHTS Dataset](#nights-novel-image-generations-with-human-tested-similarities-dataset)
* [Experiments](#experiments)
* [Citation](#citation)
## Requirements
- Linux
- Python 3
## Setup
**Option 1:** Install using pip:
```pip install dreamsim```
This option is sufficient if you only need to import and use the DreamSim model.
**Option 2:** Clone our repo and install dependencies.
This is necessary for running our training/evaluation scripts.
```
python3 -m venv ds
source ds/bin/activate
pip install -r requirements.txt
export PYTHONPATH="$PYTHONPATH:$(realpath ./dreamsim)"
```
To install with conda:
```
conda create -n ds
conda activate ds
conda install pip # verify with the `which pip` command
pip install -r requirements.txt
export PYTHONPATH="$PYTHONPATH:$(realpath ./dreamsim)"
```
## Usage
**For walk-through examples of the below use-cases, check out our [Colab demo](https://colab.research.google.com/drive/1taEOMzFE9g81D9AwH27Uhy2U82tQGAVI?usp=sharing).**
### Quickstart: Perceptual similarity metric
The basic use case is to measure the perceptual distance between two images. **A higher score means the images are more different; a lower score means they are more similar.**
The following code snippet is all you need. The first time you run `dreamsim`, it automatically downloads the model weights. The default model settings are specified in `./dreamsim/config.py`.
```
from dreamsim import dreamsim
from PIL import Image
device = "cuda"
model, preprocess = dreamsim(pretrained=True, device=device)
img1 = preprocess(Image.open("img1_path")).to(device)
img2 = preprocess(Image.open("img2_path")).to(device)
distance = model(img1, img2) # The model takes RGB images in [0, 1] with shape (batch_size, 3, 224, 224)
```
To run on example images, run `demo.py`. The script should produce distances (0.4453, 0.2756).
### Single-branch models
By default, DreamSim uses an ensemble of CLIP, DINO, and OpenCLIP (all ViT-B/16). If you need a lighter-weight model, you can use *single-branch* versions of DreamSim where only a single backbone is finetuned. **The single-branch models provide a ~3x speedup over the ensemble.**
The available options are OpenCLIP-ViTB/32, DINO-ViTB/16, CLIP-ViTB/32, in order of performance. To load a single-branch model, use the `dreamsim_type` argument. For example:
```
dreamsim_dino_model, preprocess = dreamsim(pretrained=True, dreamsim_type="dino_vitb16")
```
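The returned model and preprocessor are then used exactly like the ensemble in the quickstart above; for example (paths are placeholders, assuming the default CUDA device):
```
img1 = preprocess(Image.open("img1_path")).to("cuda")
img2 = preprocess(Image.open("img2_path")).to("cuda")
distance = dreamsim_dino_model(img1, img2)
```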
### Feature extraction
To extract a *single image embedding* using dreamsim, use the `embed` method as shown in the following snippet:
```
img1 = preprocess(Image.open("img1_path")).to("cuda")
embedding = model.embed(img1)
```
The perceptual distance between two images is the cosine distance between their embeddings. If the embeddings are normalized (true by default), L2 distance can also be used.
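For example, the quickstart distance can be reproduced from the embeddings directly. A minimal sketch, assuming the default normalized embeddings and reusing `img1`/`img2` from the quickstart:
```
import torch.nn.functional as F

emb1 = model.embed(img1)
emb2 = model.embed(img2)

# Cosine distance between embeddings reproduces model(img1, img2).
cosine_dist = 1 - F.cosine_similarity(emb1, emb2, dim=-1)

# With unit-norm embeddings, squared L2 distance equals 2 * cosine distance,
# so ranking by L2 distance gives the same ordering.
l2_dist = (emb1 - emb2).norm(dim=-1)
```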
### Image retrieval
Our model can be used for image retrieval and plugged into existing retrieval pipelines. The code below ranks a dataset of images by their similarity to a given query image.
To speed things up, instead of directly calling `model(query, image)` for each pair, we use the `model.embed(image)` method to pre-compute single-image embeddings, and then take the cosine distance between embedding pairs.
```
import pandas as pd
from tqdm import tqdm
import torch.nn.functional as F

# let query be a sample image.
# let images be a list of images we are searching.

# Compute the query image embedding
query_embed = model.embed(preprocess(query).to("cuda"))
dists = {}

# Compute the (cosine) distance between the query and each search image
for i, im in tqdm(enumerate(images), total=len(images)):
    img_embed = model.embed(preprocess(im).to("cuda"))
    dists[i] = (1 - F.cosine_similarity(query_embed, img_embed, dim=-1)).item()

# Collect the results sorted by distance (most similar first)
df = pd.DataFrame({"ids": list(dists.keys()), "dists": list(dists.values())})
results = df.sort_values(by="dists")
```
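For larger galleries, the same idea can be batched: cache all gallery embeddings once and score them against the query in a single operation. A minimal sketch of this variant (our addition, assuming `model.embed` returns a `[1, dim]` tensor, as it is used element-wise in the loop above):
```
import torch
import torch.nn.functional as F

# Pre-compute and cache all gallery embeddings (shape: [num_images, dim]).
gallery = torch.cat([model.embed(preprocess(im).to("cuda")) for im in images], dim=0)

# Score the whole gallery against the query at once.
query_embed = model.embed(preprocess(query).to("cuda"))
dists = 1 - F.cosine_similarity(query_embed, gallery, dim=-1)

# Indices of the most similar images first.
ranking = dists.argsort()
```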
### Perceptual loss function
Our model can be used as a loss function for iterative optimization (similarly to the LPIPS metric). These are the key lines; for the full example, refer to the [Colab](https://colab.research.google.com/drive/1taEOMzFE9g81D9AwH27Uhy2U82tQGAVI?usp=sharing).
```
for i in range(n_iters):
    dist = model(predicted_image, reference_image)
    dist.backward()
    optimizer.step()
```
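For context, here is a slightly fuller sketch of such a loop with the setup lines the excerpt omits; the initialization, learning rate, and variable names are assumptions, not the Colab's exact code:
```
import torch

# Start from some initial image and optimize it toward the reference in DreamSim space.
predicted_image = initial_image.clone().requires_grad_(True)  # shape (1, 3, 224, 224), values in [0, 1]
optimizer = torch.optim.Adam([predicted_image], lr=0.01)

for i in range(n_iters):
    optimizer.zero_grad()
    dist = model(predicted_image, reference_image)  # DreamSim distance used as the loss
    dist.backward()
    optimizer.step()
```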
<a name="bibtex"></a>
## Citation
If you find our work or any of our materials useful, please cite our paper:
```
@misc{fu2023dreamsim,
title={DreamSim: Learning New Dimensions of Human Visual Similarity using Synthetic Data},
author={Stephanie Fu and Netanel Tamir and Shobhita Sundaram and Lucy Chai and Richard Zhang and Tali Dekel and Phillip Isola},
year={2023},
eprint={2306.09344},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
```
## Acknowledgements
Our code borrows from the ["Deep ViT Features as Dense Visual Descriptors"](https://dino-vit-features.github.io/) repository for ViT feature extraction, and takes inspiration from the [UniverSeg](https://github.com/JJGO/UniverSeg) repository for code structure.