# :monkey: simages :monkey:
[![PyPI version](https://badge.fury.io/py/simages.svg)](https://badge.fury.io/py/simages) [![Build Status](https://travis-ci.com/justinshenk/simages.svg?branch=master)](https://travis-ci.com/justinshenk/simages) [![Documentation Status](https://readthedocs.org/projects/simages/badge/?version=latest)](https://simages.readthedocs.io/en/latest/?badge=latest) [![DOI](https://zenodo.org/badge/188052094.svg)](https://zenodo.org/badge/latestdoi/188052094) [![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/justinshenk/simages/master?filepath=demo.ipynb)
Find similar images within a dataset.
Useful for removing duplicate images from a dataset after scraping images with [google-images-download](https://github.com/hardikvasa/google-images-download).
The Python API returns `pairs, distances`, where `pairs` holds the (ordered) closest pairs of images and `distances` holds the
corresponding embedding distances.
### Install
See the [installation docs](https://simages.readthedocs.io/en/latest/install.html) for all details.
```bash
pip install simages
```
or install from source:
```bash
git clone https://github.com/justinshenk/simages
cd simages
pip install .
```
To install the interactive interface, [install mongodb](https://docs.mongodb.com/manual/installation/) and install with `pip install "simages[all]"` instead.
### Demo
1. Minimal command-line interface with ```simages-show```:
![simages_demo](images/simages_demo.gif)
2. Interactive image deletion with ```simages add/find```:
![simages_web_demo](images/screenshot_server.png)
### Usage
Two interfaces exist:
1. minimal interface which plots the duplicates for visual inspection
2. mongodb + flask interface which allows interactive deletion [optional]
#### Minimal Interface
In your console, enter the directory with images and use `simages-show`:
```bash
$ simages-show --data-dir .
```
```
usage: simages-show [-h] [--data-dir DATA_DIR] [--show-train]
                    [--epochs EPOCHS] [--num-channels NUM_CHANNELS]
                    [--pairs PAIRS] [--zdim ZDIM] [-s]

  -h, --help            show this help message and exit
  --data-dir DATA_DIR, -d DATA_DIR
                        Folder containing image data
  --show-train, -t      Show training of embedding extractor every epoch
  --epochs EPOCHS, -e EPOCHS
                        Number of passes of dataset through model for
                        training. More is better but takes more time.
  --num-channels NUM_CHANNELS, -c NUM_CHANNELS
                        Number of channels for data (1 for grayscale, 3 for
                        color)
  --pairs PAIRS, -p PAIRS
                        Number of pairs of images to show
  --zdim ZDIM, -z ZDIM  Compression bits (bigger generally performs better but
                        takes more time)
  -s, --show            Show closest pairs
```
#### Web Interface [Optional]
Note: To install the web interface API, [install and run mongodb](https://docs.mongodb.com/manual/installation/) and use `pip install "simages[all]"` to install optional dependencies.
Add your pictures to the database (this will take some time, depending on the number of pictures):
```
simages add <images_folder_path>
```
Then search for duplicates. A webpage will open showing all of the similar or duplicate pictures:
```
simages find <images_folder_path>
```
```
Usage:
    simages add <path> ... [--db=<db_path>] [--parallel=<num_processes>]
    simages remove <path> ... [--db=<db_path>]
    simages clear [--db=<db_path>]
    simages show [--db=<db_path>]
    simages find <path> [--print] [--delete] [--match-time] [--trash=<trash_path>] [--db=<db_path>] [--epochs=<epochs>]
    simages -h | --help
Options:
    -h, --help                  Show this screen
    --db=<db_path>              The location of the database or a MongoDB URI. (default: ./db)
    --parallel=<num_processes>  The number of parallel processes to run to hash the image
                                files (default: number of CPUs).
    find:
        --print                 Only print duplicate files rather than displaying HTML file
        --delete                Move all found duplicate pictures to the trash. This option takes priority over --print.
        --match-time            Adds the extra constraint that duplicate images must have the
                                same capture times in order to be considered.
        --trash=<trash_path>    Where files will be put when they are deleted (default: ./Trash)
        --epochs=<epochs>       Epochs for training [default: 2]
```
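When scripting the web interface, it can help to assemble the argument list programmatically. The sketch below builds a `simages find` invocation using only the flags listed in the usage text above. The helper name `build_find_command` and its parameter names are hypothetical and not part of simages; the command is only constructed here, not executed.

```python
import shlex
from typing import List, Optional


def build_find_command(
    path: str,
    print_only: bool = False,
    delete: bool = False,
    match_time: bool = False,
    trash: Optional[str] = None,
    db: Optional[str] = None,
    epochs: Optional[int] = None,
) -> List[str]:
    """Assemble an argument list for `simages find` (hypothetical helper)."""
    cmd = ["simages", "find", path]
    # Boolean switches are appended only when requested.
    if print_only:
        cmd.append("--print")
    if delete:
        cmd.append("--delete")
    if match_time:
        cmd.append("--match-time")
    # Valued options use the --name=value form shown in the usage text.
    if trash is not None:
        cmd.append(f"--trash={trash}")
    if db is not None:
        cmd.append(f"--db={db}")
    if epochs is not None:
        cmd.append(f"--epochs={epochs}")
    return cmd


cmd = build_find_command("my_images_folder", print_only=True, epochs=5)
print(shlex.join(cmd))  # simages find my_images_folder --print --epochs=5
```

With simages and MongoDB installed, the resulting list could be passed to `subprocess.run(cmd, check=True)`.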
### Python APIs
#### Numpy array
```python
from simages import find_duplicates
import numpy as np
array_data = np.random.random((100, 3, 48, 48))  # N x C x H x W (shape must be passed as a tuple)
pairs, distances = find_duplicates(array_data)
```
#### Folder
```python
from simages import find_duplicates
data_dir = "my_images_folder"
pairs, distances = find_duplicates(data_dir)
```
Default options for `find_duplicates` are:
```python
from typing import Union

import numpy as np


def find_duplicates(
    input: Union[str, np.ndarray],
    n: int = 5,
    num_epochs: int = 2,
    num_channels: int = 3,
    show: bool = False,
    show_train: bool = False,
    **kwargs
):
    """Find duplicates in dataset. `input` is either a folder path or an array.

    Args:
        input (str or np.ndarray): folder directory or N x C x H x W array
        n (int): number of closest pairs to identify
        num_epochs (int): how long to train the autoencoder (more is generally better)
        num_channels (int): grayscale = 1, color = 3
        show (bool): display the closest pairs
        show_train (bool): show output every epoch
        z_dim (int): size of compression (more is generally better, but slower)
        kwargs (dict): etc, passed to `EmbeddingExtractor`

    Returns:
        pairs (np.ndarray): indices for closest pairs of images, n x 2 array
        distances (np.ndarray): distances of each pair to each other
    """
```
#### `Embeddings` API
```python
from simages import Embeddings
import numpy as np
N = 1000
data = np.random.random((N, 28, 28))
embeddings = Embeddings(data)
# Access the array
array = embeddings.array # N x z (compression size)
# Get 5 closest pairs of images
pairs, distances = embeddings.duplicates(n=5)
```
```python
In [0]: pairs
Out[0]: array([[912, 990], [716, 790], [907, 943], [483, 492], [806, 883]])
In [1]: distances
Out[1]: array([0.00148035, 0.00150703, 0.00158789, 0.00168699, 0.00168721])
```
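The returned indices and distances can be post-processed with ordinary Python. The following sketch takes the sample output above as plain lists and keeps only the pairs whose distance falls at or below a cutoff. The function name `filter_pairs` and the threshold value are illustrative assumptions, not part of simages; a suitable cutoff depends on the dataset and the trained embedding.

```python
from typing import List, Sequence, Tuple

# Sample output from above, written as plain lists for illustration.
pairs = [[912, 990], [716, 790], [907, 943], [483, 492], [806, 883]]
distances = [0.00148035, 0.00150703, 0.00158789, 0.00168699, 0.00168721]


def filter_pairs(
    pairs: Sequence[Sequence[int]],
    distances: Sequence[float],
    max_distance: float,
) -> List[Tuple[int, int, float]]:
    """Keep (i, j, distance) for pairs at or below `max_distance` (hypothetical helper)."""
    return [
        (int(i), int(j), float(d))
        for (i, j), d in zip(pairs, distances)
        if d <= max_distance
    ]


# The cutoff 0.0016 is an arbitrary value chosen for this example.
close = filter_pairs(pairs, distances, max_distance=0.0016)
for i, j, d in close:
    print(f"images {i} and {j}: distance {d:.8f}")
```

Because the function only iterates and indexes, it should work on the NumPy arrays returned by the API as well as on lists.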
#### `EmbeddingExtractor` API
```python
from simages import EmbeddingExtractor
import numpy as np
N = 1000
data = np.random.random((N, 28, 28))
extractor = EmbeddingExtractor(data, num_channels=1) # grayscale
# Show 10 closest pairs of images
pairs, distances = extractor.show_duplicates(n=10)
```
Class attributes and parameters:
```python
from typing import Union

import numpy as np


class EmbeddingExtractor:
    """Extract embeddings from data with models and allow visualization.

    Attributes:
        trainloader (torch loader)
        evalloader (torch loader)
        model (torch.nn.Module)
        embeddings (np.ndarray)
    """

    def __init__(
        self,
        input: Union[str, np.ndarray],
        num_channels=None,
        num_epochs=2,
        batch_size=32,
        show_train=True,
        show=False,
        z_dim=8,
        **kwargs,
    ):
        """Inits EmbeddingExtractor with input, either `str` or `np.ndarray`, performs training and validation.

        Args:
            input (np.ndarray or str): data
            num_channels (int): grayscale = 1, color = 3
            num_epochs (int): more is better (generally)
            batch_size (int): number of images per batch
            show_train (bool): show intermediate training results
            show (bool): show closest pairs
            z_dim (int): compression size
            kwargs (dict)
        """
```
Specify the number of pairs to identify with the parameter `n`.
### How it works
*simages* uses a convolutional autoencoder with PyTorch and compares the latent representations with [closely](https://github.com/justinshenk/closely) :triangular_ruler:.
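To make the comparison step concrete, here is a minimal brute-force sketch of finding the `n` closest pairs among a set of embedding vectors, using only the standard library. It illustrates the idea only and is not how closely is implemented; the function name `closest_pairs`, the toy two-dimensional vectors, and the choice of Euclidean distance are all assumptions made for this example.

```python
import heapq
import math
from itertools import combinations
from typing import List, Sequence, Tuple


def closest_pairs(
    embeddings: Sequence[Sequence[float]], n: int = 5
) -> Tuple[List[Tuple[int, int]], List[float]]:
    """Return the n closest index pairs and their distances (brute force, hypothetical helper)."""
    # Score every unordered pair (i, j) with i < j by Euclidean distance.
    scored = (
        (math.dist(embeddings[i], embeddings[j]), (i, j))
        for i, j in combinations(range(len(embeddings)), 2)
    )
    # Keep the n smallest distances, in ascending order.
    best = heapq.nsmallest(n, scored)
    pairs = [pair for _, pair in best]
    distances = [dist for dist, _ in best]
    return pairs, distances


# Toy embeddings standing in for latent vectors produced by the autoencoder.
embeddings = [
    (0.0, 0.0),
    (1.0, 1.0),
    (0.0, 0.1),
    (5.0, 5.0),
    (1.0, 1.2),
]
pairs, distances = closest_pairs(embeddings, n=2)
print(pairs)  # [(0, 2), (1, 4)]
```

This version compares every pair, so its cost grows quadratically with the number of images; it is meant for understanding the output format rather than for large datasets.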
#### Dependencies
*simages* depends on the following packages:
- [closely](https://github.com/justinshenk/closely)
- [torch](https://pytorch.org)
- [torchvision](https://pytorch.org)
- scikit-learn
- matplotlib
The following dependencies are required for the interactive deleting interface:
- pymongo
- fastcluster
- flask
- jinja2
- dnspython
- python-magic
- termcolor
### Cite
If you use simages, please cite it:
```
@misc{justin_shenk_2019_3237830,
author = {Justin Shenk},
title = {justinshenk/simages: v19.0.1},
month = jun,
year = 2019,
doi = {10.5281/zenodo.3237830},
url = {https://doi.org/10.5281/zenodo.3237830}
}
```