- Name: duplicate-images
- Version: 0.8.1
- Home page: https://github.com/lene/DuplicateImages
- Summary: Finds equal or similar images in a directory containing (many) image files
- Upload time: 2023-08-15 09:47:22
- Author: Lene Preuss
- Requires Python: >=3.8.5,<4.0.0

# Finding Duplicate Images

Finds equal or similar images in a directory containing (many) image files.

Official home page: https://github.com/lene/DuplicateImages

Development page: https://gitlab.com/lilacashes/DuplicateImages

PyPI page: https://pypi.org/project/duplicate-images

## Usage

Installing:
```shell
$ pip install duplicate_images
```

Printing the help screen:
```shell
$ find-dups -h
```

Quick test run:
```shell
$ find-dups $IMAGE_ROOT 
```

Typical usage:
```shell
$ find-dups $IMAGE_ROOT --parallel --progress --hash-db hashes.pickle
```

### Supported image formats

* JPEG and PNG (tested quite thoroughly)
* HEIC (experimental support, tested cursorily only)

### Image comparison algorithms

Use the `--algorithm` option to select how equal images are found. The default algorithm is `phash`.

`ahash`, `colorhash`, `dhash`, `dhash_vertical`, `phash`, `phash_simple`, `whash`: seven different 
image hashing algorithms. See https://pypi.org/project/ImageHash for an introduction to image 
hashing and https://tech.okcupid.com/evaluating-perceptual-image-hashes-okcupid for some gory 
details on which image hashing algorithm performs best in which situation. To start, I recommend 
using `phash`, and evaluating the other algorithms only if `phash` does not perform satisfactorily 
in your use case.

### Image similarity threshold configuration

Use the `--max-distance` parameter to tune how close images should be to be considered duplicates.
The argument is a positive integer. Its value is highly dependent on the algorithm used and the 
nature of the images compared, so the best value for your use case can only be found through 
experimentation.
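
To make the notion of distance concrete, here is a minimal sketch. It assumes that the distance between two image hashes is the number of bits in which they differ (the Hamming distance), which is how the ImageHash library compares hashes. The function names and the sample hash values are hypothetical and not taken from `find-dups`.

```python
def hamming_distance(hash_a: int, hash_b: int) -> int:
    """Count the bit positions in which two equal-length hashes differ."""
    return bin(hash_a ^ hash_b).count("1")


def is_match(hash_a: int, hash_b: int, max_distance: int) -> bool:
    """Treat two images as duplicates if their hashes differ in at most max_distance bits."""
    return hamming_distance(hash_a, hash_b) <= max_distance


# Two made-up 16-bit hashes that differ in exactly two bit positions.
first = 0b1010_1100_1111_0000
second = 0b1010_1100_1111_0011

print(hamming_distance(first, second))
print(is_match(first, second, max_distance=1))
print(is_match(first, second, max_distance=2))
```

A larger threshold accepts more pairs as duplicates, which is why the useful range depends on how many bits the chosen algorithm produces.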

Use the `--hash-size` parameter to tune the precision of the hashing algorithms. For the `colorhash`
algorithm the hash size is interpreted as the number of bin bits and defaults to 3. For all other
algorithms the hash size defaults to 8. For `whash` it must be a power of 2.
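
As a rough guide to what the hash size means, the sketch below assumes that the non-`colorhash` algorithms produce a square hash of `hash_size` by `hash_size` bits, as the ImageHash library does. Both helper functions are hypothetical and only illustrate the arithmetic and the power-of-2 constraint for `whash`.

```python
def hash_bits(hash_size: int) -> int:
    """Number of bits in a square hash_size x hash_size hash (assumed layout)."""
    return hash_size * hash_size


def is_power_of_two(value: int) -> bool:
    """True for 1, 2, 4, 8, ...: a power of two has exactly one bit set."""
    return value > 0 and (value & (value - 1)) == 0


print(hash_bits(8))          # bits in a hash with the default size of 8
print(is_power_of_two(8))    # a size that satisfies the whash constraint
print(is_power_of_two(12))   # a size that does not
```

Because the number of bits grows with the square of the hash size, a `--max-distance` value tuned for one hash size will usually need to be re-tuned for another.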

### Actions for matching image pairs

Use the `--on-equal` option to select what to do to pairs of equal images. The default action is 
`print`.
- `delete-first` or `d1`: deletes the first of the two files
- `delete-second` or `d2`: deletes the second of the two files
- `delete-bigger` or `d>`: deletes the file with the bigger size
- `delete-smaller` or `d<`: deletes the file with the smaller size
- `eog`: launches the `eog` image viewer to compare the two files (*deprecated* by `exec`)
- `xv`: launches the `xv` image viewer to compare the two files (*deprecated* by `exec`)
- `print`: prints the two files
- `print_inline`: like `print` but without newline
- `quote`: prints the two files quoted for POSIX shells
- `quote_inline`: like `quote` but without newline
- `exec`: executes a command (see `--exec` argument)
- `none`: does nothing.
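
The delete actions can be pictured with the following sketch, which only decides which file of a pair would be removed and never deletes anything. The function name `choose_deletion` and the `size_of` parameter are hypothetical, and the behaviour for two files of equal size is an assumption of this sketch, not something this README specifies.

```python
import os
from typing import Callable


def choose_deletion(first: str, second: str, action: str,
                    size_of: Callable[[str], int] = os.path.getsize) -> str:
    """Return the path that the given delete action would remove (nothing is deleted)."""
    if action in ("delete-first", "d1"):
        return first
    if action in ("delete-second", "d2"):
        return second
    if action in ("delete-bigger", "d>"):
        # Assumption: on a size tie this sketch picks the first file.
        return first if size_of(first) >= size_of(second) else second
    if action in ("delete-smaller", "d<"):
        # Assumption: on a size tie this sketch picks the first file.
        return first if size_of(first) <= size_of(second) else second
    raise ValueError(f"not a delete action: {action}")


# Made-up file sizes in bytes, so the example runs without real image files.
sizes = {"a.jpg": 2048, "b.jpg": 1024}
print(choose_deletion("a.jpg", "b.jpg", "d>", size_of=sizes.__getitem__))
print(choose_deletion("a.jpg", "b.jpg", "d<", size_of=sizes.__getitem__))
```

Running `find-dups` with the default `print` action first is a cheap way to review the pairs before choosing a delete action.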

The `--exec` argument allows calling another program when the `--on-equal exec` option is given.\
You can pass a command line string like `--exec "program {1} {2}"` where `{1}` and `{2}` are replaced by the matching pair files.

**Examples**\
`--exec "open -a Preview -W {1} {2}"`: opens the files in the macOS Preview app and waits for it to exit.
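
The placeholder substitution can be sketched as follows. This is an illustration under assumptions, not the code `find-dups` uses: the function name `build_command` is hypothetical, and quoting the file names with `shlex.quote` is a choice made here so that paths with spaces survive a POSIX shell.

```python
import re
import shlex


def build_command(template: str, first: str, second: str) -> str:
    """Replace {1} and {2} in the template with the shell-quoted file names."""
    files = {"1": first, "2": second}
    # A single pass over the template, so a file name that itself
    # contains "{2}" is not substituted a second time.
    return re.sub(r"\{([12])\}", lambda match: shlex.quote(files[match.group(1)]), template)


print(build_command("program {1} {2}", "a.jpg", "my b.jpg"))
```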

### Parallel execution

Use the `--parallel` option to utilize all free cores on your system. 

### Serial execution

`find-dups` can also use an alternative algorithm which is O(N<sup>2</sup>) in the number of images.
Use the `--serial` option to use this alternative algorithm. 
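
To illustrate what O(N<sup>2</sup>) means here, the sketch below contrasts comparing every pair of images with grouping images by their hash value. This README does not describe either algorithm in detail, so both functions are hypothetical and only cover exact hash matches.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, List, Tuple


def pairs_quadratic(hashes: Dict[str, int]) -> List[Tuple[str, str]]:
    """Compare every pair of files: N * (N - 1) / 2 comparisons."""
    return [(a, b) for a, b in combinations(sorted(hashes), 2) if hashes[a] == hashes[b]]


def pairs_grouped(hashes: Dict[str, int]) -> List[Tuple[str, str]]:
    """Group files by hash value in one pass, then pair up files within each group."""
    groups: Dict[int, List[str]] = defaultdict(list)
    for path in sorted(hashes):
        groups[hashes[path]].append(path)
    return [pair for group in groups.values() for pair in combinations(group, 2)]


# Made-up hash values: a.jpg and c.jpg share a hash, b.jpg does not.
example = {"a.jpg": 0xF0F0, "b.jpg": 0x1234, "c.jpg": 0xF0F0}
print(pairs_quadratic(example))
print(pairs_grouped(example))
```

Grouping only finds identical hashes, whereas the pairwise loop could be extended to accept any pair within a maximum distance.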

### Progress and verbosity control

- `--progress` prints one progress bar for reading the images and another for finding duplicates 
  among the scanned images
- `--debug` prints debugging output
- `--quiet` decreases the log level by 1 each time it is given; `--debug` and `--quiet` cancel
  each other out

### Pre-storing and using image hashes to speed up computation

Use the `--hash-db $PICKLE_FILE` option to store image hashes in the file `$PICKLE_FILE` and read
image hashes from that file if they are already present there. This avoids having to compute the 
image hashes anew at every run and can significantly speed up run times.
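
The idea behind the hash database can be sketched like this. The layout of the real pickle file is not documented here, so the sketch assumes a plain dictionary from file path to hash, and it uses a SHA-256 digest as a stand-in for a perceptual hash. All function names are hypothetical.

```python
import hashlib
import pickle
import tempfile
from pathlib import Path
from typing import Callable, Dict, Iterable


def placeholder_hash(path: Path) -> str:
    """Stand-in for a perceptual hash: a SHA-256 digest of the file contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def load_cache(pickle_file: Path) -> Dict[str, str]:
    """Read previously stored hashes, or start empty if the file does not exist yet."""
    if pickle_file.is_file():
        with pickle_file.open("rb") as stream:
            return pickle.load(stream)
    return {}


def hashes_with_cache(files: Iterable[Path], pickle_file: Path,
                      compute: Callable[[Path], str] = placeholder_hash) -> Dict[str, str]:
    """Compute hashes only for files missing from the cache, then store the cache again."""
    cache = load_cache(pickle_file)
    for path in files:
        key = str(path)
        if key not in cache:
            cache[key] = compute(path)
    with pickle_file.open("wb") as stream:
        pickle.dump(cache, stream)
    return cache


# Demonstration with a throwaway directory: the second run computes nothing.
calls = []


def counting_hash(path: Path) -> str:
    calls.append(path)
    return placeholder_hash(path)


with tempfile.TemporaryDirectory() as tmp:
    image = Path(tmp) / "a.jpg"
    image.write_bytes(b"not really an image")
    database = Path(tmp) / "hashes.pickle"
    first_run = hashes_with_cache([image], database, counting_hash)
    second_run = hashes_with_cache([image], database, counting_hash)

print(len(calls))
```

Note that unpickling can execute arbitrary code, so a pickle file should only be loaded if it comes from a trusted source.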

## Development notes

Needs Python 3, the Pillow imaging library and the `pillow-heif` HEIF plugin to run, and 
additionally Wand for the test suite.

Uses Poetry for dependency management.

### Installation

From source:
```shell
$ git clone https://gitlab.com/lilacashes/DuplicateImages.git
$ cd DuplicateImages
$ pip3 install poetry
$ poetry install
```

### Running

```shell
$ poetry run find-dups $PICTURE_DIR
```
or
```shell
$ poetry run find-dups -h
```
for a list of all possible options.

### Test suite

Running it all:
```shell
$ poetry run pytest
$ poetry run mypy duplicate_images tests
$ poetry run flake8
$ poetry run pylint duplicate_images tests
```
or simply 
```shell
$ .git_hooks/pre-push
```
Setting the test suite to be run before every push:
```shell
$ cd .git/hooks
$ ln -s ../../.git_hooks/pre-push .
```

### Publishing

There is a job in GitLab CI for publishing to `pypi.org` that runs as soon as a new tag is added, 
which happens automatically whenever an MR is merged. The tag is the same as the `version` in the 
`pyproject.toml` file. Every MR must therefore set a `version` that does not match an already 
existing tag.

To publish the package on PyPI manually:
```shell
$ poetry config repositories.testpypi https://test.pypi.org/legacy/
$ poetry build
$ poetry publish --username $PYPI_USER --password $PYPI_PASSWORD --repository testpypi && \
  poetry publish --username $PYPI_USER --password $PYPI_PASSWORD
```
(obviously assuming here that username and password are the same on PyPI and TestPyPI)

#### Updating GitHub mirror

GitHub is set up as a push mirror in GitLab CI, but mirroring is currently flaky and may not
succeed. 

To push to the GitHub repository manually (assuming the GitHub repository is set up as remote 
`github`):
```shell
$ git checkout master
$ git fetch
$ git pull --rebase
$ git tag  # to check that the latest tag is present
$ git push --tags github master 
```

### Profiling

#### CPU time
To show the top functions by time spent in the function alone:
```shell
$ poetry run python -m cProfile -s tottime ./duplicate_images/duplicate.py \
    --algorithm $ALGORITHM --on-equal none $IMAGE_DIR 2>&1 | head -n 15
```
or, to show the top functions by time spent, including called functions:
```shell
$ poetry run python -m cProfile -s cumtime ./duplicate_images/duplicate.py \
    --algorithm $ALGORITHM --on-equal none $IMAGE_DIR 2>&1 | head -n 15
```
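
For reference, `tottime` is the time spent inside a function itself, while `cumtime` also includes the time spent in the functions it calls. The sketch below shows the difference on two toy functions using the standard `cProfile` and `pstats` modules. The functions `inner`, `outer` and `report` are made up for this example and have nothing to do with the `duplicate_images` code.

```python
import cProfile
import io
import pstats


def inner() -> int:
    """Does the actual work, so it accumulates its own time."""
    return sum(i * i for i in range(200_000))


def outer() -> int:
    """Only delegates: little own time, but its cumulative time includes inner()."""
    return inner()


profiler = cProfile.Profile()
profiler.enable()
outer()
profiler.disable()


def report(sort_key: str, limit: int = 5) -> str:
    """Return the top entries of the profile, sorted by 'tottime' or 'cumtime'."""
    stream = io.StringIO()
    pstats.Stats(profiler, stream=stream).sort_stats(sort_key).print_stats(limit)
    return stream.getvalue()


print(report("tottime"))
print(report("cumtime"))
```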

#### Memory usage
```shell
$ poetry run fil-profile run ./duplicate_images/duplicate.py \
    --algorithm $ALGORITHM --on-equal none $IMAGE_DIR 2>&1
```
This will open a browser window showing the functions using the most memory (see 
https://pypi.org/project/filprofiler for more details).

## Contributors

- Lene Preuss (https://github.com/lene): primary developer
- Mike Reiche (https://github.com/mreiche): support for arbitrary actions, speedups
- https://github.com/beijingjazzpanda: bug fix

            
