snip-dedup

Name: snip-dedup
Version: 0.0.4
Summary: SNIP: compact index for large dataset
Upload time: 2023-03-13 11:50:59
Requires Python: >=3.8
Keywords: computer vision, dataset, deduplicate, index, laion, machine learning, snip
# snip-dedup

[![PyPI - Version](https://img.shields.io/pypi/v/snip-dedup.svg?logo=pypi&label=PyPI&logoColor=gold)](https://pypi.org/project/snip-dedup/)
[![linting - Ruff](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/charliermarsh/ruff/main/assets/badge/v0.json)](https://github.com/charliermarsh/ruff)
[![format - Black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)
[![license - MIT](https://img.shields.io/badge/license-MIT-9400d3.svg)](https://spdx.org/licenses/)

SNIP is a very compact index (25 GB) that has found roughly half a billion duplicates in the LAION-2B-en dataset. You may download the de-duplicated dataset below.

SNIP de-duplicated LAION-2B-en on a standard home computer in just a few days. We believe the community will benefit from such a dataset, in light of recent research showing the copyright and privacy risks of training generative models on highly duplicated datasets, and from SNIP itself as a de-duplication, compression, and retrieval tool.

## Install

```sh
pip install --upgrade snip-dedup
```

## Usage

```sh
# List available commands
snip --help
snip download --help

# Download and deduplicate the 10 first shards of the dataset
snip download --start 0 --end 10
```

Then, you may download the (deduplicated) LAION-2B images with the awesome [img2dataset](https://github.com/rom1504/img2dataset).

You may check the fidelity of the detected duplicates by randomly sampling labeled duplicate pairs and using SNIP to retrieve their matches. You can do this with `retrieve_dup_urls_demo.py` (note that you will need the original metadata files for this).
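A minimal sketch of such a spot-check, assuming the duplicate labels are stored as a boolean numpy array aligned with the metadata rows (the array name, size, and duplicate rate below are made up for illustration):

```python
import numpy as np

# Hypothetical layout: is_dup[i] is True when row i of the metadata
# was flagged as a duplicate by SNIP. We fake it with random data here;
# in practice you would load the released binary array instead.
rng = np.random.default_rng(0)
is_dup = rng.random(1_000_000) < 0.25  # stand-in for the real binary array

# Randomly sample some flagged rows to inspect by hand
# against the original metadata files.
dup_rows = np.flatnonzero(is_dup)
sample = rng.choice(dup_rows, size=10, replace=False)
print(sample)
```

Looking up the sampled row indices in the original metadata gives you the URLs to compare visually.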

## Roadmap

Coming soon, you will also be able to use SNIP to:
- [ ] Train SNIP Indices on your features
- [ ] Download full or sharded SNIP indices for various CLIP networks
- [ ] Do semantic search with extremely compact indices (25 GB or less) on billions of images
- [ ] Compress your features with SNIP descriptors
- [ ] Read our research paper

## About

**Disclaimer:** use at your own risk. Help toward better de-duplication (higher precision, higher recall) is very much appreciated. Taking raw CLIP features as the ground truth for exact duplicates, we get nearly 81% precision (and likely much higher for near duplicates, see below).

We release this index for public use and exploration of the LAION-2B-en dataset (more indices coming soon). Soon we will release tools to train your own SNIP indices as well as our scientific paper discussing the method in more detail.

You may find the following necessary files here:

[Binary array of De-duplicated Images](https://drive.google.com/file/d/1RYDylZKaPyaVs5YNwIrGqHU2BewdFwxY/view?usp=sharing)

[SNIP index](https://drive.google.com/file/d/1RYDylZKaPyaVs5YNwIrGqHU2BewdFwxY/view?usp=sharing)

[SNIP descriptor](https://drive.google.com/file/d/1QTA9yWevwPMhvMW8P5mAIBDy42xUpr-m/view?usp=share_link)

Other:

[cumulative sizes of features (for indexing sharded files)](https://drive.google.com/file/d/1OdVt5rjYw55XfMhsQSdqcVOP7lG2qj4W/view?usp=sharing)
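One plausible use of that cumulative-sizes array is mapping a global row index back to its shard and local offset; a sketch of that lookup with `np.searchsorted` (the array contents below are made up, not the real shard sizes):

```python
import numpy as np

# Hypothetical cumulative sizes: shard 0 holds rows [0, 3),
# shard 1 holds [3, 8), shard 2 holds [8, 15), shard 3 holds [15, 20).
cum_sizes = np.array([3, 8, 15, 20])

def locate(global_idx: int) -> tuple[int, int]:
    """Map a global row index to (shard, local index within that shard)."""
    shard = int(np.searchsorted(cum_sizes, global_idx, side="right"))
    start = 0 if shard == 0 else int(cum_sizes[shard - 1])
    return shard, global_idx - start

print(locate(0))   # (0, 0)
print(locate(5))   # (1, 2)
print(locate(19))  # (3, 4)
```

`side="right"` ensures that an index equal to a cumulative boundary lands in the next shard, matching half-open shard ranges.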

## Finding images overfit by Stable Diffusion

By analyzing the most duplicated images, we have found several more images verbatim copied by Stable Diffusion, posing a copyright problem:

![sylvester stallone](https://github.com/ryanwebster90/snip-dedup/blob/main/sylvester_overfit.jpeg)
![hopped up logo](https://github.com/ryanwebster90/snip-dedup/blob/main/overfit_2.jpeg)


## Note on False positives
We noticed that many images labeled as duplicates by SNIP but not by the raw features are in fact near duplicates, for example:

![Chess1](https://en.chessok.net/uploads/posts/2017-09/1506718434_knight-on-the-left-1.nc3.jpg)
![Chess2](https://m.media-amazon.com/images/I/51jNRpWUCjL.jpg)

You may check a list of (randomly sampled) detected duplicate pairs [here](https://docs.google.com/spreadsheets/d/1Eq46U3MbTXzNoLCvnHLcw64X3bWE3ZE8zMJVQU9_gCg/edit?usp=sharing).


## Semantic Search

SNIP can also be used for semantic search. At just 25 GB, it can still return the same k-NN results as an exhaustive search roughly a third of the time, over 2.15B database vectors.
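For context, the exhaustive baseline that such a compact index approximates is a brute-force k-NN over unit-normalized features; a toy numpy version (the sizes, dimensionality, and data below are made up, not SNIP's actual features):

```python
import numpy as np

rng = np.random.default_rng(0)
db = rng.standard_normal((10_000, 64)).astype(np.float32)
db /= np.linalg.norm(db, axis=1, keepdims=True)  # unit-normalize CLIP-like features

# A query that is a slightly perturbed copy of database row 123.
query = db[123] + 0.01 * rng.standard_normal(64).astype(np.float32)
query /= np.linalg.norm(query)

# On unit vectors, cosine similarity reduces to a dot product.
scores = db @ query
top_k = np.argsort(-scores)[:5]
print(top_k)  # nearest neighbors; row 123 should rank first
```

An exhaustive scan like this is exact but touches every database vector per query, which is exactly what a compact index avoids at scale.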

## Contribute

This Python project uses the [`hatch`][hatch] project manager.
Dependencies are specified inside the `pyproject.toml` file, and build configs inside the `hatch.toml` file.
As such you can enter the isolated development environment with `hatch shell` from inside the repository.

To avoid silly mistakes, the code is checked with [pyright][pyright].
To ensure a consistent styling, all python code is formatted with [black][black] and we use the [ruff][ruff] linter.
Once you have installed them, you can check that the code is consistent with:

```sh
hatch run check  # check for mistakes via static analysis
hatch run format # check formatting of all python files
hatch run lint   # check linting rules
```

TODO: check pyright, formatting and linter in CI

- [ ] CI
- [ ] check max file size on CI to prevent pushing data
- [ ] add docs, following the [numpy docstring standard](https://numpydoc.readthedocs.io/en/latest/format.html)
- [ ] auto-publish GitHub action, e.g. following [this example](https://github.com/ofek/hatch-showcase/blob/master/.github/workflows/build.yml)
- [ ] add tests?

[hatch]: https://github.com/pypa/hatch
[pyright]: https://github.com/microsoft/pyright
[black]: https://github.com/psf/black
[ruff]: https://github.com/charliermarsh/ruff

            
