selfclean


- Name: selfclean
- Version: 0.0.35
- Summary: A holistic self-supervised data cleaning strategy to detect off-topic samples, near duplicates and label errors.
- Upload time: 2025-01-07 09:11:55
- Requires Python: >=3.6
- License: Attribution-NonCommercial 4.0 International
- Keywords: machine_learning, data_cleaning, datacentric_ai, datacentric, self-supervised learning
- Requirements: tqdm, numpy, matplotlib, pandas, scikit_learn, pytest, torchvision, torchinfo, torchmetrics, einops, black, coverage, darglint, isort, pre-commit, pytest-cov, transformers, seaborn, SciencePlots, scikit-image, codecov, jupyter, loguru, memory-profiler

# 🧼🔎 SelfClean

[![Test and Coverage](https://github.com/Digital-Dermatology/SelfClean/actions/workflows/pytest-coverage.yml/badge.svg)](https://github.com/Digital-Dermatology/SelfClean/actions/workflows/pytest-coverage.yml)

![SelfClean Teaser](https://github.com/Digital-Dermatology/SelfClean/raw/main/assets/SelfClean_Teaser.png)

A holistic self-supervised data cleaning strategy to detect off-topic samples, near duplicates, and label errors.

**Publications:** [SelfClean Paper (NeurIPS24)](https://arxiv.org/abs/2305.17048) | [Data Cleaning Protocol Paper (ML4H23@NeurIPS)](https://arxiv.org/abs/2309.06961)

**NOTE:** Make sure to have `git-lfs` installed before pulling the repository to ensure the pre-trained models are pulled correctly ([git-lfs install instructions](https://docs.github.com/en/repositories/working-with-files/managing-large-files/installing-git-large-file-storage)).

This project is licensed under the terms of the [Creative Commons Attribution-NonCommercial 4.0 International license](https://creativecommons.org/licenses/by-nc/4.0/).

<img src="https://mirrors.creativecommons.org/presskit/icons/cc.svg" alt="cc" width="20"/> <img src="https://mirrors.creativecommons.org/presskit/icons/by.svg" alt="by" width="20"/> <img src="https://mirrors.creativecommons.org/presskit/icons/nc.svg" alt="nc" width="20"/>

## Installation

> Install SelfClean via [PyPI](https://pypi.org/project/selfclean/):

```bash
# upgrade pip to its latest version
pip install -U pip

# install selfclean
pip install selfclean

# alternatively, use an explicit Python version (replace XX)
python3.XX -m pip install selfclean
```
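
To confirm that the installation succeeded, one option is to query the installed version from Python. The sketch below is an assumption rather than part of the SelfClean documentation: it relies only on the standard-library `importlib.metadata` module (available from Python 3.8), and `installed_version` is a hypothetical helper name.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str = "selfclean") -> Optional[str]:
    """Return the installed version of `package`, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version()
    if found is None:
        print("selfclean is not installed in this environment")
    else:
        print(f"selfclean version: {found}")
```

Returning `None` instead of raising keeps the check usable in setup scripts that want to decide whether to install the package.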

## Getting Started

You can run SelfClean in a few lines of code:

```python
import copy

from selfclean import SelfClean

selfclean = SelfClean(
    # displays the top-7 images from each error type
    # (this option is disabled by default)
    plot_top_N=7,
)

# run on a PyTorch dataset (`dataset` is your own dataset object)
issues = selfclean.run_on_dataset(
    dataset=copy.copy(dataset),
)
# alternatively, run on an image folder
issues = selfclean.run_on_image_folder(
    input_path="path/to/images",
)

# get the data quality issue rankings
df_near_duplicates = issues.get_issues("near_duplicates", return_as_df=True)
df_off_topic_samples = issues.get_issues("off_topic_samples", return_as_df=True)
df_label_errors = issues.get_issues("label_errors", return_as_df=True)
```

**Examples:**
In `examples/`, we've provided some example notebooks in which you will learn how to analyze and clean datasets using SelfClean.
These examples analyze different benchmark datasets such as:

- <a href="https://github.com/fastai/imagenette">Imagenette</a> 🖼️ (Open in <a href="https://nbviewer.org/github/Digital-Dermatology/SelfClean/blob/main/examples/Investigate_Imagenette.ipynb">NBViewer</a> | <a href="https://github.com/Digital-Dermatology/SelfClean/blob/main/examples/Investigate_Imagenette.ipynb">GitHub</a> | <a href="https://colab.research.google.com/github/Digital-Dermatology/SelfClean/blob/main/examples/Investigate_Imagenette.ipynb">Colab</a>)
- <a href="https://www.robots.ox.ac.uk/~vgg/data/pets/">Oxford-IIIT Pet</a> 🐶 (Open in <a href="https://nbviewer.org/github/Digital-Dermatology/SelfClean/blob/main/examples/Investigate_OxfordIIITPet.ipynb">NBViewer</a> | <a href="https://github.com/Digital-Dermatology/SelfClean/blob/main/examples/Investigate_OxfordIIITPet.ipynb">GitHub</a> | <a href="https://colab.research.google.com/github/Digital-Dermatology/SelfClean/blob/main/examples/Investigate_OxfordIIITPet.ipynb">Colab</a>)

Also, check out our <a href="https://www.kaggle.com/code/fabiangrger/removing-the-psychic-from-the-dataset">Kaggle notebook</a> to see an illustration of how to get a gold medal for cleaning a competition dataset.

## Development Environment
Run `make` for a list of possible targets.

Run these commands to install the requirements for the development environment:
```bash
make init
make install
```

To run linters on all files:
```bash
pre-commit run --all-files
```

We use the following packages for code and test conventions:
- `black` for code style
- `isort` for import sorting
- `pytest` for running tests

            

Raw data

{
    "_id": null,
    "home_page": null,
    "name": "selfclean",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.6",
    "maintainer_email": null,
    "keywords": "machine_learning, data_cleaning, datacentric_ai, datacentric, self-supervised learning",
    "author": null,
    "author_email": "Fabian Gr\u00f6ger <fabian.groeger@unibas.ch>",
    "download_url": "https://files.pythonhosted.org/packages/00/97/fab90f7b7b029407c66c330990600f83fcd81f88d98120ba0623b2f68299/selfclean-0.0.35.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "Attribution-NonCommercial 4.0 International",
    "summary": "A holistic self-supervised data cleaning strategy to detect off-topic samples, near duplicates and label errors.",
    "version": "0.0.35",
    "project_urls": {
        "Homepage": "https://selfclean.github.io/",
        "Source Code": "https://github.com/Digital-Dermatology/SelfClean"
    },
    "split_keywords": [
        "machine_learning",
        " data_cleaning",
        " datacentric_ai",
        " datacentric",
        " self-supervised learning"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "449fe2cdeefb6694bb8000e1a77735588e46ee3ca11342644479da9a03b0322d",
                "md5": "f27d80299923a4e242e69618c7622d90",
                "sha256": "7457c0a6fbc74fe3f45e206e16d4b87f4300aa6a28ef6b089e4741cecb03a612"
            },
            "downloads": -1,
            "filename": "selfclean-0.0.35-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "f27d80299923a4e242e69618c7622d90",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.6",
            "size": 181780,
            "upload_time": "2025-01-07T09:11:52",
            "upload_time_iso_8601": "2025-01-07T09:11:52.725371Z",
            "url": "https://files.pythonhosted.org/packages/44/9f/e2cdeefb6694bb8000e1a77735588e46ee3ca11342644479da9a03b0322d/selfclean-0.0.35-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "0097fab90f7b7b029407c66c330990600f83fcd81f88d98120ba0623b2f68299",
                "md5": "ea86b97144ab264910d8127a68a27590",
                "sha256": "56d78a42b40e50c9f063df924f18b77896efc905f4fb3362834317c04bea76b2"
            },
            "downloads": -1,
            "filename": "selfclean-0.0.35.tar.gz",
            "has_sig": false,
            "md5_digest": "ea86b97144ab264910d8127a68a27590",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.6",
            "size": 121483,
            "upload_time": "2025-01-07T09:11:55",
            "upload_time_iso_8601": "2025-01-07T09:11:55.775544Z",
            "url": "https://files.pythonhosted.org/packages/00/97/fab90f7b7b029407c66c330990600f83fcd81f88d98120ba0623b2f68299/selfclean-0.0.35.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-01-07 09:11:55",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "Digital-Dermatology",
    "github_project": "SelfClean",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "requirements": [
        {
            "name": "tqdm",
            "specs": []
        },
        {
            "name": "numpy",
            "specs": []
        },
        {
            "name": "matplotlib",
            "specs": []
        },
        {
            "name": "pandas",
            "specs": []
        },
        {
            "name": "scikit_learn",
            "specs": []
        },
        {
            "name": "pytest",
            "specs": []
        },
        {
            "name": "torchvision",
            "specs": []
        },
        {
            "name": "torchinfo",
            "specs": []
        },
        {
            "name": "torchmetrics",
            "specs": []
        },
        {
            "name": "einops",
            "specs": []
        },
        {
            "name": "black",
            "specs": [
                [
                    ">=",
                    "22.6"
                ]
            ]
        },
        {
            "name": "coverage",
            "specs": [
                [
                    ">=",
                    "6"
                ]
            ]
        },
        {
            "name": "darglint",
            "specs": [
                [
                    ">=",
                    "1.8"
                ]
            ]
        },
        {
            "name": "isort",
            "specs": [
                [
                    ">=",
                    "5.10"
                ]
            ]
        },
        {
            "name": "pre-commit",
            "specs": [
                [
                    ">=",
                    "2.20"
                ]
            ]
        },
        {
            "name": "pytest-cov",
            "specs": [
                [
                    ">=",
                    "3"
                ]
            ]
        },
        {
            "name": "transformers",
            "specs": [
                [
                    "==",
                    "4.27.4"
                ]
            ]
        },
        {
            "name": "seaborn",
            "specs": []
        },
        {
            "name": "SciencePlots",
            "specs": []
        },
        {
            "name": "scikit-image",
            "specs": []
        },
        {
            "name": "codecov",
            "specs": []
        },
        {
            "name": "jupyter",
            "specs": []
        },
        {
            "name": "loguru",
            "specs": []
        },
        {
            "name": "memory-profiler",
            "specs": []
        }
    ],
    "lcname": "selfclean"
}
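
The `digests` entries in the raw data above make it possible to check a downloaded file before installing it. The following is a minimal sketch, assuming the file has already been downloaded to a local path; `sha256_of` and `matches_digest` are hypothetical helper names, and only the standard-library `hashlib` module is used.

```python
import hashlib

# SHA-256 digest listed above for selfclean-0.0.35-py3-none-any.whl
WHEEL_SHA256 = "7457c0a6fbc74fe3f45e206e16d4b87f4300aa6a28ef6b089e4741cecb03a612"


def sha256_of(path, chunk_size=1 << 20):
    """Compute the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_digest(path, expected_sha256):
    """Return True if the file at `path` has the expected SHA-256 digest."""
    return sha256_of(path) == expected_sha256.lower()


if __name__ == "__main__":
    import sys

    # Usage: python verify.py path/to/selfclean-0.0.35-py3-none-any.whl
    if len(sys.argv) == 2:
        print("digest matches:", matches_digest(sys.argv[1], WHEEL_SHA256))
```

Reading in chunks keeps memory use constant regardless of file size.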
        