# 🧼🔎 SelfClean
[![Test and Coverage](https://github.com/Digital-Dermatology/SelfClean/actions/workflows/pytest-coverage.yml/badge.svg)](https://github.com/Digital-Dermatology/SelfClean/actions/workflows/pytest-coverage.yml)
![SelfClean Teaser](https://github.com/Digital-Dermatology/SelfClean/raw/main/assets/SelfClean_Teaser.png)
A holistic self-supervised data cleaning strategy to detect off-topic samples, near duplicates, and label errors.
**Publications:** [SelfClean Paper (NeurIPS24)](https://arxiv.org/abs/2305.17048) | [Data Cleaning Protocol Paper (ML4H23@NeurIPS)](https://arxiv.org/abs/2309.06961)
**NOTE:** Make sure to have `git-lfs` installed before pulling the repository to ensure the pre-trained models are pulled correctly ([git-lfs install instructions](https://docs.github.com/en/repositories/working-with-files/managing-large-files/installing-git-large-file-storage)).
This project is licensed under the terms of the [Creative Commons Attribution-NonCommercial 4.0 International license](https://creativecommons.org/licenses/by-nc/4.0/).
<img src="https://mirrors.creativecommons.org/presskit/icons/cc.svg" alt="cc" width="20"/> <img src="https://mirrors.creativecommons.org/presskit/icons/by.svg" alt="by" width="20"/> <img src="https://mirrors.creativecommons.org/presskit/icons/nc.svg" alt="nc" width="20"/>
## Installation
> Install SelfClean via [PyPI](https://pypi.org/project/selfclean/):
```bash
# upgrade pip to its latest version
pip install -U pip
# install selfclean
pip install selfclean
# alternatively, use an explicit Python version (XX)
python3.XX -m pip install selfclean
```
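To confirm that the installation landed in the interpreter you intend to use, a short check like the following can help. It is a minimal sketch that relies only on the standard library's `importlib.metadata` (available from Python 3.8 onward); the helper name `installed_version` is hypothetical and not part of SelfClean.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is missing.

    Hypothetical helper for illustration; not part of SelfClean.
    """
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version("selfclean")
    if version is None:
        print("selfclean is not installed in this environment")
    else:
        print(f"selfclean {version} is installed")
```

Running it with the same interpreter you used for `pip` (for example `python3.XX`) shows whether the package is visible to that environment.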
## Getting Started
You can run SelfClean in a few lines of code:
```python
import copy

from selfclean import SelfClean

selfclean = SelfClean(
    # displays the top-7 images from each error type
    # (this option is disabled by default)
    plot_top_N=7,
)

# Option 1: run on a PyTorch dataset
# (`dataset` is your own, already constructed dataset object)
issues = selfclean.run_on_dataset(
    dataset=copy.copy(dataset),
)

# Option 2: run on an image folder
issues = selfclean.run_on_image_folder(
    input_path="path/to/images",
)

# get the data quality issue rankings
df_near_duplicates = issues.get_issues("near_duplicates", return_as_df=True)
df_off_topic_samples = issues.get_issues("off_topic_samples", return_as_df=True)
df_label_errors = issues.get_issues("label_errors", return_as_df=True)
```
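Once you have reviewed the rankings and decided which samples to remove, you still need to turn those decisions into a cleaned dataset. The sketch below shows one possible way to do that with plain Python. The helper `indices_to_keep` and its inputs are assumptions made for illustration: they are not part of SelfClean, and the code does not rely on any particular column layout of the returned DataFrames. It expects you to supply the confirmed indices and duplicate pairs yourself.

```python
from typing import Iterable, List, Tuple


def indices_to_keep(
    n_samples: int,
    flagged: Iterable[int] = (),
    duplicate_pairs: Iterable[Tuple[int, int]] = (),
) -> List[int]:
    """Return the sorted dataset indices that survive cleaning.

    Hypothetical helper for illustration; not part of SelfClean.

    flagged: indices confirmed as off-topic or mislabeled.
    duplicate_pairs: confirmed near-duplicate pairs of indices.
    """
    drop = set(flagged)
    for first, second in duplicate_pairs:
        if first == second:
            continue  # ignore degenerate self-pairs
        # Make sure at least one member of each pair is removed.
        # If one member is already dropped, the pair is resolved.
        if first not in drop and second not in drop:
            drop.add(second)
    return [i for i in range(n_samples) if i not in drop]


if __name__ == "__main__":
    kept = indices_to_keep(6, flagged=[4], duplicate_pairs=[(0, 1), (1, 2)])
    print(kept)
```

With a PyTorch dataset, the resulting list could then be passed to `torch.utils.data.Subset(dataset, kept)` to obtain the cleaned view without copying any data.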
**Examples:**
In `examples/`, we've provided some example notebooks in which you will learn how to analyze and clean datasets using SelfClean.
These examples analyze different benchmark datasets such as:
- <a href="https://github.com/fastai/imagenette">Imagenette</a> 🖼️ (Open in <a href="https://nbviewer.org/github/Digital-Dermatology/SelfClean/blob/main/examples/Investigate_Imagenette.ipynb">NBViewer</a> | <a href="https://github.com/Digital-Dermatology/SelfClean/blob/main/examples/Investigate_Imagenette.ipynb">GitHub</a> | <a href="https://colab.research.google.com/github/Digital-Dermatology/SelfClean/blob/main/examples/Investigate_Imagenette.ipynb">Colab</a>)
- <a href="https://www.robots.ox.ac.uk/~vgg/data/pets/">Oxford-IIIT Pet</a> 🐶 (Open in <a href="https://nbviewer.org/github/Digital-Dermatology/SelfClean/blob/main/examples/Investigate_OxfordIIITPet.ipynb">NBViewer</a> | <a href="https://github.com/Digital-Dermatology/SelfClean/blob/main/examples/Investigate_OxfordIIITPet.ipynb">GitHub</a> | <a href="https://colab.research.google.com/github/Digital-Dermatology/SelfClean/blob/main/examples/Investigate_OxfordIIITPet.ipynb">Colab</a>)
Also, check out our <a href="https://www.kaggle.com/code/fabiangrger/removing-the-psychic-from-the-dataset">Kaggle notebook</a> to see an illustration of how to get a gold medal for cleaning a competition dataset.
## Development Environment
Run `make` for a list of possible targets.
Run these commands to install the requirements for the development environment:
```bash
make init
make install
```
To run linters on all files:
```bash
pre-commit run --all-files
```
We use the following packages for code and test conventions:
- `black` for code style
- `isort` for import sorting
- `pytest` for running tests