# Data-SUITE: Data-centric identification of in-distribution incongruous examples
[![arXiv](https://img.shields.io/badge/arXiv-2202.08836-b31b1b.svg)](https://arxiv.org/abs/2202.08836)
[![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://github.com/vanderschaarlab/Data-SUITE/blob/main/LICENSE)

This repository contains the implementation of Data-SUITE, a "Data-Centric AI" framework for identifying in-distribution incongruous data examples.
Data-SUITE leverages copula modeling, representation learning, and conformal prediction to build feature-wise confidence-interval estimators from a set of training instances. The copula modeling step is optional, but it brings a useful property: after the initial stages, access to the real training data is no longer needed, and smaller datasets can be augmented when required.
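To make the copula idea concrete, here is a minimal Gaussian-copula sketch (illustrative only, not the Data-SUITE implementation): once the copula parameters and marginals are fitted, synthetic rows can be drawn without touching the original data again.

```python
# Minimal Gaussian-copula sketch (illustrative, not Data-SUITE's implementation):
# after fitting, synthetic rows are drawn without the original data.
import numpy as np
from scipy import stats

def fit_gaussian_copula(X):
    """Estimate empirical marginals plus the latent Gaussian correlation."""
    n, _ = X.shape
    ranks = np.argsort(np.argsort(X, axis=0), axis=0) + 1
    z = stats.norm.ppf(ranks / (n + 1))          # normal scores per feature
    corr = np.corrcoef(z, rowvar=False)          # copula correlation matrix
    return corr, np.sort(X, axis=0)              # sorted marginals for inversion

def sample_gaussian_copula(corr, marginals, n_samples, seed=0):
    """Draw synthetic rows; only `corr` and `marginals` are needed."""
    rng = np.random.default_rng(seed)
    n, d = marginals.shape
    z = rng.multivariate_normal(np.zeros(d), corr, size=n_samples)
    u = stats.norm.cdf(z)                        # back to uniform scores
    idx = np.clip((u * n).astype(int), 0, n - 1) # quantile lookup per feature
    return marginals[idx, np.arange(d)]
```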
These estimators can be used to evaluate the congruence of test instances with respect to the training set, to answer two practically useful questions:
(1) Which test instances will be reliably predicted by a model trained on the training instances?
(2) Can we identify incongruous regions of the feature space, so that data owners understand the data's limitations or can guide future data collection?
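To make the conformal-prediction step concrete, here is a minimal sketch of split conformal intervals; the function name, regressor choice, and `alpha` are illustrative choices of ours, not the library's API.

```python
# Minimal sketch of split conformal prediction, the interval machinery behind
# Data-SUITE's congruence scores. Names and regressor choice are illustrative.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def split_conformal_interval(X_fit, y_fit, X_cal, y_cal, X_test, alpha=0.1):
    """Return (lower, upper) arrays covering y_test with prob. >= 1 - alpha."""
    model = GradientBoostingRegressor().fit(X_fit, y_fit)
    # Nonconformity scores: absolute residuals on a held-out calibration split.
    residuals = np.abs(y_cal - model.predict(X_cal))
    n = len(residuals)
    # Finite-sample-corrected quantile of the calibration residuals.
    q = np.quantile(residuals, min(1.0, np.ceil((n + 1) * (1 - alpha)) / n))
    pred = model.predict(X_test)
    return pred - q, pred + q
```

Roughly speaking, each feature takes the role of the response in turn: the feature is regressed on a learned representation of the instance, and a test point whose observed value falls outside its interval (or whose interval is unusually wide) is flagged as incongruous.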
For more details, please read our [ICML 2022 paper](https://arxiv.org/abs/2202.08836): 'Data-SUITE: Data-centric identification of in-distribution incongruous examples'.
## Installation
1. Clone the repository
2. Create a new virtual environment with Python 3.7, 3.8, or 3.9, e.g.:
```shell
virtualenv ds_env
```
3. Run the following command from the repository directory:
```shell
pip install -r requirements.txt
```
4. Two of the benchmark libraries (alibi-detect and aix360) have conflicting requirements for tensorflow. This can be circumvented by running the command below. If this does not resolve the issue, manually install the two packages listed in ``requirements-no-deps.txt`` using pip.
```shell
pip install --no-deps -r requirements-no-deps.txt
```
**NOTE:** It is now also possible to install this repo from source or from PyPI, in the following ways.
1. From inside the repo you can run:
```shell
pip install .
```
or, from anywhere, run:
```shell
pip install data_suite
```
This installs the minimum number of packages to run `data_suite`.
2. If you wish to run the benchmarks, install with the `benchmarks` extra:
```shell
pip install "data_suite[benchmarks]"
```
3. If you wish to contribute and adhere to the coding style, install with the `contribute` extra:
```shell
pip install "data_suite[contribute]"
```
## Getting started

We provide two tutorial notebooks to illustrate the usage of Data-SUITE, with an example on synthetic data.
These notebooks can be found in the ``/tutorial`` folder.
1. ``tutorial_simple.ipynb``
- Demonstrates a simple object-oriented (OO) interface to Data-SUITE with straightforward fit & predict methods (see the sketch after this list).
2. ``tutorial_detailed.ipynb``
- Provides a more detailed look at the inner workings of Data-SUITE.
Both tutorials achieve the same objective: getting you started with Data-SUITE.
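For orientation, the fit & predict pattern from the simple tutorial looks roughly like the following; the import path and class name here are stand-ins, so consult ``tutorial_simple.ipynb`` for the exact ones.

```python
# Hypothetical fit & predict flow mirroring tutorial_simple.ipynb; the import
# path and class name below are stand-ins -- check the notebook for the real ones.
import numpy as np

rng = np.random.default_rng(0)
X_train = rng.normal(size=(500, 5))    # training instances
X_test = rng.normal(size=(100, 5))     # test instances to score

# from data_suite import DataSuite     # hypothetical import path
# ds = DataSuite()
# ds.fit(X_train)                      # build feature-wise interval estimators
# flags = ds.predict(X_test)           # congruence flags / scores per instance
```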
## Data-SUITE with synthetic & real-data
We also provide code to run Data-SUITE on public datasets. This includes the synthetic data experiments as well as the publicly available real-world datasets.
A variety of Jupyter notebooks are provided for this purpose; they are contained in the ``/notebooks`` folder of the repo.
For ease of use, we have provided bash scripts that execute the notebooks via Papermill. The results for all the experiments and analyses on each dataset are stored in its specific ``/results`` folder, including dataframes of metrics, figures, etc.
These bash scripts are contained in the ``/scripts`` folder.
1. Synthetic data:
The synthetic data experiment uses [Weights and Biases - wandb](https://wandb.ai) to log results over the various runs.
Your specific wandb credentials should be added to: ``notebooks/synthetic_pipeline.ipynb``
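Concretely, the credentials amount to something like the following inside the notebook; the project and entity names below are placeholders, not values from the repo.

```python
# Placeholder wandb setup -- substitute your own entity/project; the names
# here are not taken from the repository.
import wandb

wandb.login()  # or: wandb.login(key="YOUR_API_KEY")
run = wandb.init(project="data-suite-synthetic", entity="your-wandb-entity")
```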
Thereafter, run from the main directory:
```shell
bash scripts/synthetic_pipeline.sh
```
Once the experiment has completed, all results are logged to wandb. Note that this run might take quite some time. You can then download the ``.csv`` of logged results from wandb and place it in the ``/artifacts`` folder as ``synthetic_artifacts.csv``.
Since the experiment can take quite long, we have provided a pre-computed artifact, ``synthetic_artifacts.csv``, obtained from wandb.
``synthetic_artifacts.csv`` can then be processed to obtain the desired metrics & plots.
To do so, run from the main directory:
```shell
bash scripts/process_synthetic.sh
```
All results will then be written to ``/results/synthetic``.
2. Real data:
To run the public real-world datasets, simply run one of the provided scripts from the main directory, for example:
```shell
bash scripts/run_adult.sh
```
OR
```shell
bash scripts/run_electric.sh
```
All results from the different main-paper & appendix experiments will be written to the ``/results`` folder. These include dataframes for tables of metrics, figures, etc.
The real-world dataset notebooks can also serve as a template for usage on your own data.
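For example, a minimal starting point for your own tabular data might look like this; the file name and the numeric-features-only assumption are ours, so adapt the preprocessing from the notebooks.

```python
# Hedged starting point for your own data; preprocessing here is illustrative.
import pandas as pd

df = pd.read_csv("my_dataset.csv")           # hypothetical file name
X = df.select_dtypes("number").to_numpy()    # assume numeric features only
# ...then follow the same fit/predict steps used in the real-data notebooks.
```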
## Citing
If you use this code, please cite the associated paper:
```
@inproceedings{seedat2022data,
  title={Data-SUITE: Data-centric identification of in-distribution incongruous examples},
  author={Seedat, Nabeel and Crabbe, Jonathan and van der Schaar, Mihaela},
  booktitle={International Conference on Machine Learning},
  year={2022}
}
```