mdca


Namemdca JSON
Version 0.1.16 PyPI version JSON
download
home_pageNone
SummaryMDCA: Multi-dimensional Data Combination Analysis. It's used to analysis data table through multi-dimensional data combinations. Multi-dimensional distribution, fairness, and model error analysis are supported.
upload_time2025-03-15 04:31:51
maintainerNone
docs_urlNone
authorNone
requires_python>=3.8
licenseNone
keywords multi-dimensional multidimensional distribution fairness model fairness error analysis model error model error analysis
requirements bitarray numpy pandas scipy
# MDCA: Multi-dimensional Data Combination Analysis

## Languages
#### [English Version](README.md)  ####
#### [Simplified Chinese Version](README_zh.md)  ####

## What's MDCA?

MDCA analyzes combinations of values across multiple columns (dimensions) of a data table.
It supports multi-dimensional distribution, fairness, and model error analysis.

### Multi-dimensional Distribution Analysis

A skewed data distribution may bias a prediction model towards majority classes and make it overfit minority classes, which reduces the accuracy of the model.
Even if the values of each individual column are uniformly distributed, combinations of values across multiple columns tend to be non-uniform.  
**Multi-dimensional distribution analysis quickly finds the value combinations whose distribution deviates from the baseline.**
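
The point about uniform columns and non-uniform combinations can be illustrated with a small sketch. It uses only the Python standard library and a made-up eight-row table; the helper name `shares` and the data are assumptions for illustration and are not part of MDCA.

```python
from collections import Counter

# Hypothetical toy table: each column is perfectly balanced on its own.
rows = [
    {"gender": "male", "nationality": "China"},
    {"gender": "male", "nationality": "China"},
    {"gender": "male", "nationality": "China"},
    {"gender": "male", "nationality": "America"},
    {"gender": "female", "nationality": "China"},
    {"gender": "female", "nationality": "America"},
    {"gender": "female", "nationality": "America"},
    {"gender": "female", "nationality": "America"},
]


def shares(rows, columns):
    """Return the share of rows for every value combination of `columns`."""
    counts = Counter(tuple(row[c] for c in columns) for row in rows)
    return {combo: n / len(rows) for combo, n in counts.items()}


gender_shares = shares(rows, ["gender"])            # 50% / 50%
nationality_shares = shares(rows, ["nationality"])  # 50% / 50%
pair_shares = shares(rows, ["gender", "nationality"])

for combo, share in sorted(pair_shares.items()):
    print(combo, f"{share:.1%}")
```

Each single column splits 50/50, yet the pair (male, China) covers 37.5% of the rows while (male, America) covers only 12.5%. A per-column check would not reveal this kind of skew.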

### Multi-dimensional Fairness Analysis

Data can be inherently biased. For example, gender, race, and nationality values may cause the model to make biased predictions,
and it is not always feasible to simply remove the columns that may be biased.
Even if every column is fair on its own, a combination of multiple columns can still be biased.  
**Multi-dimensional fairness analysis quickly finds the value combinations whose positive rate deviates from the baseline and that cover a large number of rows.**

Fairness detection in raw data sets is supported now; model fairness (e.g. Equalized Odds, Demographic Parity) is under development.

### Multi-dimensional Model Error Analysis

A model has different prediction accuracy for different value combinations.
Finding the value combinations with a higher prediction error rate helps you understand where the model fails, so that you can improve the data quality and the prediction accuracy of the model.  
**Multi-dimensional model error analysis quickly finds the value combinations whose prediction error rate deviates from the baseline and that account for a large number of prediction errors.**

## Installing

```bash
pip install mdca
```

## Typical usages

### Distribution Analysis

```bash
# recommended
mdca --data='path/to/data.csv' --mode=distribution --min-coverage=0.05 --target-column=<name of label column> --target-value=<value of positive label>  

# for data tables that don't have a label column
mdca --data='path/to/data.csv' --mode=distribution --min-coverage=0.05  
```

### Fairness Analysis

```bash
mdca --data='path/to/data.csv' --mode=fairness --target-column=<name of label column> --target-value=<value of positive label> --min-coverage=0.05  
```

### Model Error Analysis

```bash
mdca --data='path/to/data.csv' --mode=error --target-column=<name of label column> --prediction-column=<name of predicted label column> --min-error-coverage=0.05  
```

## Concepts

A data table has multiple columns that describe the characteristics of objects.  
If the data is used to train a classification model, there is also an _actual label_ column.  
Likewise, if a model has made predictions, there is a _predicted label_ column that stores the prediction results.

| columnA | columnB | ... | columnX | actual label<br/>(optional) | predicted label<br/>(optional) |
| ------- | ------- | --- | ------- | --------------------------- | ------------------------------ |
| valueA1 | valueB1 | ... | valueX1 | 1                           | 1                              |
| valueA2 | valueB2 | ... | valueX2 | 0                           | 1                              |
| valueA3 | valueB3 | ... | valueX3 | 0                           | 0                              |
| valueA4 | valueB4 | ... | valueX4 | 1                           | 1                              |
| ...     | ...     | ... | ...     | ...                         | ...                            |

With this kind of data table, MDCA uses the following concepts:

**Target column** (-tc or --target-column): The name of the actual label column. It's optional in **_distribution_** mode,
but mandatory in **_fairness_** and **_error_** modes.

**Target value** (-tv or --target-value): The label value of positive samples in the target column.
For example, _"1"_ or _"true"_ is often used for binary classification. For multi-class classification,
you can specify the target category you want to analyze, such as "sport" for news classification
or "rain" for weather prediction.

**Prediction column** (-pc or --prediction-column): The name of the predicted label column. It's currently only available in **_error_** mode.

**Min coverage** (-mc or --min-coverage): The minimum proportion of rows that an analyzed value combination must cover in the total data.
Value combinations below this threshold are ignored. The default value is shown by _mdca --help_.

**Min target coverage** (-mtc or --min-target-coverage): The minimum proportion of rows that an analyzed value combination must cover in the target data (rows where the value in target-column == target-value).
Value combinations below this threshold are ignored. The default value is shown by _mdca --help_.

**Min error coverage** (-mec or --min-error-coverage): The minimum proportion of rows that an analyzed value combination must cover in the error data (rows where the value in prediction-column != the value in target-column). Value combinations below this threshold are ignored. The default value is shown by _mdca --help_.
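
To make the three coverage definitions concrete, here is a minimal sketch in plain Python. It is not MDCA's implementation: the column names `label` and `pred`, the sample rows, and the helper names are all hypothetical, and the 0.05 threshold simply mirrors the value used in the example commands.

```python
# Hypothetical rows with an actual label ("label") and a predicted label ("pred").
rows = [
    {"gender": "male", "age": 29, "label": 1, "pred": 1},
    {"gender": "male", "age": 29, "label": 1, "pred": 0},
    {"gender": "male", "age": 30, "label": 0, "pred": 0},
    {"gender": "female", "age": 29, "label": 0, "pred": 1},
    {"gender": "female", "age": 30, "label": 1, "pred": 1},
    {"gender": "female", "age": 30, "label": 0, "pred": 0},
    {"gender": "male", "age": 29, "label": 0, "pred": 0},
    {"gender": "female", "age": 29, "label": 1, "pred": 0},
]


def matches(row, combination):
    """True if the row has every column value named in the combination."""
    return all(row[col] == val for col, val in combination.items())


def coverage(rows, combination):
    """Share of `rows` that match the value combination (0.0 for no rows)."""
    if not rows:
        return 0.0
    return sum(matches(r, combination) for r in rows) / len(rows)


combination = {"gender": "male", "age": 29}
target_rows = [r for r in rows if r["label"] == 1]         # target-column == target-value
error_rows = [r for r in rows if r["pred"] != r["label"]]  # prediction != actual label

cov = coverage(rows, combination)                # coverage in the total data
target_cov = coverage(target_rows, combination)  # coverage in the target data
error_cov = coverage(error_rows, combination)    # coverage in the error data

min_coverage = 0.05
print(f"coverage={cov:.3f} target={target_cov:.3f} error={error_cov:.3f}")
print("kept" if cov >= min_coverage else "ignored")
```

The three numbers differ only in the denominator: all rows, rows with the positive label, or rows where the prediction is wrong.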

## Getting Started

### Performing Distribution Analysis

To perform _Distribution Analysis_, specify the path of a data table (only CSV is supported so far) and set the analysis mode to "distribution".
If your data table has a target column, specifying **_Target column_** and **_Target value_** is recommended,
so that the analyzer can report target-related indicators for each distribution.  
The simplest command is:
```bash
# recommended
mdca --data='path/to/data.csv' --mode=distribution --target-column=<name of label column> --target-value=<value of positive label>

# for data tables that don't have a label column
mdca --data='path/to/data.csv' --mode=distribution
```

**_Min coverage_** is always applied; if you don't specify a value, the default value shown by --help is used.
You can also set arguments such as min coverage and min target coverage manually:
```bash
# manually specify min coverage or min target coverage
mdca --data='path/to/data.csv' --mode=distribution --min-coverage=0.05  
mdca --data='path/to/data.csv' --mode=distribution --min-target-coverage=0.05  
```

You can also specify the columns you want to analyze:
```bash
# if you want to check whether column1, column2, column3 are uniformly distributed
mdca --data='path/to/data.csv' --mode=distribution --column='column1, column2, column3'  
```

After the execution finishes, you will get results like this:

========== Results of Coverage Increase ============

| Coverage (Baseline, +N%, *X)     | Target Rate(Overall +N%) | Result                                                                                          |
| -------------------------------- | ------------------------ | ----------------------------------------------------------------------------------------------- |
| 54.52% ( 8.33%, +46.19%, *6.54 ) | 25.95% ( -5.72%)         | [nationality=Dutch, ind-debateclub=False, ind-entrepeneur_exp=False]                            |
| 62.00% (16.67%, +45.33%, *3.72 ) | 29.35% ( -2.32%)         | [nationality=Dutch, ind-international_exp=False]                                                |
| 41.33% (11.11%, +30.21%, *3.72 ) | 35.63% ( +3.96%)         | [gender=male, nationality=Dutch]                                                                |
| 39.40% (11.11%, +28.29%, *3.55 ) | 20.69% (-10.99%)         | [nationality=Dutch, ind-degree=bachelor]                                                        |
| 30.33% ( 4.17%, +26.16%, *7.28 ) | 26.30% ( -5.38%)         | [ind-debateclub=False, ind-international_exp=False, ind-entrepeneur_exp=False, ind-languages=1] |
| ...                              | ...                      | ...                                                                                             |

This result has three columns: **Coverage (Baseline, +N%, *X)**, **Target Rate (Overall +N%)**, and **Result**.  
**Coverage** is the actual proportion of rows matching the current result in the total data.  
**Baseline** is the expected coverage of the current result. __(+N%, *X)__ shows by how many percentage points and by what factor the actual coverage exceeds the baseline coverage.  

**Baseline** coverage is calculated with the following formula:

$$
\vec{C} = (\text{column}_1, \text{column}_2, \ldots, \text{column}_N), \quad \text{column}_i \in \text{Columns}(\text{Data Table})
$$

$$
\text{Baseline Coverage}(\vec{C}) = \frac{1}{\text{Unique Value Combinations}(\vec{C})}
$$

For example, suppose gender has two values, *male* and *female*, and nationality has two values, *China* and *America*.
The value combinations of $ \vec{C}=(gender, nationality) $ are: {*(male, China), (male, America), (female, China), (female, America)*}.
So $ \text{Unique Value Combinations}(\vec{C}) = 4 $, and $ \text{Baseline Coverage}(\vec{C}) = \frac{1}{4} = 0.25 $.
In other words, the Baseline Coverage is the proportion of rows each value combination would have if all the data were ideally uniformly distributed.
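
The baseline formula and the __(+N%, *X)__ figures can be reproduced with a short sketch. It assumes that the number of unique value combinations is the product of the distinct-value counts of the selected columns, which matches the 2 × 2 = 4 example above; whether MDCA counts combinations exactly this way is an assumption. The function names and the toy rows are hypothetical.

```python
from math import prod

# Hypothetical toy table reusing the gender / nationality values from the example.
rows = [
    {"gender": "male", "nationality": "China"},
    {"gender": "male", "nationality": "China"},
    {"gender": "male", "nationality": "China"},
    {"gender": "male", "nationality": "America"},
    {"gender": "female", "nationality": "China"},
    {"gender": "female", "nationality": "America"},
    {"gender": "female", "nationality": "America"},
    {"gender": "female", "nationality": "America"},
]


def baseline_coverage(rows, columns):
    """1 / (number of possible value combinations of `columns`).

    Assumption: the count is the product of distinct values per column.
    """
    distinct_counts = [len({row[c] for row in rows}) for c in columns]
    return 1 / prod(distinct_counts)


def coverage_report(rows, combination):
    """Actual coverage, baseline, absolute increase (+N) and ratio (*X)."""
    matched = sum(
        all(row[c] == v for c, v in combination.items()) for row in rows
    )
    actual = matched / len(rows)
    baseline = baseline_coverage(rows, list(combination))
    return {
        "coverage": actual,
        "baseline": baseline,
        "increase": actual - baseline,
        "ratio": actual / baseline,
    }


report = coverage_report(rows, {"gender": "male", "nationality": "China"})
print(report)
```

For this toy table the combination (male, China) has a coverage of 37.5% against a baseline of 25%, so the increase is 12.5 percentage points and the ratio is 1.5. The same arithmetic explains the sample output above: 41.33% divided by the 11.11% baseline gives roughly 3.72.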

**Target Rate** is the rate of positive samples among the rows of the given value combination. **Result** is the given value combination.

### Performing Fairness Analysis

To perform _Fairness Analysis_, specify the path of a data table (only CSV is supported so far) and set the analysis mode to "fairness".
**_Target column_** and **_Target value_** are mandatory, so that MDCA can analyze the fairness of the target rate for each value combination.  
The simplest command is:
```bash
mdca --data='path/to/data.csv' --mode=fairness --target-column=<name of label column> --target-value=<value of positive label>
```
**_Min coverage_** is always applied; if you don't specify a value, the default value shown by --help is used.
You can also set arguments such as min coverage and min target coverage manually:
```bash
mdca --data='path/to/data.csv' --mode=fairness  --target-column=<name of label column> --target-value=<value of positive label> --min-coverage=0.05  
mdca --data='path/to/data.csv' --mode=fairness  --target-column=<name of label column> --target-value=<value of positive label> --min-target-coverage=0.05  
```

You can also specify the columns you want to analyze:
```bash
# if you want to check whether the positive sample rate is fair across combinations of column1, column2, column3
mdca --data='path/to/data.csv' --mode=fairness --column='column1, column2, column3' --target-column=<name of label column> --target-value=<value of positive label>  
```

After the execution finishes, you will get results like this:

========== Results of Target Rate Increase ============

| Coverage(Count), | Target Rate(Overall+N%), | Result                           |
|------------------|--------------------------|----------------------------------|
| 13.18% (   527), | 41.75% (+10.07%),        | [gender=male, sport=Rugby]       |
| 5.33% (   213),  | 44.13% (+12.46%),        | [gender=male, age=29]            |
| 7.22% (   289),  | 40.14% ( +8.46%),        | [age=30]                         |
| 41.33% (  1653), | 35.63% ( +3.96%),        | [gender=male, nationality=Dutch] |
| 15.72% (   629), | 36.09% ( +4.41%),        | [gender=male, sport=Football]    |
| 5.92% (   237),  | 37.55% ( +5.88%),        | [gender=male, age=24]            |
| ...              | ...                      | ...                              |

This result has three columns: **Coverage (Count)**, **Target Rate (Overall +N%)**, and **Result**.   
**Coverage** is the actual proportion of rows matching the current result in the total data.   
**Count** is the actual number of rows.  
**Target Rate** is the rate of positive samples among the rows of the given value combination.  
**(Overall +N%)** shows how much higher the target rate is than the overall target rate of the whole data table.  
**Result** is the given value combination.  
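
The columns of the fairness result can be recomputed by hand for any value combination. The following is a minimal sketch of that arithmetic, not MDCA's own code; the column name `label`, the sample rows, and the function names are hypothetical.

```python
# Hypothetical rows; "label" plays the role of the target column.
rows = [
    {"gender": "male", "sport": "Rugby", "label": 1},
    {"gender": "male", "sport": "Rugby", "label": 1},
    {"gender": "male", "sport": "Rugby", "label": 0},
    {"gender": "male", "sport": "Football", "label": 0},
    {"gender": "female", "sport": "Rugby", "label": 0},
    {"gender": "female", "sport": "Football", "label": 1},
    {"gender": "female", "sport": "Football", "label": 0},
    {"gender": "female", "sport": "Football", "label": 0},
]


def target_rate(rows, target_column, target_value):
    """Share of rows whose target column equals the positive value."""
    if not rows:
        return 0.0
    return sum(r[target_column] == target_value for r in rows) / len(rows)


def fairness_row(rows, combination, target_column, target_value):
    """Coverage, count, target rate and deviation from the overall rate."""
    subset = [
        r for r in rows
        if all(r[c] == v for c, v in combination.items())
    ]
    overall = target_rate(rows, target_column, target_value)
    rate = target_rate(subset, target_column, target_value)
    return {
        "coverage": len(subset) / len(rows),
        "count": len(subset),
        "target_rate": rate,
        "overall_diff": rate - overall,
    }


result = fairness_row(rows, {"gender": "male", "sport": "Rugby"}, "label", 1)
print(result)
```

In this toy table the overall target rate is 37.5%, while the combination (male, Rugby) covers 3 rows and has a target rate of about 66.7%, so it would be reported with a positive deviation of roughly 29.2 percentage points.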


### Performing Model Error Analysis

To perform _Model Error Analysis_, specify the path of a data table (only CSV is supported so far) and set the analysis mode to "error".
**_Target column_** and **_Prediction column_** are mandatory, so that MDCA can analyze the error rate of each value combination.  
The simplest command is:
```bash
mdca --data='path/to/data.csv' --mode=error --target-column=<name of label column> --prediction-column=<name of predicted label column> 
```
**_Min error coverage_** is always applied; if you don't specify a value, the default value shown by --help is used.
You can also set arguments such as min coverage and min error coverage manually:
```bash
mdca --data='path/to/data.csv' --mode=error  --target-column=<name of label column> --prediction-column=<name of predicted label column>  --min-coverage=0.05  
mdca --data='path/to/data.csv' --mode=error  --target-column=<name of label column> --prediction-column=<name of predicted label column>  --min-error-coverage=0.05  
```

You can also specify the columns you want to analyze:
```bash
# if you only want to analyze the prediction error rate of combinations of column1, column2, column3
mdca --data='path/to/data.csv' --mode=error --column='column1, column2, column3' --target-column=<name of label column> --prediction-column=<name of predicted label column>
```

After the execution finishes, you will get results like this:

========== Results of Error Rate Increase ============

| Error Coverage(Count) | Error Rate(Overall+N%) | Result                                           |
|-----------------------|------------------------|--------------------------------------------------|
| 51.69% ( 20713)       | 35.97% (+12.92%)       | [subGrade_trans=[14, 30)]                        |
| 11.46% (  4591)       | 40.35% (+17.31%)       | [term=5, verificationStatus=2]                   |
| 12.22% (  4897)       | 36.36% (+13.32%)       | [term=5, verificationStatus=1]                   |
| 21.04% (  8430)       | 32.77% ( +9.73%)       | [verificationStatus=2, ficoRangeHigh=[664, 687)] |
| 5.90% (  2364)        | 37.13% (+14.08%)       | [term=5, n14=3]                                  |
| 53.32% ( 21365)       | 28.40% ( +5.36%)       | [ficoRangeHigh=[664, 687)]                       |
| ...                   | ...                    | ...                                              |

This result has three columns: **Error Coverage (Count)**, **Error Rate (Overall +N%)**, and **Result**.   
**Error Coverage** is the actual proportion of rows matching the current result in the prediction error data.   
**Count** is the actual number of rows.  
**Error Rate** is the rate of prediction errors among the rows of the given value combination.  
**(Overall +N%)** shows how much higher the error rate is than the overall error rate of the whole data table.  
**Result** is the given value combination.  
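
As with the other modes, the error figures can be recomputed for a single value combination. This is a sketch under stated assumptions rather than MDCA's implementation: the column names `label` and `pred`, the sample rows, and the function name are hypothetical, and **Count** is taken here to be the number of erroneous rows that match the combination, which the description above does not state explicitly.

```python
# Hypothetical rows with an actual label ("label") and a predicted label ("pred").
rows = [
    {"term": 5, "verificationStatus": 2, "label": 1, "pred": 0},
    {"term": 5, "verificationStatus": 2, "label": 0, "pred": 0},
    {"term": 5, "verificationStatus": 1, "label": 1, "pred": 1},
    {"term": 3, "verificationStatus": 2, "label": 0, "pred": 1},
    {"term": 3, "verificationStatus": 1, "label": 0, "pred": 0},
    {"term": 3, "verificationStatus": 1, "label": 1, "pred": 1},
    {"term": 5, "verificationStatus": 2, "label": 0, "pred": 1},
    {"term": 3, "verificationStatus": 2, "label": 1, "pred": 1},
]


def error_row(rows, combination, target_column, prediction_column):
    """Error coverage, error count, error rate and deviation from overall."""
    def is_error(row):
        return row[prediction_column] != row[target_column]

    subset = [
        r for r in rows
        if all(r[c] == v for c, v in combination.items())
    ]
    all_errors = sum(is_error(r) for r in rows)
    subset_errors = sum(is_error(r) for r in subset)

    overall_rate = all_errors / len(rows) if rows else 0.0
    error_rate = subset_errors / len(subset) if subset else 0.0
    error_coverage = subset_errors / all_errors if all_errors else 0.0
    return {
        "error_coverage": error_coverage,  # share of all errors in this combination
        "count": subset_errors,            # assumption: number of erroneous rows
        "error_rate": error_rate,          # errors / rows within the combination
        "overall_diff": error_rate - overall_rate,
    }


result = error_row(rows, {"term": 5, "verificationStatus": 2}, "label", "pred")
print(result)
```

Here 3 of the 8 rows are mispredicted, so the overall error rate is 37.5%. The combination (term=5, verificationStatus=2) contains 2 of those 3 errors and has an error rate of about 66.7%, which is the kind of combination the error mode is meant to surface.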

            

Raw data

            {
    "_id": null,
    "home_page": null,
    "name": "mdca",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.8",
    "maintainer_email": "jingjiajie <932166095@qq.com>",
    "keywords": "multi-dimensional, multidimensional, distribution, fairness, model fairness, error analysis, model error, model error analysis",
    "author": null,
    "author_email": "jingjiajie <932166095@qq.com>",
    "download_url": "https://files.pythonhosted.org/packages/70/cf/c600520a3a088e97e6d136f38f3b7d567d4036c864c6f179ec753da0a9ad/mdca-0.1.16.tar.gz",
    "platform": null,
    "description": "# MDCA: Multi-dimensional Data Combination Analysis.\r\n\r\n## Languages \u591a\u8bed\u8a00:\r\n#### [English Version](README.md)  ####\r\n#### [\u7b80\u4f53\u4e2d\u6587\u7248\u672c](README_zh.md)  ####\r\n\r\n## What's MDCA?\r\n\r\nMDCA analyzes multi-dimensional data combinations in data table.\r\nMulti-dimensional distribution, fairness, and model error analysis are supported.\r\n\r\n### Multi-dimensional Distribution Analysis\r\n\r\nThe distribution deviation of data may cause the prediction model to be biased towards majority classes and overfit minority classes, which affects the accuracy of the model.\r\nEven if the data distribution of different values for each column is uniform, combinations of values in multiple columns tend to be non-uniform.  \r\n**Multi-dimensional distribution analysis can quickly find the value combinations with deviated-from-baseline distributions.**\r\n\r\n### Multi-dimensional Fairness Analysis\r\n\r\nData can be inherently biased. For example, gender, race, and nationality values may cause the model to make biased predictions,\r\nand it is not always feasible to simply remove columns that may be biased.\r\nEven if every column is fair, combination of multiple columns can be biased.  \r\n**Multi-dimensional fairness analysis can quickly find the value combinations with deviated-from-baseline positive rates as well as higher amounts.**\r\n\r\nFairness detection in raw data sets is now supported, but Model fairness (eg. Equal Odds, Demographic Parity, etc.) is under development.\r\n\r\n### Multi-dimensional Model Error Analysis\r\n\r\nModel has different prediction accuracy for different value combinations.\r\nFinding the value combinations with higher prediction error rate is helpful to understand the error of model, so as to improve the data quality and improve model prediction accuracy.  
\r\n**Multi-dimensional model error analysis can quickly find the value combinations with deviated-from-baseline prediction error rates as well as higher amounts in prediction error.**\r\n\r\n## Installing\r\n\r\n```bash\r\npip install mdca\r\n```\r\n\r\n## Typical usages\r\n\r\n### Distribution Analysis\r\n\r\n```bash\r\n# recommended\r\nmdca --data='path/to/data.csv' --mode=distribution --min-coverage=0.05 --target-column=<name of label column> --target-value=<value of positive label>  \r\n\r\n# for data tables doesn't have a label column\r\nmdca --data='path/to/data.csv' --mode=distribution --min-coverage=0.05  \r\n```\r\n\r\n### Fairness Analysis\r\n\r\n```bash\r\nmdca --data='path/to/data.csv' --mode=fairness --target-column=<name of label column> --target-value=<value of positive label> --min-coverage=0.05  \r\n```\r\n\r\n### Model Error Analysis\r\n\r\n```bash\r\nmdca --data='path/to/data.csv' --mode=error --target-column=<name of label column> --prediction-column=<name of predicted label column> --min-error-coverage=0.05  \r\n```\r\n\r\n## Concepts\r\n\r\nFor a data table, there are multiple columns to describe multiple characteristics of objects.  \r\nIf in some cases, the data is used to train classification models, there is also an _actual label_ column.  \r\nAs well, for model prediction, there is also a _predicted label_ to store the prediction results of a model.\r\n\r\n| columnA | columnB | ... | columnX | actual label<br/>(optional) | predicted label<br/>(optional) |\r\n| ------- | ------- | --- | ------- | --------------------------- | ------------------------------ |\r\n| valueA1 | valueB1 | ... | valueX1 | 1                           | 1                              |\r\n| valueA2 | valueB2 | ... | valueX2 | 0                           | 1                              |\r\n| valueA3 | valueB3 | ... | valueX3 | 0                           | 0                              |\r\n| valueA4 | valueB4 | ... 
| valueX4 | 1                           | 1                              |\r\n| ...     | ...     | ... | ...     | ...                         | ...                            |\r\n\r\nWith this kind of data table, MDCA uses the following concepts:\r\n\r\n**Target column** (-tc or --target-column): The name of the actual label column. It's optional in **_distribution_** mode,\r\nbut mandatory in **_fairness_** and **_error_** mode.\r\n\r\n**Target value** (-tv or --target-value): The label value of positive sample in the target column.\r\nFor example, _\"1\", \"true\"_ is often used for binary-classification, and for multi-classification,\r\nyou can specify it as a target category you want to analysis, like \"sport\" for a news classification,\r\nor \"rain\" for a weather prediction.\r\n\r\n**Prediction column** (-pc or --prediction-column): The name of predicted label column. It's only available in **_error_** mode now.\r\n\r\n**Min coverage** (-mc or --min-coverage): Minimum proportion of rows of analyzed value combinations in the total data.\r\nData combinations lower than this threshold will be ignored. Default value can be viewed using _mdca --help_\r\n\r\n**Min target coverage** (-mtc or --min-target-coverage): Minimum proportion of rows of analyzed value combinations in the target data (value in target-column == target-value).\r\nData combinations lower than this threshold will be ignored. Default value can be viewed using _mdca --help_\r\n\r\n**Min error coverage** (-mec or --min-error-coverage): Minimum proportion of rows of analyzed value combinations in the error data (value in prediction-column != value in target-column). Data combinations lower than this threshold will be ignored. 
Default value can be viewed using _mdca --help_\r\n\r\n## Getting Started\r\n\r\n### Performing Distribution Analysis\r\n\r\nTo perform _Distribution Analysis_, you need to specify a data table path (CSV is supported so far) and an analysis mode as \"distribution\".\r\nMeanwhile, **_Target column_** and **_Target value_** are recommended to specify if your data table has a target column.\r\nIn this way, analyzer can give target related indicators with each distribution.  \r\nThe simplest command is:\r\n```bash\r\n# recommended\r\nmdca --data='path/to/data.csv' --mode=distribution --target-column=<name of label column> --target-value=<value of positive label>\r\n\r\n# for data tables doesn't have a label column\r\nmdca --data='path/to/data.csv' --mode=distribution\r\n```\r\n\r\n**_Min coverage_** is mandatory, but without specifying a value, it will use a default value described in --help.\r\nYou can still manually specify arguments like min coverage, min target coverage:\r\n```bash\r\n# manually specify min coverage\r\nmdca --data='path/to/data.csv' --mode=distribution --min-coverage=0.05  \r\nmdca --data='path/to/data.csv' --mode=distribution --min-target-coverage=0.05  \r\n```\r\n\r\nYou can also specify columns you want to analysis:\r\n```bash\r\n# if you want to ensure column1, column2, column3 to be uniform distributed\r\nmdca --data='path/to/data.csv' --mode=distribution --column='column1, column2, column3'  \r\n```\r\n\r\nAfter execution finished, you will get results like this:\r\n\r\n========== Results of Coverage Increase ============\r\n\r\n| Coverage (Baseline, +N%, *X)     | Target Rate(Overall +%N) | Result                                                                                          |\r\n| -------------------------------- | ------------------------ | ----------------------------------------------------------------------------------------------- |\r\n| 54.52% ( 8.33%, +46.19%, *6.54 ) | 25.95% ( -5.72%)         | [nationality=Dutch, 
ind-debateclub=False, ind-entrepeneur_exp=False]                            |\r\n| 62.00% (16.67%, +45.33%, *3.72 ) | 29.35% ( -2.32%)         | [nationality=Dutch, ind-international_exp=False]                                                |\r\n| 41.33% (11.11%, +30.21%, *3.72 ) | 35.63% ( +3.96%)         | [gender=male, nationality=Dutch]                                                                |\r\n| 39.40% (11.11%, +28.29%, *3.55 ) | 20.69% (-10.99%)         | [nationality=Dutch, ind-degree=bachelor]                                                        |\r\n| 30.33% ( 4.17%, +26.16%, *7.28 ) | 26.30% ( -5.38%)         | [ind-debateclub=False, ind-international_exp=False, ind-entrepeneur_exp=False, ind-languages=1] |\r\n| ...                              | ...                      | ...                                                                                             |\r\n\r\nIn this result, there are three columns: **Coverage (Baseline, +N%, *X)**, **Target Rate(Overall +N%)**, and **Result**.  \r\n**Coverage** means the actual proportion of rows of the current result in the total data.  \r\n**Baseline** means the expected coverage of the current result. __(+N%, *X)__ means the actual coverage is how much and how many times higher than the baseline coverage.  
\r\n\r\n**Baseline** coverage is calculated by the following formula:\r\n\r\n$$\r\n\\vec{C} = (column1, column2, ..., columnN) \u2208 Columns(Data Table)\r\n$$\r\n\r\n$$\r\nBaseline Coverage(\\vec{C}) = \\frac{1}{Unique Value Combinations(\\vec{C})}\r\n$$\r\n\r\nFor example, there are two values of gender: *male*, *female*, and two values of nationality: *China*, *America*.\r\nThe value combinations of $ \\vec{C}=(gender, nationality) $ are: {*(male, China), (male, America), (female, China), (female, America)*}.\r\nSo the $ Unique Value Combinations(\\vec{C}) = 4 $, and $ Baseline Coverage(\\vec{C}) = \\frac{1}{4} = 0.25 $.\r\nThis algorithm indicates that the Baseline Coverage is the proportion of rows of a value combination in case of all the data are ideally uniform distributed.\r\n\r\n**Target Rate** means the rate of positive samples in the given value combination. **Result** is the given value combination.\r\n\r\n### Performing Fairness Analysis\r\n\r\nTo perform _Fairness Analysis_, you need to specify a data table path (CSV is supported so far) and an analysis mode as \"fairness\".\r\nMeanwhile, **_Target column_** and **_Target value_** are mandatory, so that MDCA can analysis fairness of target rate to each value combination.  
\r\nThe simplest command is:\r\n```bash\r\nmdca --data='path/to/data.csv' --mode=fairness --target-column=<name of label column> --target-value=<value of positive label>\r\n```\r\n**_Min coverage_** is mandatory, but without specifying a value, it will use a default value described in --help.\r\nYou can still manually specify arguments like min coverage, min target coverage:\r\n```bash\r\nmdca --data='path/to/data.csv' --mode=fairness  --target-column=<name of label column> --target-value=<value of positive label> --min-coverage=0.05  \r\nmdca --data='path/to/data.csv' --mode=fairness  --target-column=<name of label column> --target-value=<value of positive label> --min-target-coverage=0.05  \r\n```\r\n\r\nYou can also specify columns you want to analysis:\r\n```bash\r\n# if you want to ensure positive sample rate of combinations of column1, column2, column3 to be fair\r\nmdca --data='path/to/data.csv' --mode=fairness --column='column1, column2, column3' --target-column=<name of label column> --target-value=<value of positive label>  \r\n```\r\n\r\nAfter execution finished, you will get results like this:\r\n\r\n========== Results of Target Rate Increase ============\r\n\r\n| Coverage(Count), | Target Rate(Overall+N%), | Result                           |\r\n|------------------|--------------------------|----------------------------------|\r\n| 13.18% (   527), | 41.75% (+10.07%),        | [gender=male, sport=Rugby]       |\r\n| 5.33% (   213),  | 44.13% (+12.46%),        | [gender=male, age=29]            |\r\n| 7.22% (   289),  | 40.14% ( +8.46%),        | [age=30]                         |\r\n| 41.33% (  1653), | 35.63% ( +3.96%),        | [gender=male, nationality=Dutch] |\r\n| 15.72% (   629), | 36.09% ( +4.41%),        | [gender=male, sport=Football]    |\r\n| 5.92% (   237),  | 37.55% ( +5.88%),        | [gender=male, age=24]            |\r\n| ...              | ...                      | ...                              
|\r\n\r\nIn this result, there are three columns: **Coverage (Count)**, **Target Rate(Overall +N%)**, and **Result**.   \r\n**Coverage** means the actual proportion of rows of the current result in the total data.   \r\n**Count** means the actual count of rows.  \r\n**Target Rate** means the rate of positive samples in the data of the given value combination. \r\n**(Overall +N%)** means how much higher the target rate is than the overall target rate in the total data table.  \r\n**Result** is the given value combination.  \r\n\r\n\r\n### Performing Model Error Analysis\r\n\r\nTo perform _Model Error Analysis_, you need to specify a data table path (CSV is supported so far) and an analysis mode as \"error\".\r\nMeanwhile, **_Target column_** and **_Prediction column_** are mandatory, so that MDCA can analysis error rate of each value combination.  \r\nThe simplest command is:\r\n```bash\r\nmdca --data='path/to/data.csv' --mode=error --target-column=<name of label column> --prediction-column=<name of predicted label column> \r\n```\r\n**_Min error coverage_** is mandatory, but without specifying a value, it will use a default value described in --help.\r\nYou can still manually specify arguments like min coverage, min error coverage:\r\n```bash\r\nmdca --data='path/to/data.csv' --mode=error  --target-column=<name of label column> --prediction-column=<name of predicted label column>  --min-coverage=0.05  \r\nmdca --data='path/to/data.csv' --mode=error  --target-column=<name of label column> --prediction-column=<name of predicted label column>  --min-error-coverage=0.05  \r\n```\r\n\r\nYou can also specify columns you want to analysis:\r\n```bash\r\n# if you want to ensure positive sample rate of combinations of column1, column2, column3 to be fair\r\nmdca --data='path/to/data.csv' --mode=error --column='column1, column2, column3' --target-column=<name of label column> --prediction-column=<name of predicted label column>\r\n```\r\n\r\nAfter execution finished, you will 
get results like this:\r\n\r\n========== Results of Error Rate Increase ============\r\n\r\n| Error Coverage(Count) | Error Rate(Overall+N%) | Result                                           |\r\n|-----------------------|------------------------|--------------------------------------------------|\r\n| 51.69% ( 20713)       | 35.97% (+12.92%)       | [subGrade_trans=[14, 30)]                        |\r\n| 11.46% (  4591)       | 40.35% (+17.31%)       | [term=5, verificationStatus=2]                   |\r\n| 12.22% (  4897)       | 36.36% (+13.32%)       | [term=5, verificationStatus=1]                   |\r\n| 21.04% (  8430)       | 32.77% ( +9.73%)       | [verificationStatus=2, ficoRangeHigh=[664, 687)] |\r\n| 5.90% (  2364)        | 37.13% (+14.08%)       | [term=5, n14=3]                                  |\r\n| 53.32% ( 21365)       | 28.40% ( +5.36%)       | [ficoRangeHigh=[664, 687)]                       |\r\n| ...                   | ...                    | ...                                              |\r\n\r\nIn this result, there are three columns: **Error Coverage (Count)**, **Error Rate(Overall +N%)**, and **Result**.   \r\n**Error Coverage** means the actual proportion of rows of the current result in the prediction error data.   \r\n**Count** means the actual count of rows.  \r\n**Error Rate** means the rate of prediction errors in the data of the given value combination. \r\n**(Overall +N%)** means how much higher the error rate is than the overall error rate in the total data table.  \r\n**Result** is the given value combination.  \r\n",
    "bugtrack_url": null,
    "license": null,
    "summary": "MDCA: Multi-dimensional Data Combination Analysis. It's used to analysis data table through multi-dimensional data combinations. Multi-dimensional distribution, fairness, and model error analysis are supported.",
    "version": "0.1.16",
    "project_urls": {
        "Bug Tracker": "https://github.com/jingjiajie/mdca/issues",
        "Homepage": "https://github.com/jingjiajie/mdca",
        "Repository": "https://github.com/jingjiajie/mdca.git"
    },
    "split_keywords": [
        "multi-dimensional",
        " multidimensional",
        " distribution",
        " fairness",
        " model fairness",
        " error analysis",
        " model error",
        " model error analysis"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "529d721d6d7d047e3724206d26b4301be663eb7e1e0b3287fcde54f340bbfdc1",
                "md5": "4b96a0d3f23b42175eae4575ab04912f",
                "sha256": "124c965756b597bf2c9018229d1cdaeedc9773949f8172b6cb3c624baab62f26"
            },
            "downloads": -1,
            "filename": "mdca-0.1.16-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "4b96a0d3f23b42175eae4575ab04912f",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.8",
            "size": 26828,
            "upload_time": "2025-03-15T04:31:50",
            "upload_time_iso_8601": "2025-03-15T04:31:50.619654Z",
            "url": "https://files.pythonhosted.org/packages/52/9d/721d6d7d047e3724206d26b4301be663eb7e1e0b3287fcde54f340bbfdc1/mdca-0.1.16-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "70cfc600520a3a088e97e6d136f38f3b7d567d4036c864c6f179ec753da0a9ad",
                "md5": "8800597d9f92487fc5a6485bed16aa8a",
                "sha256": "d692d727f781c1d1e52f5a9f8a77f0793c59d16eee008d866b32bf9c1477c829"
            },
            "downloads": -1,
            "filename": "mdca-0.1.16.tar.gz",
            "has_sig": false,
            "md5_digest": "8800597d9f92487fc5a6485bed16aa8a",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.8",
            "size": 25367,
            "upload_time": "2025-03-15T04:31:51",
            "upload_time_iso_8601": "2025-03-15T04:31:51.729071Z",
            "url": "https://files.pythonhosted.org/packages/70/cf/c600520a3a088e97e6d136f38f3b7d567d4036c864c6f179ec753da0a9ad/mdca-0.1.16.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-03-15 04:31:51",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "jingjiajie",
    "github_project": "mdca",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "requirements": [
        {
            "name": "bitarray",
            "specs": [
                [
                    "==",
                    "3.1.0"
                ]
            ]
        },
        {
            "name": "numpy",
            "specs": [
                [
                    "==",
                    "2.2.3"
                ]
            ]
        },
        {
            "name": "pandas",
            "specs": [
                [
                    "==",
                    "2.2.3"
                ]
            ]
        },
        {
            "name": "scipy",
            "specs": [
                [
                    "==",
                    "1.15.2"
                ]
            ]
        }
    ],
    "lcname": "mdca"
}
        