shapley-decomposition

- Name: shapley-decomposition
- Version: 0.0.2
- Home page: https://github.com/canitez01/shapley_decomposition
- Summary: Decomposition using shapley values
- Upload time: 2024-11-07 22:12:50
- Author: Can Itez
- License: MIT
- Keywords: python, data analysis, descriptive analysis, shapley values, owen values, decomposition

# Shapley Decomposition

This package provides two applications of Shapley values in descriptive analysis: 1) a generalized module for decomposing the change between two instances of an equation using Shapley values[^1] (initially influenced by the World Bank's Job Structure tool[^2]), and 2) Shapley and Owen value based decomposition of R^2 (the contribution of each independent variable to a goodness-of-fit metric, R^2 in this case) for linear regression models[^3].

## Notes

The aim of this package is to help decompose the effect of changing unknowns/variables between two instances of an equation. Being able to decompose the contribution of variables does not mean that the results are always clearly interpretable. Many features of the variables, such as scale, relation mode and change dynamics (slow paced/fast paced, instant/lagged), deserve attention when interpreting their individual contributions to the change or result.

For both applications, computation time increases exponentially with the number of variables, because the calculation runs over the powerset of the variables, which has 2^n elements.
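To make the growth concrete, the sketch below enumerates a powerset with the standard library and counts its elements for a few values of n. It is only an illustration: `powerset` is a hypothetical helper, not the package's own `samples()` implementation.

```python
from itertools import chain, combinations


def powerset(variables):
    """Return every subset of `variables`, from the empty set to the full set."""
    items = list(variables)
    return list(chain.from_iterable(combinations(items, k) for k in range(len(items) + 1)))


for n in (3, 5, 10, 15):
    names = [f"x{i}" for i in range(1, n + 1)]
    # The number of subsets doubles with every extra variable (2**n).
    print(n, len(powerset(names)))
```

Going from 10 to 15 variables multiplies the number of subsets by 32, which is why the variable limit described later exists.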

Shapley value:

$v(i) = \sum \limits _{S \subseteq M \setminus \{i\}} \phi(s) \cdot [V(S \cup \{i\})-V(S)]$

$\phi(s) = (m-1-s)! \cdot s!/m!$

where $i \in M$, M is the main set of variables, $m=|M|$ and $s=|S|$. For the Shapley change decomposition, the marginal term becomes $[V(S \cup \{i_{t_1} \})-V(S\cup \{i_{t_0} \})]$, where S is the set of variables taken at their $t_1$ instance (the remaining variables stay at $t_0$) and s is the number of variables in S.
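The change version of the formula can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the package's implementation: `shapley_change` here is a hypothetical stand-alone function that takes a callable instead of a relation string, and the numbers are made up.

```python
from itertools import combinations
from math import factorial


def shapley_change(func, x0, x1):
    """Split func(x1) - func(x0) into one Shapley contribution per variable.

    func: callable taking a dict {name: value}
    x0, x1: dicts holding the variable values at instance t0 and t1
    """
    names = list(x0)
    m = len(names)
    contributions = {}
    for i in names:
        others = [n for n in names if n != i]
        total = 0.0
        for s in range(m):  # s = number of other variables already at t1
            weight = factorial(m - 1 - s) * factorial(s) / factorial(m)
            for subset in combinations(others, s):
                # Variables in `subset` take their t1 value, the rest stay at t0.
                base = {n: (x1[n] if n in subset else x0[n]) for n in others}
                with_t1 = func({**base, i: x1[i]})
                with_t0 = func({**base, i: x0[i]})
                total += weight * (with_t1 - with_t0)
        contributions[i] = total
    return contributions


# Toy identity y = x1 * x2, moving from (2, 3) to (4, 5): y changes from 6 to 20.
x0 = {"x1": 2.0, "x2": 3.0}
x1 = {"x1": 4.0, "x2": 5.0}
result = shapley_change(lambda v: v["x1"] * v["x2"], x0, x1)
print(result)  # {'x1': 8.0, 'x2': 6.0}
```

The two contributions add up to the total change of 14, which is the property that leaves no residual in the decomposition.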

Owen value:

$o(i) = \sum \limits _{R \subseteq N \setminus \{k\}} \sum \limits _{T \subseteq B_k \setminus \{i\}} \phi(r) \cdot \omega(t) \cdot [V(Q \cup T \cup \{i\})-V(Q \cup T)]$

$\phi(r) = (n-1-r)! \cdot r!/n!$

$\omega(t) = (b_k-1-t)! \cdot t!/b_k!$

where $i \in M$ and M is the main set of variables. N is the set of coalitions/groups into which the variables are partitioned, and $B_k$ is the coalition that contains i. $Q = \bigcup_{j \in R}B_j$ and $n=|N|, r=|R|, b_k=|B_k|, t=|T|$.
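The double sum can be followed more easily in code. The sketch below is a direct, unoptimized transcription of the Owen formula for an arbitrary value function; `owen_values`, `subsets` and the toy game `all_three` are hypothetical names for illustration and are not part of the package.

```python
from itertools import combinations
from math import factorial


def subsets(items):
    """Yield every subset of `items` as a tuple."""
    items = list(items)
    for k in range(len(items) + 1):
        yield from combinations(items, k)


def owen_values(value, groups):
    """Owen value of each variable for a given coalition structure.

    value: callable taking a frozenset of variable names and returning V(.)
    groups: list of lists that partitions the variables
    """
    n = len(groups)
    result = {}
    for k, group in enumerate(groups):
        other_groups = [g for j, g in enumerate(groups) if j != k]
        b_k = len(group)
        for i in group:
            rest = [v for v in group if v != i]
            total = 0.0
            for R in subsets(other_groups):  # whole groups that joined before B_k
                r = len(R)
                phi = factorial(n - 1 - r) * factorial(r) / factorial(n)
                Q = frozenset(v for g in R for v in g)
                for T in subsets(rest):  # members of B_k that joined before i
                    t = len(T)
                    omega = factorial(b_k - 1 - t) * factorial(t) / factorial(b_k)
                    base = Q | frozenset(T)
                    total += phi * omega * (value(base | {i}) - value(base))
            result[i] = total
    return result


def all_three(coalition):
    """Toy game: value is created only when all three variables are present."""
    return 1.0 if len(coalition) == 3 else 0.0


grouped = owen_values(all_three, [["a", "b"], ["c"]])
singletons = owen_values(all_three, [["a"], ["b"], ["c"]])
print(grouped)     # a and b split their group's half: 0.25, 0.25 and 0.5 for c
print(singletons)  # every variable gets 1/3, the plain Shapley value
```

In the grouped case the two groups first split the total equally, and the group share is then split among its members. With singleton groups the inner sum collapses and the result coincides with the Shapley value.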

## Installation

Run the following to install:

```bash
pip install shapley_decomposition
```

## Workings

The `shapley_decomposition.shapley_change` module consists of three functions: `samples()`, `shapley_values()` and `decomposition()`. `shapley_change.samples(dataframe)` returns the cartesian products of variable-instance couples. `shapley_change.shapley_values(dataframe, "your function")` returns the weighted differences for each variable, the sum of which gives that variable's Shapley value. `shapley_change.decomposition(dataframe, "your function")` returns the total change decomposed into variable contributions. The functions of the shapley_change module accept the **data** input, the **function** input, or both:

1. The structure of the input data is **important**. The module accepts pandas dataframes or 2d arrays:
  * If a pandas dataframe is used as input, both the dependent variable and the independent variables should be presented in the following format (variable names as index and years as columns):

    |  | year1 | year2 |
    | --- | ----------- | ----|
    | **y** | y_value | y_value |
    | **x1** | x1_value | x1_value |
    | **x2** | x2_value | x2_value |
    | **...** | ... | ... |
    | **xn** | xn_value | xn_value |

  * If an array is preferred, note that the module will convert it to a pandas dataframe and expects y and the xs in the following order:
    ```
    [[y_value, y_value],
     [x1_value, x1_value],
     [x2_value, x2_value],
     ...]
    ```
2. The function defines the relation between the xs and y. Because of the nature of the Shapley decomposition, the sum of the xs' contributions must equal the change in y, so the relation must reproduce the given y values (within a tolerance of ±0.0001 in this module, to allow for the residue of arithmetic operations) and there is no place for residuals. An input relation that fails to reproduce the given y raises a specific error. The function input is expected as text and is evaluated by a custom parser (the eval() function is avoided due to security risks). The expected format for the function input is the right-hand side of the equation:

    * `"x1+x2*(x3/x4)**x5"`
    * `"(x1+x2)*x3+x4"`
    * `"x1*x2**2"`

    All arithmetic operators and parentheses are usable:
    * `"+" , "-" , "*" , "/" or "÷", "**" or "^"`

3. If `shapley_change.decomposition(df,"your function", cagr=True)` is called, a yearly_growth column (using the compound annual growth rate, CAGR) is added, which indexes the decomposition to the CAGR of y. The default is `cagr=False`.
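One way to evaluate a relation string without `eval()` is to walk Python's own syntax tree and allow only arithmetic nodes. The sketch below shows that idea with the standard `ast` module. It is an assumption about how such a parser could look, not the package's custom parser, and `evaluate_relation` is a hypothetical name.

```python
import ast
import operator

# Binary operators allowed in a relation string.
_BINARY = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.Pow: operator.pow,
}


def evaluate_relation(relation, values):
    """Evaluate a right-hand side such as "x1*x2**2" without calling eval().

    values: dict mapping variable names ("x1", "x2", ...) to numbers
    """
    # Map the alternative spellings "^" and the division sign onto Python operators.
    normalized = relation.replace("^", "**").replace("\u00f7", "/")
    tree = ast.parse(normalized, mode="eval")

    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY:
            return _BINARY[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -walk(node.operand)
        if isinstance(node, ast.Name):
            return values[node.id]
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        # Function calls, attribute access and anything else are rejected.
        raise ValueError(f"unsupported element in relation: {ast.dump(node)}")

    return walk(tree)


values = {"x1": 2.0, "x2": 3.0, "x3": 8.0, "x4": 4.0, "x5": 2.0}
print(evaluate_relation("x1+x2*(x3/x4)**x5", values))  # 14.0
print(evaluate_relation("(x1+x2)*x3+x4", values))      # 44.0
print(evaluate_relation("x1*x2^2", values))            # 18.0
```

A tolerance check in the spirit of the one described above would then compare the evaluated value with the given y, for example `abs(evaluate_relation(relation, values) - y) <= 0.0001`.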

The `shapley_decomposition.shapley_r2` module also consists of three functions: `samples()`, `shapley_decomposition()` and `owen_decomposition()`. `shapley_r2.samples(dataframe)` returns the powerset variable pairs that the model uses. `shapley_r2.shapley_decomposition(dataframe)` returns the decomposition of the model R^2 into the contributions of the variables. `shapley_r2.owen_decomposition(dataframe, [["x1","x2"],[..]])` returns the Owen decomposition of the model R^2 into the contributions of the variables and of the groups/coalitions. The inputs expected by the shapley_r2 functions are as follows:

  1. The expected format for the input dataframe or array is:

  |  | x1 | x2 | ... | xn | y |  
  | --- | --- | --- | --- | --- | --- |
  | **0** | x1_value | x2_value | ... | xn_value | y_value |
  | **1** | x1_value | x2_value | ... | xn_value | y_value |
  | **2** | x1_value | x2_value | ... | xn_value | y_value |
  | **...** | ... | ... | ... | ... | ... |
  | **n** | x1_value | x2_value | ... | xn_value | y_value |


  2. `shapley_r2.owen_decomposition` expects the group/coalition structure as the second input. This input should be a list of lists, with each inner list holding the variables of one coalition/group. For example, suppose a model of 8 variables, x1,x2,...,x8, has three groups/coalitions formed as group1:(x1,x2,x3), group2:(x4) and group3:(x5,x6,x7,x8). Then the second input of owen_decomposition should be `[["x1","x2","x3"],["x4"],["x5","x6","x7","x8"]]`. Even a singleton such as group2, which has only x4, must be given as a list. If every group is a singleton, the Owen values are equal to the Shapley values.

  3. Because the computation time increases exponentially with the number of variables, the shapley_decomposition function has a default upper limit of 10 variables. The same limit applies to owen_decomposition, but it counts the number of groups, not individual variables. At the user's own discretion, more variables can be forced by calling the function as `shapley_r2.shapley_decomposition(df, force=True)` or `shapley_r2.owen_decomposition(df, [groups], force=True)`.
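The idea behind the R^2 decomposition can be sketched without the package: fit an ordinary least squares model for every subset of regressors and apply the Shapley weights to the resulting R^2 gains. The code below is a dependency-free illustration with made-up data. `r_squared` and `shapley_r2` are hypothetical helpers, and the package's own fitting routine and input handling may differ.

```python
from itertools import combinations
from math import factorial


def r_squared(columns, y):
    """R^2 of an OLS fit of y on the given columns plus an intercept."""
    n = len(y)
    y_mean = sum(y) / n
    ss_tot = sum((v - y_mean) ** 2 for v in y)
    if not columns:
        return 0.0  # the intercept-only model explains nothing
    rows = [[1.0] + [col[r] for col in columns] for r in range(n)]
    p = len(rows[0])
    # Normal equations (X'X) b = X'y, stored as an augmented matrix.
    a = [[sum(row[i] * row[j] for row in rows) for j in range(p)]
         + [sum(row[i] * v for row, v in zip(rows, y))] for i in range(p)]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(p):
        pivot = max(range(c, p), key=lambda r: abs(a[r][c]))
        a[c], a[pivot] = a[pivot], a[c]
        for r in range(p):
            if r != c:
                factor = a[r][c] / a[c][c]
                a[r] = [u - factor * w for u, w in zip(a[r], a[c])]
    beta = [a[i][p] / a[i][i] for i in range(p)]
    ss_res = sum((v - sum(b * x for b, x in zip(beta, row))) ** 2
                 for row, v in zip(rows, y))
    return 1.0 - ss_res / ss_tot


def shapley_r2(data, target):
    """Shapley share of the full-model R^2 for each regressor in `data`."""
    names = [k for k in data if k != target]
    m = len(names)
    shares = {}
    for i in names:
        others = [k for k in names if k != i]
        total = 0.0
        for s in range(m):
            weight = factorial(m - 1 - s) * factorial(s) / factorial(m)
            for subset in combinations(others, s):
                with_i = r_squared([data[k] for k in subset + (i,)], data[target])
                without_i = r_squared([data[k] for k in subset], data[target])
                total += weight * (with_i - without_i)
        shares[i] = total
    return shares


# Made-up data: two regressors and a dependent variable.
data = {
    "x1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "x2": [2.0, 1.0, 4.0, 3.0, 6.0, 5.0],
    "y": [3.1, 2.9, 7.2, 6.8, 11.1, 10.9],
}
shares = shapley_r2(data, "y")
full_r2 = r_squared([data["x1"], data["x2"]], data["y"])
print(shares, full_r2)
```

The shares add up to the R^2 of the full model, so dividing each share by that total gives a relative contribution of the kind reported in the `contribution` column of the examples.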

## Examples

1. Since the first influence for the model was the World Bank's Job Structure tool, the first example is the decomposition of the change in value added per capita of Turkey from 2000 to 2018 according to `"x1*x2*x3*x4"`, where x1 is value added per worker, x2 is the employment rate, x3 is the participation rate and x4 is the share of the 15-64 population in the total population. This is an identity.

  ```python
  import pandas
  from shapley_decomposition import shapley_change

  df=pandas.DataFrame([[8237.599210,15026.707520],[27017.637990,43770.525560],[0.935050,0.891050],[0.515090,0.57619],[0.633046,0.668674]],index=["val_ad_pc","val_ad_pw","emp_rate","part_rate","working_age"], columns=[2000,2018])
  print(df)
  ```
  |  | 2000 | 2018 |
  | --- | ----------- | ----|
  | **val_ad_pc** | 8237.599210 | 15026.707520 |
  | **val_ad_pw** | 27017.637990 | 43770.525560 |
  | **emp_rate** | 0.935050 | 0.891050 |
  | **part_rate** | 0.515090 | 0.57619 |
  | **working_age** | 0.633046 | 0.668674 |

  ```python
  shapley_change.decomposition(df,"x1*x2*x3*x4")
  ```
  |  | 2000 | 2018 | dif | shapley | contribution |
  | --- | --- | --- | --- | --- | --- |
  | **val_ad_pc** | 8237.599210 | 15026.707520 | 6789.108310 | 6789.108310 | 1.000000 |
  | **val_ad_pw** | 27017.637990 | 43770.525560 | 16752.887570 | 5431.365538 | 0.800012 |
  | **emp_rate** | 0.935050 | 0.891050 | -0.044000 | -556.985657 | -0.082041 |
  | **part_rate** | 0.515090 | 0.576190 | 0.061100 | 1285.200011 | 0.189303 |
  | **working_age** | 0.633046 | 0.668674 | 0.035628 | 629.528410 | 0.092726 |

2. The second example is the decomposition of the change in the non-parametric skewness of a normally distributed sample after the sample is altered with additional data. The goal is to understand how the changes in mean, median and standard deviation contributed to the change in the skewness parameter. Non-parametric skewness is calculated by `"(x1-x2)/x3"`, i.e. (mean-median)/standard deviation.

  ```python
  import numpy as np
  import pandas
  from shapley_decomposition import shapley_change

  np.random.seed(210)

  data = np.random.normal(loc=0, scale=1, size=100)

  add = [np.random.uniform(min(data), max(data)) for _ in range(5)]  # five extra draws within the sample range

  altered_data = np.concatenate([data,add])

  med1, med2 = np.median(data), np.median(altered_data)
  mean1, mean2 = np.mean(data), np.mean(altered_data)
  std1, std2 = np.std(data, ddof=1), np.std(altered_data, ddof=1)
  sk1 = (np.mean(data)-np.median(data))/np.std(data, ddof=1)
  sk2 = (np.mean(altered_data)-np.median(altered_data))/np.std(altered_data, ddof=1)

  df=pandas.DataFrame([[sk1,sk2],[mean1,mean2],[med1,med2],[std1,std2]], columns=["0","1"], index=["non_par_skew","mean","median","std"])

  shapley_change.decomposition(df,"(x1-x2)/x3")
  ```
  |  | 0 | 1 | dif | shapley | contribution |
  | --- | --- | --- | --- | --- | --- |
  | **non_par_skew** | 0.065803 | 0.044443 | -0.021359 | -0.021359 | 1.000000 |
  | **mean** | -0.247181 | -0.285440 | -0.038259 | -0.036146 | 1.692288 |
  | **median** | -0.315957 | -0.333088 | -0.017131 | 0.016184 | -0.757719 |
  | **std** | 1.045188 | 1.072090 | 0.026902 | -0.001398 | 0.065432 |

3. The third example uses the shapley_r2 decomposition with the Fish Market dataset from Kaggle[^4]:

  ```python
  import numpy as np
  import pandas
  from shapley_decomposition import shapley_r2

  df=pandas.read_csv("Fish.csv")
  #ignoring the species column
  shapley_r2.shapley_decomposition(df.iloc[:,1:])
  ```
  | | shapley_values | contribution |
  | --- | --- | --- |
  | **Length1** | 0.194879 | 0.220131 |
  | **Length2** | 0.195497 | 0.220829 |
  | **Length3** | 0.198097 | 0.223766 |
  | **Height** | 0.116893 | 0.132040 |
  | **Width** | 0.179920 | 0.203233 |

  ```python
  #using the same dataframe

  groups = [["Length1","Length2","Length3"],["Height","Width"]]

  shapley_r2.owen_decomposition(df.iloc[:,1:], groups)
  ```



  | | owen_values | contribution | group_owen |
  | --- | --- | --- | --- |
  | **Length1** | 0.157523 | 0.177934 | b1 |
  | **Length2** | 0.158178 | 0.178674 | b1 |
  | **Length3** | 0.160276 | 0.181045 | b1 |
  | **Height** | 0.141092 | 0.159374 | b2 |
  | **Width** | 0.268218 | 0.302972 | b2 |

  | | owen_values | contribution |
  | --- | --- | --- |
  | **b1** | 0.475977 | 0.537653 |
  | **b2** | 0.409309 | 0.462347 |

  ```python
  # if the species are included as categorical variables

  from sklearn.preprocessing import OneHotEncoder

  enc = OneHotEncoder(handle_unknown="ignore")
  encti = enc.fit_transform(df[["Species"]]).toarray()

  # one column per species category: Species0, Species1, ...
  species_cols = ["Species"+str(n) for n in range(encti.shape[1])]
  df2 = pandas.concat([pandas.DataFrame(encti,columns=species_cols),df.iloc[:,1:]],axis=1)
  shapley_r2.shapley_decomposition(df2, force=True) # force=True because there are more than 10 variables

  groups=[["Species0","Species1","Species2","Species3","Species4","Species5","Species6"],["Length1","Length2","Length3"],["Height","Width"]]

  shapley_r2.owen_decomposition(df2, groups) #no need for force as the number of groups does not exceed 10
  ```

[^1]: https://www.rand.org/content/dam/rand/pubs/papers/2021/P295.pdf
[^2]: https://datatopics.worldbank.org/jobsdiagnostics/jobs-tools.html
[^3]: https://www.scitepress.org/papers/2017/61137/
[^4]: https://www.kaggle.com/datasets/aungpyaeap/fish-market?resource=download

            
