# <a name="contents">Machine learning data partition tools</a>
- [Contents](#contents)
- [Author](#author)
- [Purpose](#purpose)
- [Installation](#installation)
- [Synopsis](#synopsis)
- [optimize_traintest_split()](#otts)
- [optimize_traindevtest_split()](#otdts)
- [binning()](#binning)
- [Usage](#usage)
- [Example 1: Split dummy data into training and test partitions](#example1)
- [Example 2: Split dummy data into training, development, and test partitions](#example2)
- [Example 3: Split dummy data into training, development, and test partitions, the target and several stratification variables being numeric](#example3)
- [Algorithm](#algorithm)
- [How to interpret the returned info dict](#interpretation)
- [Further documentation](#doc)
## <a name="author">Author</a>
Uwe Reichel, audEERING GmbH, Gilching, Germany
## <a name="purpose">Purpose</a>
* machine learning data splitting tool that allows for:
* group-disjunct splits (e.g. different speakers in train, dev, and test partition)
* stratification on multiple target and grouping variables (e.g. emotion, gender, language)
## <a name="installation">Installation</a>
### From PyPI
* set up a virtual environment `venv_splitutils`, activate it, and install `splitutils`. For Linux this works e.g. as follows:
```bash
$ virtualenv --python="/usr/bin/python3" venv_splitutils
$ source venv_splitutils/bin/activate
(venv_splitutils) $ pip install splitutils
```
### From GitHub
* project URL: https://github.com/reichelu/splitutils
* set up a virtual environment `venv_splitutils`, activate it, and install the requirements. For Linux this works e.g. as follows:
```bash
$ git clone git@github.com:reichelu/splitutils.git
$ cd splitutils/
$ virtualenv --python="/usr/bin/python3" venv_splitutils
$ source venv_splitutils/bin/activate
(venv_splitutils) $ pip install -r requirements.txt
```
## <a name="synopsis">Synopsis</a>
### <a name="otts">optimize_traintest_split()</a>
```python
def optimize_traintest_split(X, y, split_on, stratify_on, weight=None,
test_size=.1, k=30, seed=42):
''' optimize group-disjunct split into training and test set which is guided by:
- disjunct split of values in SPLIT_ON
- stratification by all keys in STRATIFY_ON (targets and groupings)
- test set proportion in X should be close to test_size (which is the test
proportion in set(split_on))
Parameters:
X: (np.array or pd.DataFrame) of features
y: (np.array) of targets of length N
if the entries of y are of type str or int, y is assumed to be categorical,
and it is additionally tested that all partitions cover all classes.
Otherwise, y is assumed to be numeric, and no coverage test is done.
split_on: (np.array) list of length N with grouping variable (e.g. speaker IDs),
on which the group-disjunct split is to be performed. Must be categorical.
stratify_on: (dict) Dict-keys are variable names (targets and/or further groupings)
the split should be stratified on (groupings could e.g. be sex, age class, etc).
Dict-Values are np.array-s of length N that contain the variable values. All
variables must be categorical.
weight: (dict) weight for each variable in stratify_on; defines its
contribution to the optimization score. Uniform weighting by default.
The additional key "size_diff" defines how the deviation from the
intended test size is weighted.
test_size: (float) test proportion in set(split_on), e.g. 10% of speakers
to be held out
k: (int) number of different splits to be tried out
seed: (int) random seed
Returns:
train_i: (np.array) train set indices in X
test_i: (np.array) test set indices in X
info: (dict) detail information about reference and achieved prob distributions
"size_testset_in_spliton": intended test_size
"size_testset_in_X": optimized test proportion in X
"p_ref_{c}": reference class distribution calculated from stratify_on[c]
"p_test_{c}": test set class distribution calculated from stratify_on[c][test_i]
'''
```
### <a name="otdts">optimize_traindevtest_split()</a>
```python
def optimize_traindevtest_split(X, y, split_on, stratify_on, weight=None,
dev_size=.1, test_size=.1, k=30, seed=42):
''' optimize group-disjunct split into training, dev, and test set, which is
guided by:
- disjunct split of values in SPLIT_ON
- stratification by all keys in STRATIFY_ON (targets and groupings)
- test set proportion in X should be close to test_size (which is the test
proportion in set(split_on))
Parameters:
X: (np.array or pd.DataFrame) of features
y: (np.array) of targets of length N
if the entries of y are of type str or int, y is assumed to be categorical,
and it is additionally tested that all partitions cover all classes.
Otherwise, y is assumed to be numeric, and no coverage test is done.
split_on: (np.array) list of length N with grouping variable (e.g. speaker IDs),
on which the group-disjunct split is to be performed. Must be categorical.
stratify_on: (dict) Dict-keys are variable names (targets and/or further groupings)
the split should be stratified on (groupings could e.g. be sex, age class, etc).
Dict-Values are np.array-s of length N that contain the variable values. All
variables must be categorical.
weight: (dict) weight for each variable in stratify_on; defines its
contribution to the optimization score. Uniform weighting by default.
The additional key "size_diff" defines how the deviations from the
intended dev and test sizes are weighted.
dev_size: (float) dev proportion in set(split_on), e.g. 10% of speakers
to be held out
test_size: (float) test proportion in set(split_on) for test set
k: (int) number of different splits to be tried out
seed: (int) random seed
Returns:
train_i: (np.array) train set indices in X
dev_i: (np.array) dev set indices in X
test_i: (np.array) test set indices in X
info: (dict) detail information about reference and achieved prob distributions
"dev_size_in_spliton": intended grouping dev_size
"dev_size_in_X": optimized dev proportion of observations in X
"test_size_in_spliton": intended grouping test_size
"test_size_in_X": optimized test proportion of observations in X
"p_ref_{c}": reference class distribution calculated from stratify_on[c]
"p_dev_{c}": dev set class distribution calculated from stratify_on[c][dev_i]
"p_test_{c}": test set class distribution calculated from stratify_on[c][test_i]
'''
```
### <a name="binning">binning()</a>
```python
def binning(x, nbins=2, lower_boundaries=None, seed=42):
'''
bins numeric data.
If x is one-dimensional:
binning is done either intrinsically into nbins classes
based on an equidistant percentile split, or extrinsically
by using the lower_boundaries values.
If x is two-dimensional:
binning is done by kmeans clustering into nbins clusters.
Parameters:
x: (list, np.array) with numeric data.
nbins: (int) number of bins
lower_boundaries: (list) of lower bin boundaries.
If x is 1-dim and lower_boundaries is provided, nbins is ignored
and x is binned extrinsically. The first value of lower_boundaries
is always corrected not to be higher than min(x).
seed: (int) random seed for kmeans
Returns:
c: (np.array) integers as bin IDs
'''
```
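A short usage sketch based on the signature above; the data and parameter values are illustrative only:
```python
import numpy as np
from splitutils import binning

x = np.random.rand(200)

# intrinsic binning: 4 classes via equidistant percentile boundaries
b_intrinsic = binning(x, nbins=4)

# extrinsic binning: classes defined by lower bin boundaries
# (the first boundary is corrected down to min(x) if necessary)
b_extrinsic = binning(x, lower_boundaries=[0.0, 0.25, 0.5, 0.75])

# 2-dimensional input: kmeans clustering into nbins clusters
X2 = np.random.rand(200, 3)
b_kmeans = binning(X2, nbins=4, seed=42)
```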
## <a name="usage">Usage</a>
### <a name="example1">Example 1: Split dummy data into training and test partitions</a>
* see `scripts/run_traintest_split.py`
* partitions are:
* disjunct on categorical "split_var"
* stratified on categorical "target", "strat_var1", "strat_var2"
* each covering all levels of "target"
```python
import numpy as np
import os
import pandas as pd
import sys
# add this line if you have cloned the code from github to PROJECT_DIR
# sys.path.append(PROJECT_DIR)
from splitutils import optimize_traintest_split
# set seed
seed = 42
np.random.seed(seed)
# DUMMY DATA
# size
n = 100
# feature array
data = np.random.rand(n, 20)
# target variable
target = np.random.choice(["A", "B"], size=n, replace=True)
# array with variable on which to do a disjunct split
split_var = np.random.choice(["D", "E", "F", "G", "H", "I", "J", "K"],
size=n, replace=True)
# dict of variables to stratify on. Key names are arbitrary.
stratif_vars = {
"target": target,
"strat_var1": np.random.choice(["L", "M"], size=n, replace=True),
"strat_var2": np.random.choice(["N", "O"], size=n, replace=True)
}
# ARGUMENTS
# weight importance of all stratification variables in stratify_on
# as well as of "size_diff", which penalizes the deviation between
# intended and obtained partition sizes.
# Key names must match the names in stratif_vars.
weights = {
"target": 2,
"strat_var1": 1,
"strat_var2": 1,
"size_diff": 1
}
# test partition proportion (from 0 to 1)
test_size = .2
# number of disjunct splits to be tried out in brute force optimization
k = 30
# FIND OPTIMAL SPLIT
train_i, test_i, info = optimize_traintest_split(
X=data,
y=target,
split_on=split_var,
stratify_on=stratif_vars,
weight=weights,
test_size=test_size,
k=k,
seed=seed
)
# SOME OUTPUT
print("test levels of split_var:", sorted(set(split_var[test_i])))
print("goodness of split:", info)
```
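The returned index arrays can then be applied directly to the feature array and targets:
```python
# materialize the partitions (continues the example above)
X_train, y_train = data[train_i], target[train_i]
X_test, y_test = data[test_i], target[test_i]
```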
### <a name="example2">Example 2: Split dummy data into training, development, and test partitionsy</a>
* see `scripts/run_traindevtest_split.py`
* partitions are:
* disjunct on categorical "split_var"
* stratified on categorical "target", "strat_var1", "strat_var2"
* each covering all levels of "target"
```python
import numpy as np
import os
import pandas as pd
import sys
# add this line if you have cloned the code from github to PROJECT_DIR
# sys.path.append(PROJECT_DIR)
from splitutils import optimize_traindevtest_split
# set seed
seed = 42
np.random.seed(seed)
# DUMMY DATA
# size
n = 100
# feature array
data = np.random.rand(n, 20)
# target variable
target = np.random.choice(["A", "B"], size=n, replace=True)
# array with variable on which to do a disjunct split
split_var = np.random.choice(["D", "E", "F", "G", "H", "I", "J", "K"],
size=n, replace=True)
# dict of variables to stratify on. Key names are arbitrary.
stratif_vars = {
"target": target,
"strat_var1": np.random.choice(["F", "G"], size=n, replace=True),
"strat_var2": np.random.choice(["H", "I"], size=n, replace=True)
}
# ARGUMENTS
# weight importance of all stratification variables in stratify_on
# as well as of "size_diff", which penalizes the deviation between
# intended and obtained partition sizes.
# Key names must match the names in stratif_vars.
weights = {
"target": 2,
"strat_var1": 1,
"strat_var2": 1,
"size_diff": 1
}
# dev and test partition proportion (from 0 to 1)
dev_size = .1
test_size = .1
# number of disjunct splits to be tried out in brute force optimization
k = 30
# FIND OPTIMAL SPLIT
train_i, dev_i, test_i, info = optimize_traindevtest_split(
X=data,
y=target,
split_on=split_var,
stratify_on=stratif_vars,
weight=weights,
dev_size=dev_size,
test_size=test_size,
k=k,
seed=seed
)
# SOME OUTPUT
print("test levels of split_var:", sorted(set(split_var[test_i])))
print("goodness of split:", info)
```
### <a name="example3">Example 3: Split dummy data into training, development, and test partitions, the target and several stratification variables being numeric</a>
* see `scripts/run_traindevtest_split_with_binning.py`
* partitions are:
* disjunct on categorical "split_var"
* stratified on numeric "target" and on 3 other numeric stratification variables
```python
import numpy as np
import os
import pandas as pd
import sys
# add this line if you have cloned the code from github to PROJECT_DIR
# sys.path.append(PROJECT_DIR)
from splitutils import (
binning,
optimize_traindevtest_split
)
"""
example script showing how to split dummy data into training, development,
and test partitions that are
* disjunct on categorical "split_var"
* stratified on numeric "target", and on 3 other numeric stratification
variables
"""
# set seed
seed = 42
np.random.seed(seed)
# DUMMY DATA
# size
n = 100
# features
data = np.random.rand(n, 20)
# numeric target variable
num_target = np.random.rand(n)
# array with variable on which to do a disjunct split
split_var = np.random.choice(["D", "E", "F", "G", "H", "I", "J", "K"],
size=n, replace=True)
# further numeric variables to stratify on
num_strat_vars = np.random.rand(n, 3)
# intrinsically bin target into 3 bins by equidistant
# percentile boundaries
binned_target = binning(num_target, nbins=3)
# ... alternatively, a variable can be extrinsically binned by
# specifying lower boundaries:
# binned_target = binning(num_target, lower_boundaries=[0, 0.33, 0.66])
# bin the other stratification variables jointly into a single variable
# with 6 bins (2-dim input is binned by standard scaling followed by
# kmeans clustering)
binned_strat_var = binning(num_strat_vars, nbins=6)
# ... alternatively, each stratification variable could be binned
# individually, intrinsically or extrinsically, the same way as num_target
# strat_var1 = binning(num_strat_vars[:,0], nbins=...) etc.
# now add the obtained categorical variables to the stratification dict
stratif_vars = {
"target": binned_target,
"strat_var": binned_strat_var
}
# ARGUMENTS
# weight importance of all stratification variables in stratify_on
# as well as of "size_diff", which penalizes the deviation between
# intended and obtained partition sizes
weights = {
"target": 2,
"strat_var": 1,
"size_diff": 1
}
# dev and test partition proportion (from 0 to 1)
dev_size = .1
test_size = .1
# number of disjunct splits to be tried out in brute force optimization
k = 30
# FIND OPTIMAL SPLIT
train_i, dev_i, test_i, info = optimize_traindevtest_split(
X=data,
y=num_target,
split_on=split_var,
stratify_on=stratif_vars,
weight=weights,
dev_size=dev_size,
test_size=test_size,
k=k,
seed=seed
)
# SOME OUTPUT
print("test levels of split_var:", sorted(set(split_var[test_i])))
print("goodness of split:", info)
```
## <a name="algorithm">Algorithm</a>
* find optimal train, dev, and test set split based on:
* disjunct split of a categorical grouping variable *G* (e.g. speaker)
* optimized joint stratification on an arbitrary number of categorical target and grouping variables (e.g. emotion, gender, ...)
* close match of partition proportions in *G* and underlying dataset *X*
* brute-force optimization on *k* disjunct splits of *G*
* **score to be minimized for train/test set split:**
```
(sum_v[w(v) * irad(v)] + w(d) * d) / (sum_v[w(v)] + w(d))
v: variables to be stratified on
w(v): their weight
irad(v): information radius between reference and test set distribution of factor levels in v
d: absolute difference between test proportions of X and G, i.e. between the proportion of test
samples and the proportion of groups (e.g. speakers) that go into the test set
w(d): its weight
```
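For illustration, here is a minimal sketch of this score, assuming that `irad` is the Jensen-Shannon divergence between the reference and test distributions; the function and variable names (`entropy`, `irad`, `traintest_score`) are illustrative, not the library's internal API:
```python
import numpy as np

def entropy(p):
    # Shannon entropy in bits; zero-probability classes are skipped
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def irad(p_ref, p_part):
    # information radius = Jensen-Shannon divergence between reference
    # and partition class distributions (same class order assumed)
    p_ref, p_part = np.asarray(p_ref, float), np.asarray(p_part, float)
    m = 0.5 * (p_ref + p_part)
    return entropy(m) - 0.5 * (entropy(p_ref) + entropy(p_part))

def traintest_score(irads, weights, d, w_d):
    # irads, weights: dicts keyed by stratification variable name;
    # d: |test proportion in X - test proportion in G|; w_d: its weight
    num = sum(weights[v] * irads[v] for v in irads) + w_d * d
    den = sum(weights.values()) + w_d
    return num / den

# a perfectly stratified split with an exact size match scores 0
print(traintest_score({"target": 0.0}, {"target": 2}, d=0.0, w_d=1))
```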
* **score to be minimized for train/dev/test set split:**
```
(sum_v[w(v) * max_irad(v)] + w(d) * max_d) / (sum_v[w(v)] + w(d))
v: variables to be stratified on
w(v): their weight
max_irad(v): maximum information radius of reference distribution of classes in v and
- dev set distribution,
- test set distribution
max_d: maximum of absolute difference between proportions of X and G (see above) calculated for
the dev and test set
w(d): its weight
```
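Under the same assumptions, a sketch of the three-way variant: the per-variable information radius and the size difference are each replaced by their maximum over the dev and test partitions before weighting (again, all names are illustrative and `irad()` is reused from the sketch above):
```python
def traindevtest_score(irads_dev, irads_test, weights, d_dev, d_test, w_d):
    # worst case over dev and test per stratification variable
    max_irads = {v: max(irads_dev[v], irads_test[v]) for v in irads_dev}
    # worst case of the two size deviations
    max_d = max(d_dev, d_test)
    num = sum(weights[v] * max_irads[v] for v in max_irads) + w_d * max_d
    den = sum(weights.values()) + w_d
    return num / den
```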
## <a name="interpretation">How to interprete the returned `info` dict</a>
* let's look at [Example 2](#example2) above. There `info` becomes:
```python
{
'score': 0.030828359568603338,
'size_devset_in_spliton': 0.1,
'size_devset_in_X': 0.14,
'size_testset_in_spliton': 0.1,
'size_testset_in_X': 0.13,
'p_target_ref': {'B': 0.49, 'A': 0.51},
'p_target_dev': {'A': 0.5, 'B': 0.5},
'p_target_test': {'A': 0.5384615384615384, 'B': 0.46153846153846156},
'p_strat_var1_ref': {'G': 0.56, 'F': 0.44},
'p_strat_var1_dev': {'G': 0.5714285714285714, 'F': 0.42857142857142855},
'p_strat_var1_test': {'F': 0.5384615384615384, 'G': 0.46153846153846156},
'p_strat_var2_ref': {'I': 0.48, 'H': 0.52},
'p_strat_var2_dev': {'I': 0.5, 'H': 0.5},
'p_strat_var2_test': {'I': 0.46153846153846156, 'H': 0.5384615384615384}
}
```
* **Explanations**
* **score:** the score to be minimized for the train/dev/test split, as defined in [Algorithm](#algorithm)
* **size_devset_in_spliton:** proportion of `split_on` levels (e.g. speakers) assigned to the development set
* **size_devset_in_X:** proportion of rows of X in the development set
* **size_testset_in_spliton:** proportion of `split_on` levels assigned to the test set
* **size_testset_in_X:** proportion of rows of X in the test set
* **p_target_ref:** reference target class distribution over all data
* **p_target_dev:** target class distribution in development set
* **p_target_test:** target class distribution in test set
* **p_strat_var1_ref:** first stratification variable's reference distribution over all data
* **p_strat_var1_dev:** first stratification variable's class distribution in development set
* **p_strat_var1_test:** first stratification variable's class distribution in test set
* **p_strat_var2_ref:** second stratification variable's reference distribution over all data
* **p_strat_var2_dev:** second stratification variable's class distribution in development set
* **p_strat_var2_test:** second stratification variable's class distribution in test set
* **Remarks**
* for `splitutils.optimize_traintest_split()` no development set results are reported
* the `p_strat_var1_*` and `p_strat_var2_*` key names are derived from the key names in the `stratify_on` argument
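As a quick plausibility check on the `info` dict above, the deviation of the test set's target distribution from the reference distribution can be inspected directly (values copied from the example output):
```python
# class distributions from the example info dict above
p_ref = {"A": 0.51, "B": 0.49}
p_test = {"A": 0.5384615384615384, "B": 0.46153846153846156}

# largest absolute per-class deviation: ~0.028, i.e. the test set's
# "target" distribution is off by less than 3 percentage points
max_dev = max(abs(p_ref[c] - p_test[c]) for c in p_ref)
print(f"max deviation: {max_dev:.3f}")
```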
## <a name="doc">Further documentation</a>
* Please find further details on the split scores and numeric variable binning in this [pdf](https://github.com/reichelu/splitutils/blob/main/docs/splitutils.pdf)