# <a name="contents">Machine learning data partition tools</a>
- [Contents](#contents)
- [Author](#author)
- [Purpose](#purpose)
- [Installation](#installation)
- [Synopsis](#synopsis)
- [optimize_traintest_split()](#otts)
- [optimize_traindevtest_split()](#otdts)
- [binning()](#binning)
- [Usage](#usage)
- [Example 1: Split dummy data into training and test partitions](#example1)
- [Example 2: Split dummy data into training, development, and test partitions](#example2)
- [Example 3: Split dummy data into training, development, and test partitions, the target and several stratification variables being numeric](#example3)
- [Algorithm](#algorithm)
- [How to interpret the returned info dict](#interpretation)
- [Further documentation](#doc)
## <a name="author">Author</a>
Uwe Reichel, audEERING GmbH, Gilching, Germany
## <a name="purpose">Purpose</a>
* machine learning data splitting tool that allows for:
* group-disjunct splits (e.g. different speakers in train, dev, and test partition)
* stratification on multiple target and grouping variables (e.g. emotion, gender, language)
## <a name="installation">Installation</a>
### From PyPI
* set up a virtual environment `venv_splitutils`, activate it, and install `splitutils`. For Linux this works e.g. as follows:
```bash
$ virtualenv --python="/usr/bin/python3" venv_splitutils
$ source venv_splitutils/bin/activate
(venv_splitutils) $ pip install splitutils
```
### From GitHub
* project URL: https://github.com/reichelu/splitutils
* set up a virtual environment `venv_splitutils`, activate it, and install the requirements. For Linux this works e.g. as follows:
```bash
$ git clone git@github.com:reichelu/splitutils.git
$ cd splitutils/
$ virtualenv --python="/usr/bin/python3" venv_splitutils
$ source venv_splitutils/bin/activate
(venv_splitutils) $ pip install -r requirements.txt
```
## <a name="synopsis">Synopsis</a>
### <a name="otts">optimize_traintest_split()</a>
```python
def optimize_traintest_split(X, y, split_on, stratify_on, weight=None,
test_size=.1, k=30, seed=42):
''' optimize group-disjunct split into training and test set which is guided by:
- disjunct split of values in SPLIT_ON
- stratification by all keys in STRATIFY_ON (targets and groupings)
- test set proportion in X should be close to test_size (which is the test
proportion in set(split_on))
Parameters:
X: (np.array or pd.DataFrame) of features
y: (np.array) of targets of length N
if the entries of y are of type str or int, y is assumed to be categorical,
and it is additionally tested that all partitions cover all classes.
Otherwise, y is assumed to be numeric, and no coverage test is done.
split_on: (np.array) list of length N with grouping variable (e.g. speaker IDs),
on which the group-disjunct split is to be performed. Must be categorical.
stratify_on: (dict) Dict-keys are variable names (targets and/or further groupings)
the split should be stratified on (groupings could e.g. be sex, age class, etc).
Dict-Values are np.array-s of length N that contain the variable values. All
variables must be categorical.
weight: (dict) weight for each variable in stratify_on; defines its
contribution to the optimization score. Uniform weighting by default.
The additional key "size_diff" defines how the deviation from the
intended test size is weighted.
test_size: (float) test proportion in set(split_on), e.g. 10% of speakers
to be held out
k: (int) number of different splits to be tried out
seed: (int) random seed
Returns:
train_i: (np.array) train set indices in X
test_i: (np.array) test set indices in X
info: (dict) detail information about reference and achieved prob distributions
"size_testset_in_spliton": intended test_size
"size_testset_in_X": optimized test proportion in X
"p_ref_{c}": reference class distribution calculated from stratify_on[c]
"p_test_{c}": test set class distribution calculated from stratify_on[c][test_i]
'''
```
### <a name="otdts">optimize_traindevtest_split()</a>
```python
def optimize_traindevtest_split(X, y, split_on, stratify_on, weight=None,
dev_size=.1, test_size=.1, k=30, seed=42):
''' optimize group-disjunct split into training, dev, and test set, which is
guided by:
- disjunct split of values in SPLIT_ON
- stratification by all keys in STRATIFY_ON (targets and groupings)
- test set proportion in X should be close to test_size (which is the test
proportion in set(split_on))
Parameters:
X: (np.array or pd.DataFrame) of features
y: (np.array) of targets of length N
if the entries of y are of type str or int, y is assumed to be categorical,
and it is additionally tested that all partitions cover all classes.
Otherwise, y is assumed to be numeric, and no coverage test is done.
split_on: (np.array) list of length N with grouping variable (e.g. speaker IDs),
on which the group-disjunct split is to be performed. Must be categorical.
stratify_on: (dict) Dict-keys are variable names (targets and/or further groupings)
the split should be stratified on (groupings could e.g. be sex, age class, etc).
Dict-Values are np.array-s of length N that contain the variable values. All
variables must be categorical.
weight: (dict) weight for each variable in stratify_on; defines its
contribution to the optimization score. Uniform weighting by default.
The additional key "size_diff" defines how the deviations from the
intended dev and test sizes are weighted.
dev_size: (float) dev proportion in set(split_on), e.g. 10% of speakers
to be held out
test_size: (float) test proportion in set(split_on) for test set
k: (int) number of different splits to be tried out
seed: (int) random seed
Returns:
train_i: (np.array) train set indices in X
dev_i: (np.array) dev set indices in X
test_i: (np.array) test set indices in X
info: (dict) detail information about reference and achieved prob distributions
"dev_size_in_spliton": intended grouping dev_size
"dev_size_in_X": optimized dev proportion of observations in X
"test_size_in_spliton": intended grouping test_size
"test_size_in_X": optimized test proportion of observations in X
"p_ref_{c}": reference class distribution calculated from stratify_on[c]
"p_dev_{c}": dev set class distribution calculated from stratify_on[c][dev_i]
"p_test_{c}": test set class distribution calculated from stratify_on[c][test_i]
'''
```
### <a name="binning">binning()</a>
```python
def binning(x, nbins=2, lower_boundaries=None, seed=42):
'''
bins numeric data.
If x is one-dimensional:
binning is done either intrinsically into nbins classes
based on an equidistant percentile split, or extrinsically
by using the lower_boundaries values.
If x is two-dimensional:
binning is done by kmeans clustering into nbins clusters.
Parameters:
x: (list, np.array) with numeric data.
nbins: (int) number of bins
lower_boundaries: (list) of lower bin boundaries.
If x is 1-dim and lower_boundaries is provided, nbins is ignored
and x is binned extrinsically. The first value of lower_boundaries
is always corrected not to be higher than min(x).
seed: (int) random seed for kmeans
Returns:
c: (np.array) integers as bin IDs
'''
```
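A short usage sketch based on the signature above; the data and parameter values are illustrative only:
```python
import numpy as np
from splitutils import binning

x = np.random.rand(200)

# intrinsic binning: 4 classes via equidistant percentile boundaries
b_intrinsic = binning(x, nbins=4)

# extrinsic binning: classes defined by lower bin boundaries
# (the first boundary is corrected down to min(x) if necessary)
b_extrinsic = binning(x, lower_boundaries=[0.0, 0.25, 0.5, 0.75])

# 2-dimensional input: kmeans clustering into nbins clusters
X2 = np.random.rand(200, 3)
b_kmeans = binning(X2, nbins=4, seed=42)
```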
## <a name="usage">Usage</a>
### <a name="example1">Example 1: Split dummy data into training and test partitions</a>
* see `scripts/run_traintest_split.py`
* partitions are:
* disjunct on categorical "split_var"
* stratified on categorical "target", "strat_var1", "strat_var2"
* each covering all levels of "target"
```python
import numpy as np
import os
import pandas as pd
import sys
# add this line if you have cloned the code from github to PROJECT_DIR
# sys.path.append(PROJECT_DIR)
from splitutils import optimize_traintest_split
# set seed
seed = 42
np.random.seed(seed)
# DUMMY DATA
# size
n = 100
# feature array
data = np.random.rand(n, 20)
# target variable
target = np.random.choice(["A", "B"], size=n, replace=True)
# array with variable on which to do a disjunct split
split_var = np.random.choice(["D", "E", "F", "G", "H", "I", "J", "K"],
size=n, replace=True)
# dict of variables to stratify on. Key names are arbitrary.
stratif_vars = {
"target": target,
"strat_var1": np.random.choice(["L", "M"], size=n, replace=True),
"strat_var2": np.random.choice(["N", "O"], size=n, replace=True)
}
# ARGUMENTS
# weight importance of all stratification variables in stratify_on
# as well as of "size_diff", which penalizes the deviation between
# intended and obtained partition sizes.
# Key names must match the names in stratif_vars.
weights = {
"target": 2,
"strat_var1": 1,
"strat_var2": 1,
"size_diff": 1
}
# test partition proportion (from 0 to 1)
test_size = .2
# number of disjunct splits to be tried out in brute force optimization
k = 30
# FIND OPTIMAL SPLIT
train_i, test_i, info = optimize_traintest_split(
X=data,
y=target,
split_on=split_var,
stratify_on=stratif_vars,
weight=weights,
test_size=test_size,
k=k,
seed=seed
)
# SOME OUTPUT
print("test levels of split_var:", sorted(set(split_var[test_i])))
print("goodness of split:", info)
```
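The returned index arrays can then be applied directly to the feature array and targets:
```python
# materialize the partitions (continues the example above)
X_train, y_train = data[train_i], target[train_i]
X_test, y_test = data[test_i], target[test_i]
```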
### <a name="example2">Example 2: Split dummy data into training, development, and test partitionsy</a>
* see `scripts/run_traindevtest_split.py`
* partitions are:
* disjunct on categorical "split_var"
* stratified on categorical "target", "strat_var1", "strat_var2"
* each covering all levels of "target"
```python
import numpy as np
import os
import pandas as pd
import sys
# add this line if you have cloned the code from github to PROJECT_DIR
# sys.path.append(PROJECT_DIR)
from splitutils import optimize_traindevtest_split
# set seed
seed = 42
np.random.seed(seed)
# DUMMY DATA
# size
n = 100
# feature array
data = np.random.rand(n, 20)
# target variable
target = np.random.choice(["A", "B"], size=n, replace=True)
# array with variable on which to do a disjunct split
split_var = np.random.choice(["D", "E", "F", "G", "H", "I", "J", "K"],
size=n, replace=True)
# dict of variables to stratify on. Key names are arbitrary.
stratif_vars = {
"target": target,
"strat_var1": np.random.choice(["F", "G"], size=n, replace=True),
"strat_var2": np.random.choice(["H", "I"], size=n, replace=True)
}
# ARGUMENTS
# weight importance of all stratification variables in stratify_on
# as well as of "size_diff", which penalizes the deviation between
# intended and obtained partition sizes.
# Key names must match the names in stratif_vars.
weights = {
"target": 2,
"strat_var1": 1,
"strat_var2": 1,
"size_diff": 1
}
# dev and test partition proportion (from 0 to 1)
dev_size = .1
test_size = .1
# number of disjunct splits to be tried out in brute force optimization
k = 30
# FIND OPTIMAL SPLIT
train_i, dev_i, test_i, info = optimize_traindevtest_split(
X=data,
y=target,
split_on=split_var,
stratify_on=stratif_vars,
weight=weights,
dev_size=dev_size,
test_size=test_size,
k=k,
seed=seed
)
# SOME OUTPUT
print("test levels of split_var:", sorted(set(split_var[test_i])))
print("goodness of split:", info)
```
### <a name="example3">Example 3: Split dummy data into training, development, and test partitions, the target and several stratification variables being numeric</a>
* see `scripts/run_traindevtest_split_with_binning.py`
* partitions are:
* disjunct on categorical "split_var"
* stratified on numeric "target" and on 3 other numeric stratification variables
```python
import numpy as np
import os
import pandas as pd
import sys
# add this line if you have cloned the code from github to PROJECT_DIR
# sys.path.append(PROJECT_DIR)
from splitutils import (
binning,
optimize_traindevtest_split
)
"""
example script showing how to split dummy data into training, development,
and test partitions that are
* disjunct on categorical "split_var"
* stratified on numeric "target", and on 3 other numeric stratification
variables
"""
# set seed
seed = 42
np.random.seed(seed)
# DUMMY DATA
# size
n = 100
# features
data = np.random.rand(n, 20)
# numeric target variable
num_target = np.random.rand(n)
# array with variable on which to do a disjunct split
split_var = np.random.choice(["D", "E", "F", "G", "H", "I", "J", "K"],
size=n, replace=True)
# further numeric variables to stratify on
num_strat_vars = np.random.rand(n, 3)
# intrinsically bin target into 3 bins by equidistant
# percentile boundaries
binned_target = binning(num_target, nbins=3)
# ... alternatively, a variable can be extrinsically binned by
# specifying lower boundaries:
# binned_target = binning(num_target, lower_boundaries=[0, 0.33, 0.66])
# bin the other stratification variables jointly into a single variable
# with 6 bins (2-dim input is binned by standard scaling followed by
# kmeans clustering)
binned_strat_var = binning(num_strat_vars, nbins=6)
# ... alternatively, each stratification variable could be binned
# individually, intrinsically or extrinsically, the same way as num_target
# strat_var1 = binning(num_strat_vars[:,0], nbins=...) etc.
# now add the obtained categorical variables to the stratification dict
stratif_vars = {
"target": binned_target,
"strat_var": binned_strat_var
}
# ARGUMENTS
# weight importance of all stratification variables in stratify_on
# as well as of "size_diff", which penalizes the deviation between
# intended and obtained partition sizes
weights = {
"target": 2,
"strat_var": 1,
"size_diff": 1
}
# dev and test partition proportion (from 0 to 1)
dev_size = .1
test_size = .1
# number of disjunct splits to be tried out in brute force optimization
k = 30
# FIND OPTIMAL SPLIT
train_i, dev_i, test_i, info = optimize_traindevtest_split(
X=data,
y=num_target,
split_on=split_var,
stratify_on=stratif_vars,
weight=weights,
dev_size=dev_size,
test_size=test_size,
k=k,
seed=seed
)
# SOME OUTPUT
print("test levels of split_var:", sorted(set(split_var[test_i])))
print("goodness of split:", info)
```
## <a name="algorithm">Algorithm</a>
* find optimal train, dev, and test set split based on:
* disjunct split of a categorical grouping variable *G* (e.g. speaker)
* optimized joint stratification on an arbitrary number of categorical target and grouping variables (e.g. emotion, gender, ...)
* close match of partition proportions in *G* and underlying dataset *X*
* brute-force optimization on *k* disjunct splits of *G*
* **score to be minimized for train/test set split:**
```
(sum_v[w(v) * irad(v)] + w(d) * d) / (sum_v[w(v)] + w(d))
v: variables to be stratified on
w(v): their weight
irad(v): information radius between reference and test set distribution of factor levels in v
d: absolute difference between test proportions of X and G, i.e. between the proportion of test
samples and the proportion of groups (e.g. speakers) that go into the test set
w(d): its weight
```
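For illustration, here is a minimal sketch of this score, assuming that `irad` is the Jensen-Shannon divergence between the reference and test distributions; the function and variable names (`entropy`, `irad`, `traintest_score`) are illustrative, not the library's internal API:
```python
import numpy as np

def entropy(p):
    # Shannon entropy in bits; zero-probability classes are skipped
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def irad(p_ref, p_part):
    # information radius = Jensen-Shannon divergence between reference
    # and partition class distributions (same class order assumed)
    p_ref, p_part = np.asarray(p_ref, float), np.asarray(p_part, float)
    m = 0.5 * (p_ref + p_part)
    return entropy(m) - 0.5 * (entropy(p_ref) + entropy(p_part))

def traintest_score(irads, weights, d, w_d):
    # irads, weights: dicts keyed by stratification variable name;
    # d: |test proportion in X - test proportion in G|; w_d: its weight
    num = sum(weights[v] * irads[v] for v in irads) + w_d * d
    den = sum(weights.values()) + w_d
    return num / den

# a perfectly stratified split with an exact size match scores 0
print(traintest_score({"target": 0.0}, {"target": 2}, d=0.0, w_d=1))
```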
* **score to be minimized for train/dev/test set split:**
```
(sum_v[w(v) * max_irad(v)] + w(d) * max_d) / (sum_v[w(v)] + w(d))
v: variables to be stratified on
w(v): their weight
max_irad(v): maximum information radius of reference distribution of classes in v and
- dev set distribution,
- test set distribution
max_d: maximum of absolute difference between proportions of X and G (see above) calculated for
the dev and test set
w(d): its weight
```
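Under the same assumptions, a sketch of the three-way variant: the per-variable information radius and the size difference are each replaced by their maximum over the dev and test partitions before weighting (again, all names are illustrative and `irad()` is reused from the sketch above):
```python
def traindevtest_score(irads_dev, irads_test, weights, d_dev, d_test, w_d):
    # worst case over dev and test per stratification variable
    max_irads = {v: max(irads_dev[v], irads_test[v]) for v in irads_dev}
    # worst case of the two size deviations
    max_d = max(d_dev, d_test)
    num = sum(weights[v] * max_irads[v] for v in max_irads) + w_d * max_d
    den = sum(weights.values()) + w_d
    return num / den
```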
## <a name="interpretation">How to interprete the returned `info` dict</a>
* let's look at [Example 2](#example2) above. There `info` becomes:
```python
{
'score': 0.030828359568603338,
'size_devset_in_spliton': 0.1,
'size_devset_in_X': 0.14,
'size_testset_in_spliton': 0.1,
'size_testset_in_X': 0.13,
'p_target_ref': {'B': 0.49, 'A': 0.51},
'p_target_dev': {'A': 0.5, 'B': 0.5},
'p_target_test': {'A': 0.5384615384615384, 'B': 0.46153846153846156},
'p_strat_var1_ref': {'G': 0.56, 'F': 0.44},
'p_strat_var1_dev': {'G': 0.5714285714285714, 'F': 0.42857142857142855},
'p_strat_var1_test': {'F': 0.5384615384615384, 'G': 0.46153846153846156},
'p_strat_var2_ref': {'I': 0.48, 'H': 0.52},
'p_strat_var2_dev': {'I': 0.5, 'H': 0.5},
'p_strat_var2_test': {'I': 0.46153846153846156, 'H': 0.5384615384615384}
}
```
* **Explanations**
* **score:** the score to be minimized for the train/dev/test split, as defined in [Algorithm](#algorithm)
* **size_devset_in_spliton:** proportion of `split_on` levels (e.g. speakers) assigned to the development set
* **size_devset_in_X:** proportion of rows of X in the development set
* **size_testset_in_spliton:** proportion of `split_on` levels assigned to the test set
* **size_testset_in_X:** proportion of rows of X in the test set
* **p_target_ref:** reference target class distribution over all data
* **p_target_dev:** target class distribution in development set
* **p_target_test:** target class distribution in test set
* **p_strat_var1_ref:** first stratification variable's reference distribution over all data
* **p_strat_var1_dev:** first stratification variable's class distribution in development set
* **p_strat_var1_test:** first stratification variable's class distribution in test set
* **p_strat_var2_ref:** second stratification variable's reference distribution over all data
* **p_strat_var2_dev:** second stratification variable's class distribution in development set
* **p_strat_var2_test:** second stratification variable's class distribution in test set
* **Remarks**
* for `splitutils.optimize_traintest_split()` no development set results are reported
* the `p_strat_var1_*` and `p_strat_var2_*` key names are derived from the key names in the `stratify_on` argument
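As a quick plausibility check on the `info` dict above, the deviation of the test set's target distribution from the reference distribution can be inspected directly (values copied from the example output):
```python
# class distributions from the example info dict above
p_ref = {"A": 0.51, "B": 0.49}
p_test = {"A": 0.5384615384615384, "B": 0.46153846153846156}

# largest absolute per-class deviation: ~0.028, i.e. the test set's
# "target" distribution is off by less than 3 percentage points
max_dev = max(abs(p_ref[c] - p_test[c]) for c in p_ref)
print(f"max deviation: {max_dev:.3f}")
```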
## <a name="doc">Further documentation</a>
* Please find further details on the split scores and numeric variable binning in this [pdf](https://github.com/reichelu/splitutils/blob/main/docs/splitutils.pdf)