catclass

Name	catclass
Version	0.1.2
home_page	None
Summary	A robust framework for generating synthetic categorical datasets for evaluation or testing purposes.
upload_time	2024-07-25 16:40:15
maintainer	None
docs_url	None
author	Miha Malenšek
requires_python	>=3.8
license	Copyright (c) 2018 The Python Packaging Authority Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
keywords	synthetic datasets data generation categorical data
requirements	No requirements were recorded.

# Categorical Classification

A robust framework for generating synthetic categorical datasets for evaluation or testing purposes.

# Usage
---
### Creating a simple dataset
```python
# Assumes `cc` is an instance of CategoricalClassification.
# Creates a simple dataset of 9 features and 10,000 samples; every feature has cardinality 35
X = cc.generate_data(9, 
                     10000, 
                     cardinality=35, 
                     ensure_rep=True, 
                     random_values=True, 
                     low=0, 
                     high=40)

# Creates target labels via clustering
y = cc.generate_labels(X, n=2, class_relation='cluster')

```

# Documentation
---

### CategoricalClassification.dataset_info
```python
print(CategoricalClassification.dataset_info)
```
Stores a formatted dictionary of operations made. Function _CategoricalClassification.generate\_data_ resets its contents. Each subsequent function call adds information to it.

---

### CategoricalClassification.generate_data
```python
CategoricalClassification.generate_data(n_features, 
                                        n_samples, 
                                        cardinality=5, 
                                        structure=None, 
                                        ensure_rep=False, 
                                        random_values=False, 
                                        low=0, 
                                        high=1000,
                                        k=10,
                                        seed=42)
```
Generates a dataset of shape **_(n_samples, n_features)_**, based on the given parameters.

- **n\_features:** _int_
  The number of features in a generated dataset.
- **n\_samples:** _int_
  The number of samples in a generated dataset.
- **cardinality:** _int_, default=5.
  Sets the default cardinality of a generated dataset.
- **structure:** _list, numpy.ndarray_, default=None.
  Sets the structure of a generated dataset. Offers more control over feature value domains and value distributions.
  Follows the format **\[_tuple_, _tuple_, ...\]**, where:
   - **_tuple_** can either be:
      - **(_int_ or _list_, _int_)**: the first element represents the index or list of indexes of features. The second element is their cardinality. Generated features will have a roughly normal density distribution of values, with a randomly selected value as a peak. The feature values will be integers, in range \[0, second element of tuple\].
      - **(_int_ or _list_, _list_)**: the first element represents the index or list of indexes of features. The second element offers two options:
        - **_list_**:  a list of values to be used in the feature or features,
        - **\[_list_, _list_\]**: where the first _list_ element represents the set of values the feature or features possess, and the second the frequencies or probabilities of the individual values.
- **ensure_rep:** _bool_, default=False:
  Control flag. If **_True_**, all possible values **will** appear in the feature.
- **random_values:** _bool_, default=False:
  Control flag. If **_True_**, value domain of feature will be random on interval _\[low, high\]_.
- **low**: _int_, default=0.
  Sets the lower bound of the feature's value domain.
- **high**: _int_, default=1000.
  Sets the upper bound of the feature's value domain. Only used when _random\_values_ is True.
- **k**: _int_ or _float_, default=10.
  Constant that sets the width of the peak of the (roughly normal) feature distribution. The higher the value, the narrower the peak.
- **seed**: _int_, default=42.
  Seed for **_numpy.random.seed_**.

**Returns**: a **_numpy.ndarray_** dataset with **n\_features** features and **n\_samples** samples.
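
The `structure` format described above may be easier to read as a concrete value. The sketch below builds one list that uses each documented tuple form and classifies the entries with a small helper. `describe_entry` is a hypothetical name used only for illustration; it is not part of catclass, and the sketch does not call the library.

```python
# Illustrative only: a `structure` value in the documented format.
# `describe_entry` is a hypothetical helper, not part of catclass.

structure = [
    (0, 5),                              # feature 0: cardinality 5
    ([1, 2], 3),                         # features 1 and 2: cardinality 3
    (3, [10, 20, 30]),                   # feature 3: explicit value domain
    (4, [[0, 1, 2], [0.7, 0.2, 0.1]]),   # feature 4: values and their probabilities
]


def describe_entry(entry):
    """Return which documented tuple form `entry` matches."""
    _indices, spec = entry
    if isinstance(spec, int):
        return "cardinality"
    if len(spec) == 2 and all(isinstance(part, list) for part in spec):
        return "values+probabilities"
    return "values"


for entry in structure:
    print(entry[0], "->", describe_entry(entry))
```

Going by the signature above, such a list would be passed as the `structure` argument of `generate_data`, with `n_features` covering every index it mentions.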

---
### CategoricalClassification.\_configure\_generate\_feature
```python
CategoricalClassification._feature_builder(feature_attributes, 
                                           n_samples, 
                                           ensure_rep=False, 
                                           random_values=False, 
                                           low=0, 
                                           high=1000,
                                           k=10)
```
Helper function used to configure _\_generate\_feature()_ with proper parameters based on _feature\_attributes_.

- **feature\_attributes**: _int_ or _list_ or _numpy.ndarray_
  Attributes of the feature. Can be just the cardinality (_int_), the value domain (_list_), or the value domain and the respective probabilities (_list_).
- **n\_samples**: _int_
  Number of samples in the dataset. Determines the size of the generated feature vector.
- **ensure_rep:** _bool_, default=False:
  Control flag. If **_True_**, all possible values **will** appear in the feature.
- **random_values:** _bool_, default=False:
  Control flag. If **_True_**, value domain of feature will be random on interval _\[low, high\]_.
- **low**: _int_, default=0.
  Sets the lower bound of the feature's value domain.
- **high**: _int_, default=1000.
  Sets the upper bound of the feature's value domain. Only used when _random\_values_ is True.
- **k**: _int_ or _float_, default=10.
  Constant that sets the width of the peak of the (roughly normal) feature distribution. The higher the value, the narrower the peak.

**Returns:** a **_numpy.ndarray_** feature array.

---

### CategoricalClassification.\_generate\_feature
```python
CategoricalClassification._generate_feature(size, 
                                            vec=None, 
                                            cardinality=5, 
                                            ensure_rep=False, 
                                            random_values=False, 
                                            low=0, 
                                            high=1000,
                                            k=10,
                                            p=None)
```
Generates a feature array of length **_size_** using _numpy.random.choice_. Called by _CategoricalClassification.generate\_data_. If no probabilities array is given, the value density of the generated feature array will be roughly normal, with a peak chosen at random from the value array.

- **size**: _int_
  Length of generated feature array.
- **vec**: _list_ or _numpy.ndarray_, default=None
  List of feature values, value domain of feature.
- **cardinality**: _int_, default=5
  Cardinality of feature to use when generating its value domain. If _vec_ is not None, vec is used instead.
- **ensure_rep**: _bool_, default=False
  Control flag. If **_True_**, all possible values **will** appear in the feature array.
- **random_values:** _bool_, default=False:
  Control flag. If **_True_**, value domain of feature will be random on interval _\[low, high\]_.
- **low**: _int_, default=0.
  Sets the lower bound of the feature's value domain.
- **high**: _int_, default=1000.
  Sets the upper bound of the feature's value domain. Only used when _random\_values_ is True.
- **k**: _int_ or _float_, default=10.
  Constant that sets the width of the peak of the (roughly normal) feature distribution. The higher the value, the narrower the peak.
- **p**: _list_ or _numpy.ndarray_, default=None
  Array of frequencies or probabilities. Must have the same length as _vec_.

**Returns:** a **_numpy.ndarray_** feature array. 
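
To make the roles of `k` and `ensure_rep` more concrete, here is a minimal standard-library sketch of the idea described above. It is not the catclass implementation: the Gaussian weighting with standard deviation `len(values) / k` is an assumption, and `sketch_feature` is a hypothetical name.

```python
import math
import random


def sketch_feature(size, values, k=10, ensure_rep=False, rng=None):
    """Draw `size` values with a roughly bell-shaped frequency profile.

    Assumption (not taken from catclass): weights are Gaussian in the
    position of each value, with standard deviation len(values) / k.
    """
    rng = rng or random.Random(42)
    peak = rng.randrange(len(values))   # position of the most likely value
    sigma = len(values) / k             # larger k -> narrower peak
    weights = [math.exp(-((i - peak) ** 2) / (2 * sigma ** 2))
               for i in range(len(values))]

    if ensure_rep:
        if size < len(values):
            raise ValueError("size must be at least len(values) when ensure_rep=True")
        # Place every value once, fill the rest by weighted sampling, then shuffle.
        feature = list(values) + rng.choices(values, weights=weights,
                                             k=size - len(values))
        rng.shuffle(feature)
        return feature
    return rng.choices(values, weights=weights, k=size)


feature = sketch_feature(1000, list(range(35)), k=10, ensure_rep=True)
print(len(feature), len(set(feature)))
```

With `ensure_rep=True` every value of the domain appears at least once, at the cost of slightly flattening the distribution for small `size`.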

___

### CategoricalClassification.generate\_combinations
```python
CategoricalClassification.generate_combinations(X, 
                                                feature_indices, 
                                                combination_function=None, 
                                                combination_type='linear')
```
Generates and adds a new column to given dataset **X**. The column is the result of a combination of features selected with **feature\_indices**. Combinations can be linear, nonlinear, or custom defined functions.

- **X**: _list_ or _numpy.ndarray_:
  Dataset to perform the combinations on.
- **feature_indices**: _list_ or _numpy.ndarray_:
  List of feature (column) indices to be combined.
- **combination\_function**: _function_, default=None:
  Custom or user-defined combination function. The function parameter **must** be a _list_ or _numpy.ndarray_ of features to be combined. The function **must** return a _list_ or _numpy.ndarray_ column or columns, to be added to given dataset _X_ using _numpy.column\_stack_.
- **combination\_type**: _str_ either _linear_ or _nonlinear_, default='linear':
  Selects which built-in combination type is used.
  - If _'linear'_, the combination is a sum of selected features.
  - If _'nonlinear'_, the combination is the sine value of the sum of selected features.

**Returns:** a **_numpy.ndarray_** dataset X with added feature combinations.
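
The three combination modes can be sketched in plain Python as follows. This is an illustration of the documented behaviour (sum, sine of the sum, or a user function receiving the selected columns), not the library's code; `combine` is a hypothetical name and the dataset is a list of rows rather than a _numpy.ndarray_.

```python
import math


def combine(X, feature_indices, combination_function=None, combination_type="linear"):
    """Return a copy of X (list of rows) with one combined column appended."""
    columns = [[row[i] for row in X] for i in feature_indices]
    if combination_function is not None:
        # The custom function receives the list of selected columns.
        new_column = combination_function(columns)
    else:
        sums = [sum(values) for values in zip(*columns)]
        if combination_type == "linear":
            new_column = sums
        elif combination_type == "nonlinear":
            new_column = [math.sin(s) for s in sums]
        else:
            raise ValueError("combination_type must be 'linear' or 'nonlinear'")
    return [row + [value] for row, value in zip(X, new_column)]


X = [[1, 2, 3], [4, 5, 6]]
linear = combine(X, [0, 2])
custom = combine(X, [0, 1],
                 combination_function=lambda cols: [a * b for a, b in zip(*cols)])
print(linear)  # [[1, 2, 3, 4], [4, 5, 6, 10]]
print(custom)  # [[1, 2, 3, 2], [4, 5, 6, 20]]
```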

---

### CategoricalClassification.\_xor
```python
CategoricalClassification._xor(arr)
```
Performs bitwise XOR on given vectors and returns result.
- **arr**: _list_ or _numpy.ndarray_
  List of features to perform the combination on.

**Returns:** a **_numpy.ndarray_** result of **_numpy.bitwise\_xor(a,b)_** on given columns in **_arr_**.

___

### CategoricalClassification.\_and
```python
CategoricalClassification._and(arr)
```
Performs bitwise AND on given vectors and returns result.
- **arr**: _list_ or _numpy.ndarray_
  List of features to perform the combination on.

**Returns:** a **_numpy.ndarray_** result of **_numpy.bitwise\_and(a,b)_** on given columns in **_arr_**.


___

### CategoricalClassification.\_or
```python
CategoricalClassification._or(arr)
```
Performs bitwise OR on given vectors and returns result.
- **arr**: _list_ or _numpy.ndarray_
  List of features to perform the combination on.

**Returns:** a **_numpy.ndarray_** result of **_numpy.bitwise\_or(a,b)_** on given columns in **_arr_**.
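
As a rough illustration of the three bitwise helpers, the sketch below applies an operator element-wise across columns using only the standard library. How catclass handles more than two columns is not stated above; folding the operator from left to right is an assumption, and `bitwise_combine` is a hypothetical name.

```python
from functools import reduce
import operator


def bitwise_combine(columns, op):
    """Fold `op` element-wise across equal-length integer columns."""
    return [reduce(op, values) for values in zip(*columns)]


columns = [[1, 2, 3], [3, 2, 1], [0, 1, 1]]
xor_result = bitwise_combine(columns, operator.xor)
and_result = bitwise_combine(columns, operator.and_)
or_result = bitwise_combine(columns, operator.or_)
print(xor_result)  # [2, 1, 3]
print(and_result)  # [0, 0, 1]
print(or_result)   # [3, 3, 3]
```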


___

### CategoricalClassification.generate\_correlated
```python
CategoricalClassification.generate_correlated(X, 
                                              feature_indices, 
                                              r=0.8)
```
Generates and adds new columns to the given dataset **X**, correlated with the selected features by a Pearson correlation coefficient of **r**. For vectors with mean 0, the correlation equals the cosine of the angle between them.

- **X**: _list_ or _numpy.ndarray_:
  Dataset to add correlated features to.
- **feature_indices**: _int_ or _list_ or _numpy.ndarray_:
  Index of feature (column) or list of feature (column) indices to generate correlated features to.
- **r**: _float_, default=0.8:
  Desired correlation coefficient.

**Returns:** a **_numpy.ndarray_** dataset X with added correlated features.
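
The cosine remark above suggests one way to construct a vector with a chosen correlation: mix the centered, normalised feature with a unit vector orthogonal to it. The sketch below shows that construction with the standard library only. It is an assumption about the approach, not the catclass implementation, and the result is continuous rather than categorical; all function names are hypothetical.

```python
import math
import random


def center(v):
    mean = sum(v) / len(v)
    return [x - mean for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def unit(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def pearson(a, b):
    """Pearson correlation as the cosine of the angle between centered vectors."""
    return dot(unit(center(a)), unit(center(b)))


def correlated_vector(x, r, rng):
    """Return a vector whose Pearson correlation with x is r (up to rounding)."""
    u = unit(center(x))
    noise = center([rng.gauss(0, 1) for _ in x])
    projection = dot(noise, u)
    # Remove the component along u, leaving a unit vector orthogonal to u.
    v = unit([n - projection * ui for n, ui in zip(noise, u)])
    return [r * ui + math.sqrt(1 - r * r) * vi for ui, vi in zip(u, v)]


rng = random.Random(42)
x = [rng.randrange(5) for _ in range(200)]
y = correlated_vector(x, 0.8, rng)
print(round(pearson(x, y), 6))
```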

---

### CategoricalClassification.generate\_duplicates
```python
CategoricalClassification.generate_duplicates(X, 
                                              feature_indices)
```

Duplicates selected feature (column) indices, and adds the duplicated columns to the given dataset **X**.

- **X**: _list_ or _numpy.ndarray_:
  Dataset to add duplicated features to.
- **feature_indices**: _int_ or _list_ or _numpy.ndarray_:
  Index of feature (column) or list of feature (column) indices to duplicate.

**Returns:** a **_numpy.ndarray_** dataset X with added duplicated features.

---
### CategoricalClassification.generate\_labels
```python
CategoricalClassification.generate_labels(X, 
                                          n=2, 
                                          p=0.5, 
                                          k=2, 
                                          decision_function=None, 
                                          class_relation='linear', 
                                          balance=False)
```

Generates a vector of class labels. Classes are assigned using a decision boundary computed from a linear or nonlinear combination of features or from a custom function, or by clustering the samples (see **class\_relation**).

- **X**: _list_ or _numpy.ndarray_:
  Dataset to generate labels for.
- **n**: _int_, default=2:
  Number of classes.
- **p**: _float_ or _list_, default=0.5:
  Class distribution.
- **k**: _int_ or _float_, default=2:
  Constant to be used in the linear or nonlinear combination used to set class values.
- **decision_function**: _function_, default: None
  Custom defined function to use for setting class values. **Must** accept dataset X as input and return a _list_ or _numpy.ndarray_ decision boundary.
- **class_relation**: _str_, either _'linear'_, _'nonlinear'_, or _'cluster'_ default='linear':
  Sets relationship type between class label and sample, by calculating a decision boundary with linear or nonlinear combinations of features in X, or by clustering the samples in X.
- **balance**: _boolean_, default=False:
  Whether to naively balance clusters generated by KMeans clustering.

 **Returns**: **_numpy.ndarray_** y of class labels.
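
A custom `decision_function` only needs to map the dataset to one value per sample. The sketch below shows such a function together with one plausible way of turning its output into two classes. The thresholding rule (lowest fraction `p` of scores becomes class 0) is an assumption made for illustration and may differ from what catclass does; both function bodies are hypothetical.

```python
def decision_function(X):
    """Hypothetical custom decision function: a weighted sum of two features."""
    return [2 * row[0] - row[1] for row in X]


def labels_from_scores(scores, p=0.5):
    """Assumed rule: the lowest fraction p of scores gets class 0, the rest class 1."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    cutoff = round(p * len(scores))
    labels = [1] * len(scores)
    for i in order[:cutoff]:
        labels[i] = 0
    return labels


X = [[0, 3], [1, 1], [2, 0], [3, 2], [4, 4], [5, 1]]
y = labels_from_scores(decision_function(X), p=0.5)
print(y)  # [0, 0, 0, 1, 1, 1]
```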
 
---

### CategoricalClassification.\_cluster\_data
```python
CategoricalClassification._cluster_data(X, 
                                        n, 
                                        p=1.0, 
                                        balance=False)
```
Clusters given data using KMeans clustering.

- **X**: _list_ or _numpy.ndarray_:
  Dataset to cluster.
- **n**: _int_:
  Number of clusters.
- **p**: _float_ or _list_ or _numpy.ndarray_:
  To be used when balance=True, sets class distribution - number of samples per cluster.
- **balance**: _boolean_, default=False:
  Whether to naively balance clusters generated by KMeans clustering.

**Returns**: **_numpy.ndarray_** cluster_labels of clustering labels.
___

### CategoricalClassification.generate\_noise
```python
CategoricalClassification.generate_noise(X, 
                                         y, 
                                         p=0.2, 
                                         type="categorical", 
                                         missing_val=float('-inf'))
```

Generates categorical noise or simulates missing data on a given dataset. 

- **X**: _list_ or _numpy.ndarray_:
  Dataset to generate noise for.
- **y**: _list_ or _numpy.ndarray_:
  Labels of samples in dataset X. **Required** for generating categorical noise.
- **p**: _float_, p <=1.0, default=0.2:
  Amount of noise to generate.
- **type**: _str_, either _"categorical"_ or _"missing"_, default="categorical":
  Type of noise to generate.
- **missing_val**: default=float('-inf'):
  Value to simulate missing values with. Non-numerical values may cause issues with algorithms unequipped to handle them.

**Returns**: **_numpy.ndarray_** X with added noise.
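
For the `"missing"` noise type, the effect of `p` and `missing_val` can be pictured with the following standard-library sketch. It assumes noise is applied to individual cells chosen uniformly at random, which is not confirmed by the description above; `add_missing` is a hypothetical name and the dataset is a list of rows.

```python
import random


def add_missing(X, p=0.2, missing_val=float("-inf"), seed=42):
    """Return a copy of X (list of rows) with a fraction p of cells replaced."""
    rng = random.Random(seed)
    n_rows, n_cols = len(X), len(X[0])
    cells = [(i, j) for i in range(n_rows) for j in range(n_cols)]
    chosen = rng.sample(cells, round(p * len(cells)))
    noisy = [list(row) for row in X]   # copy, so the input stays untouched
    for i, j in chosen:
        noisy[i][j] = missing_val
    return noisy


X = [[1, 2, 3, 4, 5] for _ in range(10)]
X_noisy = add_missing(X, p=0.2)
n_missing = sum(value == float("-inf") for row in X_noisy for value in row)
print(n_missing)  # 10
```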

---

### CategoricalClassification.downsample\_dataset

```python
CategoricalClassification.downsample_dataset(X, 
                                             y, 
                                             n=None, 
                                             seed=42, 
                                             reshuffle=False)
```

Downsamples the given dataset to **n** samples per class, or to the number of samples in the minority class if **n** is not given, resulting in a balanced dataset.

- **X**: _list_ or _numpy.ndarray_:
  Dataset to downsample.
- **y**: _list_ or _numpy.ndarray_:
  Labels corresponding to X.
- **n**: _int_, optional, default=None:
  Optional number of samples per class to downsample to.
- **seed**: _int_, default=42:
  Seed for random state of resample function.
- **reshuffle**: _boolean_, default=False:
  Reshuffle the dataset after downsampling.

**Returns:** Balanced, downsampled **_numpy.ndarray_** X and **_numpy.ndarray_** y.
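
The balancing logic described above can be sketched without any dependencies. This is an illustration only: catclass refers to a resample function, whereas the sketch uses `random.sample`, and `downsample` is a hypothetical name.

```python
import random
from collections import defaultdict


def downsample(X, y, n=None, seed=42, reshuffle=False):
    """Keep n samples per class (default: the size of the smallest class)."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(y):
        by_class[label].append(index)
    if n is None:
        n = min(len(indices) for indices in by_class.values())
    kept = []
    for label in sorted(by_class):
        kept.extend(rng.sample(by_class[label], n))
    if reshuffle:
        rng.shuffle(kept)
    return [X[i] for i in kept], [y[i] for i in kept]


X = [[i] for i in range(10)]
y = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]
X_small, y_small = downsample(X, y)
print(y_small)  # [0, 0, 0, 1, 1, 1]
```

Without `reshuffle=True` the samples come back grouped by class, which is why a reshuffle option is useful before training.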

---

### CategoricalClassification.print\_dataset
```python
CategoricalClassification.print_dataset(X, y)
```
Prints given dataset in a readable format.

- **X**: _list_ or _numpy.ndarray_:
  Dataset to print.
- **y**: _list_ or _numpy.ndarray_:
  Class labels corresponding to samples in given dataset.

---

### CategoricalClassification.summarize
```python
CategoricalClassification.summarize()
```
Prints stored dataset information dictionary in a digestible manner.

Raw data

            {
    "_id": null,
    "home_page": null,
    "name": "catclass",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.8",
    "maintainer_email": null,
    "keywords": "synthetic datasets, data generation, categorical data",
    "author": "Miha Malen\u0161ek",
    "author_email": null,
    "download_url": "https://files.pythonhosted.org/packages/e0/8b/04e052c55dea86a70a8d850f94899a099504b6145f78110f51ad3110d08c/catclass-0.1.2.tar.gz",
    "platform": null,
    "description": "# Categorical Classification\r\n\r\nA robust framework for generating synthetic categorical datasets for evaluation or testing purposes.\r\n\r\n# Usage\r\n---\r\n### Creating a simple dataset\r\n```python\r\n# Creates a simple dataset of 10 features, 10k samples, with feature cardinality of all features being 35\r\nX = cc.generate_data(9, \r\n                     10000, \r\n                     cardinality=35, \r\n                     ensure_rep=True, \r\n                     random_values=True, \r\n                     low=0, \r\n                     high=40)\r\n\r\n# Creates target labels via clustering\r\ny = cc.generate_labels(X, n=2, class_relation='cluster')\r\n\r\n```\r\n\r\n# Documentation\r\n---\r\n\r\n### CategoricalClassification.dataset_info\r\n```python\r\nprint(CategoricalClassification.dataset_info)\r\n```\r\nStores a formatted dictionary of operations made. Function _CategoricalClassification.generate\\_data_ resets its contents. Each subsequent function call adds information to it.\r\n\r\n---\r\n\r\n### CategoricalClassification.generate_data\r\n```python\r\nCategoricalClassification.generate_data(n_features, \r\n                                        n_samples, \r\n                                        cardinality=5, \r\n                                        structure=None, \r\n                                        ensure_rep=False, \r\n                                        random_values=False, \r\n                                        low=0, \r\n                                        high=1000,\r\n                                        k=10,\r\n                                        seed=42)\r\n```\r\nGenerates dataset of shape **_(n_samples, n_features)_**, based on given parameters.\r\n\r\n- **n\\_features:** _int_\r\n  The number of features in a generated dataset.\r\n- **n\\_samples:** _int_\r\n  The number of samples in a generated dataset.\r\n- **cardinality:** _int_, default=5.\r\n  Sets the default 
cardinality of a generated dataset.\r\n- **structure:** _list, numpy.ndarray_, default=None.\r\n  Sets the structure of a generated dataset. Offers more controle over feature value domains and value distributions.\r\n  Follows the format **\\[_tuple_, _tuple_, ...\\]**, where:\r\n   - **_tuple_** can either be:\r\n      - **(_int_ or _list_, _int_)**: the first element represents the index or list of indexes of features. The second element their cardinality. Generated features will have a roughly normal density distribution of values, with a randomly selected value as a peak. The feature values will be integers, in range \\[0, second element of tuple\\].\r\n      - **(_int_ or _list_, _list_)**: the first element represents the index or list of indexes of features. The second element offers two options:\r\n        - **_list_**:  a list of values to be used in the feature or features,\r\n        - **\\[_list_, _list_\\]**: where the first _list_ element represents a set of values the feature or features posses, the second the frequencies or probabilities of individual features.\r\n- **ensure_rep:** _bool_, default=False:\r\n  Control flag. If **_True_**, all possible values **will** appear in the feature.\r\n- **random_values:** _bool_, default=False:\r\n  Control flag. If **_True_**, value domain of feature will be random on interval _\\[low, high\\]_.\r\n- **low**: _int_\r\n  Sets lower bound of value domain of feature.\r\n- **high**: _int_\r\n  Sets upper bound of value domain of feature. Only used when _random\\_values_ is True.\r\n- **k**: _int_ or _float_, default=10.\r\n  Constant, sets width of feature (normal) distribution peak. 
Higher the value, narrower the peak.\r\n- **seed**: _int_, default=42.\r\n  Controls **_numpy.random.seed_**               \r\n\r\n**Returns**: a **_numpy.ndarray_** dataset with **n\\_features** features and **n\\_samples** samples.\r\n\r\n---\r\n### CategoricalClassification.\\_configure\\_generate\\_feature\r\n```python\r\nCategoricalClassification._feature_builder(feature_attributes, \r\n                                           n_samples, \r\n                                           ensure_rep=False, \r\n                                           random_values=False, \r\n                                           low=0, \r\n                                           high=1000,\r\n                                           k=10)\r\n```\r\nHelper function used to configure _\\_generate\\_feature()_ with proper parameters based on _feature\\_atributes_.\r\n\r\n- **feature\\_attributes**: _int_ or _list_ or _numpy.ndarray_\r\nAttributes of feature. Can be just cardinality (_int_), value domain (_list_), or value domain and their respective probabilities  (_list_).\r\n- **n\\_samples**: _int_\r\nNumber of samples in dataset. Determines generated feature vector size.\r\n- **ensure_rep:** _bool_, default=False:\r\n  Control flag. If **_True_**, all possible values **will** appear in the feature.\r\n- **random_values:** _bool_, default=False:\r\n  Control flag. If **_True_**, value domain of feature will be random on interval _\\[low, high\\]_.\r\n- **low**: _int_\r\n  Sets lower bound of value domain of feature.\r\n- **high**: _int_\r\n  Sets upper bound of value domain of feature. Only used when _random\\_values_ is True.\r\n- **k**: _int_ or _float_, default=10.\r\n  Constant, sets width of feature (normal) distribution peak. 
Higher the value, narrower the peak.\r\n**Returns:** a **_numpy.ndarray_** feature array.\r\n\r\n---\r\n\r\n### CategoricalClassification.\\_generate\\_feature\r\n```python\r\nCategoricalClassification._generate_feature(size, \r\n                                            vec=None, \r\n                                            cardinality=5, \r\n                                            ensure_rep=False, \r\n                                            random_values=False, \r\n                                            low=0, \r\n                                            high=1000,\r\n                                            k=10,\r\n                                            p=None)\r\n```\r\nGenerates feature array of length **_size_**. Called by _CategoricalClassification.generate\\_data_, by utilizing _numpy.random.choice_. If no probabilites array is given, the value density of the generated feature array will be roughly normal, with a randomly chosen peak. The peak will be chosen from the value array.\r\n\r\n- **size**: _int_\r\n  Length of generated feature array.\r\n- **vec**: _list_ or _numpy.ndarray_, default=None\r\n  List of feature values, value domain of feature.\r\n- **cardinality**: _int_, default=5\r\n  Cardinality of feature to use when generating its value domain. If _vec_ is not None, vec is used instead.\r\n- **ensure_rep**: _bool_, default=False\r\n  Control flag. If **_True_**, all possible values **will** appear in the feature array.\r\n- **random_values:** _bool_, default=False:\r\n  Control flag. If **_True_**, value domain of feature will be random on interval _\\[low, high\\]_.\r\n- **low**: _int_\r\n  Sets lower bound of value domain of feature.\r\n- **high**: _int_\r\n  Sets upper bound of value domain of feature. Only used when _random\\_values_ is True.\r\n- - **k**: _int_ or _float_, default=10.\r\n  Constant, sets width of feature (normal) distribution peak. 
Higher the value, narrower the peak.\r\n- **p**: _list_ or _numpy.ndarray_, default=None\r\n  Array of frequencies or probabilities. Must be of length _v_ or equal to the length of _v_.\r\n\r\n**Returns:** a **_numpy.ndarray_** feature array. \r\n\r\n___\r\n\r\n### CategoricalClassification.generate\\_combinations\r\n```python\r\nCategoricalClassification.generate_combinations(X, \r\n                                                feature_indices, \r\n                                                combination_function=None, \r\n                                                combination_type='linear')\r\n```\r\nGenerates and adds a new column to given dataset **X**. The column is the result of a combination of features selected with **feature\\_indices**. Combinations can be linear, nonlinear, or custom defined functions.\r\n\r\n- **X**: _list_ or _numpy.ndarray_:\r\n  Dataset to perform the combinations on.\r\n- **feature_indices**: _list_ or _numpy.ndarray_:\r\n  List of feature (column) indices to be combined.\r\n- **combination\\_function**: _function_, default=None:\r\n  Custom or user-defined combination function. The function parameter **must** be a _list_ or _numpy.ndarray_ of features to be combined. 
The function **must** return a _list_ or _numpy.ndarray_ column or columns, to be added to given dataset _X_ using _numpy.column\\_stack_.\r\n- **combination\\_type**: _str_ either _linear_ or _nonlinear_, default='linear':\r\n  Selects which built-in combination type is used.\r\n  - If _'linear'_, the combination is a sum of selected features.\r\n  - If _'nonlinear'_, the combination is the sine value of the sum of selected features.\r\n\r\n**Returns:** a **_numpy.ndarray_** dataset X with added feature combinations.\r\n\r\n---\r\n\r\n### CategoricalClassification.\\_xor\r\n```python\r\nCategoricalClassification._xor(arr)\r\n```\r\nPerforms bitwise XOR on given vectors and returns result.\r\n- **arr**: _list_ or _numpy.ndarray_\r\n  List of features to perform the combination on.\r\n\r\n**Returns:** a **_numpy.ndarray_** result of **_numpy.bitwise\\_xor(a,b)_** on given columns in **_arr_**.\r\n\r\n___\r\n\r\n### CategoricalClassification.\\_and\r\n```python\r\nCategoricalClassification._and(arr)\r\n```\r\nPerforms bitwise AND on given vectors and returns result.\r\n- **arr**: _list_ or _numpy.ndarray_\r\n  List of features to perform the combination on.\r\n\r\n**Returns:** a **_numpy.ndarray_** result of **_numpy.bitwise\\_and(a,b)_** on given columns in **_arr_**.\r\n\r\n\r\n___\r\n\r\n### CategoricalClassification.\\_or\r\n```python\r\nCategoricalClassification._or(arr)\r\n```\r\nPerforms bitwise OR on given vectors and returns result.\r\n- **arr**: _list_ or _numpy.ndarray_\r\n  List of features to perform the combination on.\r\n\r\n**Returns:** a **_numpy.ndarray_** result of **_numpy.bitwise\\_or(a,b)_** on given columns in **_arr_**.\r\n\r\n\r\n___\r\n\r\n### CategoricalClassification.generate\\_correlated\r\n```python\r\nCategoricalClassification.generate_correlated(X, \r\n                                              feature_indices, \r\n                                              r=0.8)\r\n```\r\nGenerates and adds new columns to given dataset **X**, 
correlated to the selected features, by a Pearson correlation coefficient of **r**. For vectors with mean 0, their correlation equals the cosine of their angle.  \r\n\r\n- **X**: _list_ or _numpy.ndarray_:\r\n  Dataset to perform the combinations on.\r\n- **feature_indices**: _int_ or _list_ or _numpy.ndarray_:\r\n  Index of feature (column) or list of feature (column) indices to generate correlated features to.\r\n- **r**: _float_, default=0.8:\r\n  Desired correlation coefficient.\r\n\r\n**Returns:** a **_numpy.ndarray_** dataset X with added correlated features.\r\n\r\n---\r\n\r\n### CategoricalClassification.generate\\_duplicates\r\n```python\r\nCategoricalClassification.generate_duplicates(X, \r\n                                              feature_indices)\r\n```\r\n\r\nDuplicates selected feature (column) indices, and adds the duplicated columns to the given dataset **X**.\r\n\r\n- **X**: _list_ or _numpy.ndarray_:\r\n  Dataset to perform the combinations on.\r\n- **feature_indices**: _int_ or _list_ or _numpy.ndarray_:\r\n  Index of feature (column) or list of feature (column) indices to duplicate.\r\n\r\n**Returns:** a **_numpy.ndarray_** dataset X with added duplicated features.\r\n\r\n---\r\n### CategoricalClassification.generate\\_labels\r\n```python\r\nCategoricalClassification.generate_nonlinear_labels(X, \r\n                                                    n=2, \r\n                                                    p=0.5, \r\n                                                    k=2, \r\n                                                    decision_function=None, \r\n                                                    class_relation='linear', \r\n                                                    balance=False)\r\n```\r\n\r\nGenerates a vector of labels. Labels are (currently) generated as either a linear, nonlinear, or custom defined function. 
It generates classes using a decision boundary generated by the linear, nonlinear, or custom defined function.\r\n\r\n- **X**: _list_ or _numpy.ndarray_:\r\n  Dataset to generate labels for.\r\n- **n**: _int_, default=2:\r\n  Number of classes.\r\n- **p**: _float_ or _list_, default=0.5:\r\n  Class distribution.\r\n- **k**: _int_ or _float_, default=2:\r\n  Constant to be used in the linear or nonlinear combination used to set class values.\r\n- **decision_function**: _function_, default: None\r\n  Custom defined function to use for setting class values. **Must** accept dataset X as input and return a _list_ or _numpy.ndarray_ decision boundary.\r\n- **class_relation**: _str_, either _'linear'_, _'nonlinear'_, or _'cluster'_ default='linear':\r\n  Sets relationship type between class label and sample, by calculating a decision boundary with linear or nonlinear combinations of features in X, or by clustering the samples in X.\r\n- **balance**: _boolean_, default=False:\r\n  Whether to naievly balance clusters generated by KMeans clustering.\r\n\r\n **Returns**: **_numpy.ndarray_** y of class labels.\r\n \r\n---\r\n\r\n### CategoricalClassification.\\_cluster\\_data\r\n```python\r\nCategoricalClassification._cluster_data(X, \r\n                                        n, \r\n                                        p=1.0, \r\n                                        balance=False)\r\n```\r\nClusters given data using KMeans clustering.\r\n\r\n- **X**: _list_ or _numpy.ndarray_:\r\n  Dataset to cluster.\r\n- **n**: _int_:\r\n  Number of clusters.\r\n- **p**: _float_ or _list_ or _numpy.ndarray_:\r\n  To be used when balance=True, sets class distribution - number of samples per cluster.\r\n- **balance**: _boolean_, default=False:\r\n  Whether to naievly balance clusters generated by KMeans clustering.\r\n\r\n**Returns**: **_numpy.ndarray_** cluster_labels of clustering labels.\r\n___\r\n\r\n### 
CategoricalClassification.generate\\_noise\r\n```python\r\nCategoricalClassification.generate_noise(X, \r\n                                         y, \r\n                                         p=0.2, \r\n                                         type=\"categorical\", \r\n                                         missing_val=float('-inf'))\r\n```\r\n\r\nGenerates categorical noise or simulates missing data on a given dataset. \r\n\r\n- **X**: _list_ or _numpy.ndarray_:\r\n  Dataset to generate noise for.\r\n- **y**: _list_ or _numpy.ndarray_:\r\n  Labels of samples in dataset X. **Required** for generating categorical noise.\r\n- **p**: _float_, p <= 1.0, default=0.2:\r\n  Amount of noise to generate.\r\n- **type**: _str_, either _\"categorical\"_ or _\"missing\"_, default=\"categorical\":\r\n  Type of noise to generate.\r\n- **missing_val**: default=float('-inf'):\r\n  Value to simulate missing values with. Non-numerical values may cause issues with algorithms unequipped to handle them.\r\n\r\n**Returns:** **_numpy.ndarray_** X with added noise.\r\n\r\n---\r\n\r\n### CategoricalClassification.downsample\\_dataset\r\n\r\n```python\r\nCategoricalClassification.downsample_dataset(X, \r\n                                             y, \r\n                                             n=None, \r\n                                             seed=42, \r\n                                             reshuffle=False)\r\n```\r\n\r\nDownsamples the given dataset to N samples per class, or to the number of samples in the minority class, resulting in a balanced dataset.\r\n\r\n- **X**: _list_ or _numpy.ndarray_:\r\n  Dataset to downsample.\r\n- **y**: _list_ or _numpy.ndarray_:\r\n  Labels corresponding to X.\r\n- **N**: _int_, optional:\r\n  Optional number of samples per class to downsample to.\r\n- **seed**: _int_, default=42:\r\n  Seed for random state of resample function.\r\n- **reshuffle**: _boolean_, default=False:\r\n  Reshuffle the dataset after downsampling.\r\n\r\n**Returns:** 
Balanced, downsampled **_numpy.ndarray_** X and **_numpy.ndarray_** y.\r\n\r\n---\r\n\r\n### CategoricalClassification.print\\_dataset\r\n```python\r\nCategoricalClassification.print_dataset(X, y)\r\n```\r\nPrints the given dataset in a readable format.\r\n\r\n- **X**: _list_ or _numpy.ndarray_:\r\n  Dataset to print.\r\n- **y**: _list_ or _numpy.ndarray_:\r\n  Class labels corresponding to the samples in the given dataset.\r\n\r\n---\r\n\r\n### CategoricalClassification.summarize\r\n```python\r\nCategoricalClassification.summarize()\r\n```\r\nPrints the stored dataset information dictionary in a digestible manner.\r\n",
    "bugtrack_url": null,
    "license": "Copyright (c) 2018 The Python Packaging Authority  Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the \"Software\"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:  The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.  THE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.",
    "summary": "A robust framework for generating synthetic categorical datasets for evaluation or testing purposes.",
    "version": "0.1.2",
    "project_urls": {
        "Homepage": "https://github.com/98MM/msc_cc",
        "Repository": "https://github.com/98MM/msc_cc"
    },
    "split_keywords": [
        "synthetic datasets",
        " data generation",
        " categorical data"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "665e88970251ede12873d76ae1d5093a17319af9f542336ed42593acd598892c",
                "md5": "d787941183561833df56b9260c94db8d",
                "sha256": "02f8f96b447193c96f345a084259cf9c9291344c2811cb3ee20260a74bf67e80"
            },
            "downloads": -1,
            "filename": "catclass-0.1.2-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "d787941183561833df56b9260c94db8d",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.8",
            "size": 13191,
            "upload_time": "2024-07-25T16:40:14",
            "upload_time_iso_8601": "2024-07-25T16:40:14.246665Z",
            "url": "https://files.pythonhosted.org/packages/66/5e/88970251ede12873d76ae1d5093a17319af9f542336ed42593acd598892c/catclass-0.1.2-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "e08b04e052c55dea86a70a8d850f94899a099504b6145f78110f51ad3110d08c",
                "md5": "7ac86c22e5f365778f754324c585e136",
                "sha256": "d0a36389c27f6f72945503f659553fd2a3a5347ce2960213193734487db355a7"
            },
            "downloads": -1,
            "filename": "catclass-0.1.2.tar.gz",
            "has_sig": false,
            "md5_digest": "7ac86c22e5f365778f754324c585e136",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.8",
            "size": 16456,
            "upload_time": "2024-07-25T16:40:15",
            "upload_time_iso_8601": "2024-07-25T16:40:15.818607Z",
            "url": "https://files.pythonhosted.org/packages/e0/8b/04e052c55dea86a70a8d850f94899a099504b6145f78110f51ad3110d08c/catclass-0.1.2.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-07-25 16:40:15",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "98MM",
    "github_project": "msc_cc",
    "github_not_found": true,
    "lcname": "catclass"
}

Miha Malenšek