datasieve


Namedatasieve JSON
Version 0.1.7 PyPI version JSON
download
home_page
SummaryThis package implements a flexible data pipeline to help organize row removal (e.g. outlier removal) and feature modification (e.g. PCA)
upload_time2023-07-07 07:27:26
maintainer
docs_urlNone
authorRobert Caulk
requires_python>=3.8.1,<4.0
licenseMIT
keywords
VCS
bugtrack_url
requirements No requirements were recorded.
Travis-CI No Travis.
coveralls test coverage No coveralls.
            ![datasieve-logo](assets/datasieve_logo.png)

# DataSieve

DataSieve is very similar to the SKlearn Pipeline in that it:

- fits an arbitrary series of transformations to an array X
- transforms subsequent arrays of the same dimension according to the fit from the original X
- inverse transforms arrays by inverting the series of transformations

This means that it follows the SKLearn API very closely, and in fact users can use SKLearn transforms directly without making any modifications.

The main **difference** is that DataSieve allows for the manipulation of the y and sample_weight arrays in addition to the X array. This is useful if you find yourself wishing to use the SKLearn pipeline for:

- removing outliers across your X, y, and sample_weights arrays according to simple or complex criteria
- remove feature columns based on arbitrary criteria (e.g. low variance features)
- change feature column names at certain transformations (e.g. PCA)
- needing outlier classification without removal (see `oulier_check`)
- passing dynamic parameters to individual transforms of the pipeline
- passing dataframes/arrays without worrying about converting to arrays and maintaining the proper feature columns
- customizing backend for parallelization (e.g. Dask, Ray, loky, etc.)

These improved flexibilities allow for more customized/creative transformations. For example, the included `DBSCAN` has automated parameter fitting and outlier removal based on clustering. 


## Usage
The user builds the pipeline similarly to SKLearn, and can even use SKLearn transforms directly with the `SKLearnWrapper`:

```python
    from datasieve.pipeline import Pipeline
    import datasieve.transforms as dst
    from sklearn.preprocessing import MinMaxScaler

    feature_pipeline = Pipeline([
        ("detect_constants", dst.VarianceThreshold(threshold=0)),
        ("pre_svm_scaler", dst.SKlearnWrapper(MinMaxScaler(feature_range=(-1, 1)))),
        ("svm", dst.SVMOutlierExtractor()),
        ("pca", dst.PCA(n_components=0.95)),
        ("post_pca_scaler", dst.SKlearnWrapper(MinMaxScaler(feature_range=(-1, 1))))
    ])

```

Once the pipeline is built, it can be fit and transformed similar to a SKLearn pipeline:

```python
X, y, sample_weight = feature_pipeline.fit_transform(X, y, sample_weight)
```

This pipeline demonstrates the various components of `DataSieve` which are missing from SKLearn's pipeline. A dataframe `X` (if desired, else users can input a numpy array without column names) is input with its associated `y` and `sample_weight` arrays/vectors (these are also optional). The `VarianceThreshold` will first detect and remove any features that have zero variance in X, the `SVMOutlierExtractor` will fit `SGDOneClassSVM` to `X` and then remove the detected outliers in `X`, while also propagating those row removals from `y` and `sample_weight`. Finally, the `PCA` will be fit to the remaining `X` array with the features count changing and getting renamed. The returned `X` dataframe will have the correctly named column features/count, and equal row counts across the `X`, `y`, and `sample_weight` arrays.

Next, the `feature_pipeline` can then be used to transform other datasets with the same input feature dimension:

```python
Xprime, _, _ = feature_pipeline.transform(X2)
```

Finally, similar to SKLearn's pipeline, the `feature_pipeline` can be used to inverse_transform the array `Xprime` array that has the same dimensions as the returned `X2`/`X` array from the pipeline:

```python
X2, _ ,_ = feature_pipeline.inverse_transform(Xprime)
```


## Creating a custom transform

An example would be someone who wants to use `SGDOneClassSVM` to detect and remove outliers from their data set before training:

```python
class SVMOutlierExtractor(BaseTransform):
    """
    A subclass of the SKLearn SGDOneClassSVM that adds a transform() method
    for removing detected outliers from X (as well as the associated y and
    sample_weight if they are also furnished.
    """

    def __init__(self, **kwargs):
        self._skl = SGDOneClassSVM(**kwargs)

    def fit_transform(self, X, y=None, sample_weight=None, feature_list=None, **kwargs):
        self.fit(X, y, sample_weight=sample_weight)
        return self.transform(X, y, sample_weight, feature_list)

    def fit(self, X, y=None, sample_weight=None, feature_list=None, **kwargs):
        self._skl.fit(X, y=y, sample_weight=sample_weight)
        return X, y, sample_weight, feature_list

    def transform(self, X, y=None, sample_weight=None, feature_list=None,
                  outlier_check=False, **kwargs):
        y_pred = self._skl.predict(X)
        y_pred = np.where(y_pred == -1, 0, y_pred)
        if not outlier_check:
            X, y, sample_weight = remove_outliers(X, y, sample_weight, y_pred)
            num_tossed = len(y_pred) - len(X)
            if num_tossed > 0:
                logger.info(
                    f"SVM detected {num_tossed} data points "
                    "as outliers."
                )
        else:
            y += y_pred
            y -= 1

        return X, y, sample_weight, feature_list

    def inverse_transform(self, X, y=None, sample_weight=None, feature_list=None, **kwargs):
        """
        Unused
        """
        return X, y, sample_weight, feature_list
```


As shown here, the `fit()` method is actually identical to the SKLearn `fit()` method, but the `transform()` removes data points from X, y, and sample_weight for any outliers detected in the `X` array.

## Data removal

The command `feature_pipeline.fit_transform(X, y, sample_weight)` fits each pipeline step to `X`, and transforms `X` according to each step's `transform()` method. In some cases, this will not affect `y` or `sample_weight`. For example, `MinMaxScaler` simply scales `X` and saves the normalization information.  Meanwhile, in the `SVMOutlierExtractor`, `.fit()` will fit an SVM to `X` and `.transform()` will remove any detected outliers from `X`. Typical `Scikit-Learn` pipelines do not remove those data points from `y` and `sample_weight`. Luckily, the `Pipeline` takes care of the "associated removal" of the same outlier data points from `y` and `sample_weight`. 

## Feature modification

Another feature is demonstrated in the `PCA`, which fits a PCA transform to `X` and then transforms `X` to principal components. This dimensionality reduction means that the features are no longer the same, instead they are now `PC1`, `PC2` ... `PCX`. `Pipeline` handles the feature renaming at that step (which is not a feature available in the `Scikit-Learn` pipeline). Similar to `FlowdaptPCA`, the `VarianceThreshold` subclasses the `Scikit-Learn` `VarianceThreshold` which is geared toward removing features that have a low variance. `VarianceThreshold` ensures that the removed features are properly handled when `X` passes through this part of the pipeline.


## Outlier checking

DataSieve also allows users to fit a pipeline that can be used to flag outliers in data *without* removing them from the dataset. This may be handy in a variety of cases where keeping the data point is important but having some indication of which points are outliers is also important. In order to use this functionality, you can take an already `fit` pipeline and call `transform(X, outlier_check=True)` which will return X as well as a vector of 1s and 0s indicating which points are outliers. This is demonstrated in the following example:

```python
pipeline = Pipeline([
    ("pre_svm_scaler", transforms.DataSieveMinMaxScaler()),
    ("svm", transforms.SVMOutlierExtractor())
])

pipeline.fit(X)

X, outliers, _ = pipeline.transform(X, outlier_check=True)
```

Now X will *not* have any of the outlier data points removed, but the vector `outliers` will be an indication of which points were classified as outliers, where 0 means the point was an outlier and 1 means that the point was an inlier.


# Installation

The easiest way to install `datasieve` is with:

```
pip install datasieve
```

but you can also clone this repository and install it with:

```
git clone https://github.com/emergentmethods/datasieve.git
cd datasieve
poetry install
```


# License

Copyright (c) 2023 DataSieve

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

            

Raw data

            {
    "_id": null,
    "home_page": "",
    "name": "datasieve",
    "maintainer": "",
    "docs_url": null,
    "requires_python": ">=3.8.1,<4.0",
    "maintainer_email": "",
    "keywords": "",
    "author": "Robert Caulk",
    "author_email": "",
    "download_url": "https://files.pythonhosted.org/packages/2b/d2/9064ffc2fc9cccff943f987ce904bed354511cee8a93a2993ebf8057ab9b/datasieve-0.1.7.tar.gz",
    "platform": null,
    "description": "![datasieve-logo](assets/datasieve_logo.png)\n\n# DataSieve\n\nDataSieve is very similar to the SKlearn Pipeline in that it:\n\n- fits an arbitrary series of transformations to an array X\n- transforms subsequent arrays of the same dimension according to the fit from the original X\n- inverse transforms arrays by inverting the series of transformations\n\nThis means that it follows the SKLearn API very closely, and in fact users can use SKLearn transforms directly without making any modifications.\n\nThe main **difference** is that DataSieve allows for the manipulation of the y and sample_weight arrays in addition to the X array. This is useful if you find yourself wishing to use the SKLearn pipeline for:\n\n- removing outliers across your X, y, and sample_weights arrays according to simple or complex criteria\n- remove feature columns based on arbitrary criteria (e.g. low variance features)\n- change feature column names at certain transformations (e.g. PCA)\n- needing outlier classification without removal (see `oulier_check`)\n- passing dynamic parameters to individual transforms of the pipeline\n- passing dataframes/arrays without worrying about converting to arrays and maintaining the proper feature columns\n- customizing backend for parallelization (e.g. Dask, Ray, loky, etc.)\n\nThese improved flexibilities allow for more customized/creative transformations. For example, the included `DBSCAN` has automated parameter fitting and outlier removal based on clustering. \n\n\n## Usage\nThe user builds the pipeline similarly to SKLearn, and can even use SKLearn transforms directly with the `SKLearnWrapper`:\n\n```python\n    from datasieve.pipeline import Pipeline\n    import datasieve.transforms as dst\n    from sklearn.preprocessing import MinMaxScaler\n\n    feature_pipeline = Pipeline([\n        (\"detect_constants\", dst.VarianceThreshold(threshold=0)),\n        (\"pre_svm_scaler\", dst.SKlearnWrapper(MinMaxScaler(feature_range=(-1, 1)))),\n        (\"svm\", dst.SVMOutlierExtractor()),\n        (\"pca\", dst.PCA(n_components=0.95)),\n        (\"post_pca_scaler\", dst.SKlearnWrapper(MinMaxScaler(feature_range=(-1, 1))))\n    ])\n\n```\n\nOnce the pipeline is built, it can be fit and transformed similar to a SKLearn pipeline:\n\n```python\nX, y, sample_weight = feature_pipeline.fit_transform(X, y, sample_weight)\n```\n\nThis pipeline demonstrates the various components of `DataSieve` which are missing from SKLearn's pipeline. A dataframe `X` (if desired, else users can input a numpy array without column names) is input with its associated `y` and `sample_weight` arrays/vectors (these are also optional). The `VarianceThreshold` will first detect and remove any features that have zero variance in X, the `SVMOutlierExtractor` will fit `SGDOneClassSVM` to `X` and then remove the detected outliers in `X`, while also propagating those row removals from `y` and `sample_weight`. Finally, the `PCA` will be fit to the remaining `X` array with the features count changing and getting renamed. The returned `X` dataframe will have the correctly named column features/count, and equal row counts across the `X`, `y`, and `sample_weight` arrays.\n\nNext, the `feature_pipeline` can then be used to transform other datasets with the same input feature dimension:\n\n```python\nXprime, _, _ = feature_pipeline.transform(X2)\n```\n\nFinally, similar to SKLearn's pipeline, the `feature_pipeline` can be used to inverse_transform the array `Xprime` array that has the same dimensions as the returned `X2`/`X` array from the pipeline:\n\n```python\nX2, _ ,_ = feature_pipeline.inverse_transform(Xprime)\n```\n\n\n## Creating a custom transform\n\nAn example would be someone who wants to use `SGDOneClassSVM` to detect and remove outliers from their data set before training:\n\n```python\nclass SVMOutlierExtractor(BaseTransform):\n    \"\"\"\n    A subclass of the SKLearn SGDOneClassSVM that adds a transform() method\n    for removing detected outliers from X (as well as the associated y and\n    sample_weight if they are also furnished.\n    \"\"\"\n\n    def __init__(self, **kwargs):\n        self._skl = SGDOneClassSVM(**kwargs)\n\n    def fit_transform(self, X, y=None, sample_weight=None, feature_list=None, **kwargs):\n        self.fit(X, y, sample_weight=sample_weight)\n        return self.transform(X, y, sample_weight, feature_list)\n\n    def fit(self, X, y=None, sample_weight=None, feature_list=None, **kwargs):\n        self._skl.fit(X, y=y, sample_weight=sample_weight)\n        return X, y, sample_weight, feature_list\n\n    def transform(self, X, y=None, sample_weight=None, feature_list=None,\n                  outlier_check=False, **kwargs):\n        y_pred = self._skl.predict(X)\n        y_pred = np.where(y_pred == -1, 0, y_pred)\n        if not outlier_check:\n            X, y, sample_weight = remove_outliers(X, y, sample_weight, y_pred)\n            num_tossed = len(y_pred) - len(X)\n            if num_tossed > 0:\n                logger.info(\n                    f\"SVM detected {num_tossed} data points \"\n                    \"as outliers.\"\n                )\n        else:\n            y += y_pred\n            y -= 1\n\n        return X, y, sample_weight, feature_list\n\n    def inverse_transform(self, X, y=None, sample_weight=None, feature_list=None, **kwargs):\n        \"\"\"\n        Unused\n        \"\"\"\n        return X, y, sample_weight, feature_list\n```\n\n\nAs shown here, the `fit()` method is actually identical to the SKLearn `fit()` method, but the `transform()` removes data points from X, y, and sample_weight for any outliers detected in the `X` array.\n\n## Data removal\n\nThe command `feature_pipeline.fit_transform(X, y, sample_weight)` fits each pipeline step to `X`, and transforms `X` according to each step's `transform()` method. In some cases, this will not affect `y` or `sample_weight`. For example, `MinMaxScaler` simply scales `X` and saves the normalization information.  Meanwhile, in the `SVMOutlierExtractor`, `.fit()` will fit an SVM to `X` and `.transform()` will remove any detected outliers from `X`. Typical `Scikit-Learn` pipelines do not remove those data points from `y` and `sample_weight`. Luckily, the `Pipeline` takes care of the \"associated removal\" of the same outlier data points from `y` and `sample_weight`. \n\n## Feature modification\n\nAnother feature is demonstrated in the `PCA`, which fits a PCA transform to `X` and then transforms `X` to principal components. This dimensionality reduction means that the features are no longer the same, instead they are now `PC1`, `PC2` ... `PCX`. `Pipeline` handles the feature renaming at that step (which is not a feature available in the `Scikit-Learn` pipeline). Similar to `FlowdaptPCA`, the `VarianceThreshold` subclasses the `Scikit-Learn` `VarianceThreshold` which is geared toward removing features that have a low variance. `VarianceThreshold` ensures that the removed features are properly handled when `X` passes through this part of the pipeline.\n\n\n## Outlier checking\n\nDataSieve also allows users to fit a pipeline that can be used to flag outliers in data *without* removing them from the dataset. This may be handy in a variety of cases where keeping the data point is important but having some indication of which points are outliers is also important. In order to use this functionality, you can take an already `fit` pipeline and call `transform(X, outlier_check=True)` which will return X as well as a vector of 1s and 0s indicating which points are outliers. This is demonstrated in the following example:\n\n```python\npipeline = Pipeline([\n    (\"pre_svm_scaler\", transforms.DataSieveMinMaxScaler()),\n    (\"svm\", transforms.SVMOutlierExtractor())\n])\n\npipeline.fit(X)\n\nX, outliers, _ = pipeline.transform(X, outlier_check=True)\n```\n\nNow X will *not* have any of the outlier data points removed, but the vector `outliers` will be an indication of which points were classified as outliers, where 0 means the point was an outlier and 1 means that the point was an inlier.\n\n\n# Installation\n\nThe easiest way to install `datasieve` is with:\n\n```\npip install datasieve\n```\n\nbut you can also clone this repository and install it with:\n\n```\ngit clone https://github.com/emergentmethods/datasieve.git\ncd datasieve\npoetry install\n```\n\n\n# License\n\nCopyright (c) 2023 DataSieve\n\nPermission is hereby granted, free of charge, to any person obtaining a copy\nof this software and associated documentation files (the \"Software\"), to deal\nin the Software without restriction, including without limitation the rights\nto use, copy, modify, merge, publish, distribute, sublicense, and/or sell\ncopies of the Software, and to permit persons to whom the Software is\nfurnished to do so, subject to the following conditions:\n\nThe above copyright notice and this permission notice shall be included in all\ncopies or substantial portions of the Software.\n\nTHE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR\nIMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,\nFITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE\nAUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER\nLIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,\nOUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE\nSOFTWARE.\n",
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "This package implements a flexible data pipeline to help organize row removal (e.g. outlier removal) and feature modification (e.g. PCA)",
    "version": "0.1.7",
    "project_urls": null,
    "split_keywords": [],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "bbc2e766030547449890e09cc2f06faae1a054d9a0acbb8f4c810b8b6ae80050",
                "md5": "a89304b920ea6cbe9d1daec230d2e78e",
                "sha256": "e53396e7bbf597cd4fc053b8ac777df334e7cf57c4f13ec1ef74f627c1a3c140"
            },
            "downloads": -1,
            "filename": "datasieve-0.1.7-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "a89304b920ea6cbe9d1daec230d2e78e",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.8.1,<4.0",
            "size": 16635,
            "upload_time": "2023-07-07T07:27:24",
            "upload_time_iso_8601": "2023-07-07T07:27:24.381510Z",
            "url": "https://files.pythonhosted.org/packages/bb/c2/e766030547449890e09cc2f06faae1a054d9a0acbb8f4c810b8b6ae80050/datasieve-0.1.7-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "2bd29064ffc2fc9cccff943f987ce904bed354511cee8a93a2993ebf8057ab9b",
                "md5": "49ca66acd6827167cf25b6d491d6ecb9",
                "sha256": "c1041198f30d24f05ff9bdf9ba6d237a9afafcb2946b1a4483dd06d96b928753"
            },
            "downloads": -1,
            "filename": "datasieve-0.1.7.tar.gz",
            "has_sig": false,
            "md5_digest": "49ca66acd6827167cf25b6d491d6ecb9",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.8.1,<4.0",
            "size": 14080,
            "upload_time": "2023-07-07T07:27:26",
            "upload_time_iso_8601": "2023-07-07T07:27:26.012497Z",
            "url": "https://files.pythonhosted.org/packages/2b/d2/9064ffc2fc9cccff943f987ce904bed354511cee8a93a2993ebf8057ab9b/datasieve-0.1.7.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2023-07-07 07:27:26",
    "github": false,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "lcname": "datasieve"
}
        
Elapsed time: 0.08842s