slice-finder


* Name: slice-finder
* Version: 0.0.22
* home_page: https://github.com/igaloly/slice_finder
* Summary: Slice Finder: A Framework for Slice Discovery
* upload_time: 2023-04-09 04:38:01
* docs_url: None
* author: Igal Leikin
* requires_python: >=3.10,<4.0
* license: MIT
* keywords: slice discovery, slice detection, subgroup discovery, subgroup detection, slice finding
* requirements: No requirements were recorded.
* Travis-CI: No Travis.
* coveralls test coverage: No coveralls.

# Slice Finder: A Framework for Slice Discovery

Slice Finder is a configurable framework for discovering explainable, anomalous data subsets (slices) whose metric values diverge substantially from those of the entire dataset.

> To illustrate, imagine that you have developed a model for identifying fraudulent transactions. The model's overall accuracy across the entire dataset is 0.95. However, when transactions occur more than 100 km away from the previous transaction and involve cash (2 filters), the model's accuracy drops significantly to 0.54.

Slice Finder is a crucial investigative instrument, as it enables data scientists to identify regions where their models demonstrate over- or under-performance.
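
To make the idea of a slice concrete, here is a minimal sketch in plain Python; it does not use Slice Finder itself. The toy rows, the column names `distance_km` and `payment`, and the `accuracy` helper are assumptions made up for illustration, and the numbers do not reproduce the 0.95 and 0.54 figures quoted above.

```python
# Toy transactions; field names and values are invented for illustration.
rows = [
    {"distance_km": 5,   "payment": "card", "target": 0, "pred": 0},
    {"distance_km": 12,  "payment": "cash", "target": 0, "pred": 0},
    {"distance_km": 150, "payment": "cash", "target": 1, "pred": 0},
    {"distance_km": 300, "payment": "cash", "target": 1, "pred": 1},
    {"distance_km": 220, "payment": "card", "target": 0, "pred": 0},
    {"distance_km": 140, "payment": "cash", "target": 0, "pred": 1},
]


def accuracy(subset):
    """Fraction of rows whose prediction equals the target."""
    return sum(r["target"] == r["pred"] for r in subset) / len(subset)


# A slice is the set of rows that pass every filter (a conjunction of 2 filters here).
filters = [
    lambda r: r["distance_km"] > 100,
    lambda r: r["payment"] == "cash",
]
slice_rows = [r for r in rows if all(f(r) for f in filters)]

overall_accuracy = accuracy(rows)
slice_accuracy = accuracy(slice_rows)
print(f"overall={overall_accuracy:.2f} slice={slice_accuracy:.2f}")
```

The gap between the two values is what makes the slice interesting: the filters are human-readable, and the metric on the rows they select differs sharply from the metric on the whole dataset.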

## Algorithmic achievements
* Quantizing data is difficult, particularly when mapping continuous values onto a discrete space. Slice Finder handles this by fitting an LGBM model to the data and extracting the splits it learns.
* As the number of filters, columns, and values grows, so does the combinatorial search space. Slice Finder addresses this in two ways:
    * Fitting an LGBM model to the data identifies the most important fields and split values, which significantly reduces the search space.
    * Genetic algorithm heuristics steer the search towards global minima/maxima, outperforming both the time-consuming "try-it-all" approach and uniform filter sampling in efficiency and results.
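
The quantization idea in the first bullet can be sketched with a one-level regression tree: pick the threshold on a continuous column that best separates low-error rows from high-error rows. This is a simplified stand-in written in plain Python, not the LightGBM-based procedure Slice Finder uses. The `sse` and `best_split` helpers, the toy data, and the choice of per-row error as the fitting target are all assumptions.

```python
def sse(xs):
    """Sum of squared deviations from the mean."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs)


def best_split(values, errors):
    """Return the threshold on `values` that minimizes the total squared error
    of `errors` around the mean of each side (a one-level regression tree)."""
    pairs = sorted(zip(values, errors))
    best_threshold, best_cost = None, float("inf")
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # equal values cannot be separated by a threshold
        left = [e for _, e in pairs[:i]]
        right = [e for _, e in pairs[i:]]
        cost = sse(left) + sse(right)
        if cost < best_cost:
            best_cost = cost
            # Place the threshold midway between the two neighbouring values.
            best_threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
    return best_threshold


# Hypothetical column and per-row absolute errors of some model.
distance_km = [5, 12, 40, 90, 140, 150, 220, 300]
abs_error = [0.1, 0.0, 0.2, 0.1, 0.9, 0.8, 1.0, 0.9]

threshold = best_split(distance_km, abs_error)
print(threshold)
```

A threshold found this way turns a continuous column into two candidate filters (`distance_km <= threshold` and `distance_km > threshold`), so the search only has to consider a handful of cut points instead of every possible value.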

## Engineering achievements
By separating data connectors, data structures, and slice finders, Slice Finder offers a flexible framework in which components can be modified or replaced easily. Furthermore, because the metric mechanism is detached from the rest of the system, Slice Finder supports metrics with any custom logic.

## Demo
![GA search for anomalous subset with high MSE](./examples/demo.gif)

## Installation
Install Slice Finder via pip:
```bash
pip install slice_finder
```

## Quick Start
```python
import pandas as pd
from sklearn import metrics
from slice_finder import GAMuPlusLambdaSliceFinder, FlattenedLGBMDataStructure, PandasDataConnector

# Load data
df = pd.read_csv('your_data.csv')

# Initialize Genetic Algorithm Slice Finder with desired data connector and data structure
slice_finder = GAMuPlusLambdaSliceFinder(
    data_connector=PandasDataConnector(
        df=df,
        X_cols=df.drop(['pred', 'target'], axis=1).columns,
        y_col='target',
        pred_col='pred',
    ),
    data_structure=FlattenedLGBMDataStructure(),
    verbose=True,
    random_state=42,
)

# Find anomalous slice
extreme = slice_finder.find_extreme(
    metric=lambda df: metrics.mean_absolute_error(df['target'], df['pred']),
    n_filters=3,
    maximize=True,
)
extreme[0]
```

## Data Connectors
Built in:
* `PandasDataConnector` - Lets you use a pandas `DataFrame` as the data source

Base classes:
* `DataConnector` - Base data connector

More connectors will be added as demand grows.

You can create your custom data connector by extending the base class and implementing the necessary methods.

## Data Structures
Built in:
* `FlattenedLGBMDataStructure` - Uses LightGBM decision trees to quantize and partition the data.

Note: Currently, `FlattenedLGBMDataStructure` works only with `PandasDataConnector` because of LGBM constraints. In addition, this data structure is coupled to the pandas connector because categorical values must be converted to `pd.Categorical`.

Base classes:
* `DataStructure` - Base data structure
* `LGBMDataStructure` - Handles fitting and partitioning of the LGBM trees

More data structures will be added as demand grows.

You can create your custom data structure by extending the base classes and implementing the necessary methods.

## Slice Finders
Built in:
* `GAMuPlusLambdaSliceFinder` - Uses the `eaMuPlusLambda` evolutionary algorithm to search for the most anomalous slice
* `UniformSliceFinder` - Samples uniformly from the data structure

Base classes:
* `SliceFinder` - Base slice finder
* `GASliceFinder` - Extends `SliceFinder` and enables the use of genetic algorithms as search heuristics
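
As a rough illustration of the search strategy, here is a plain-Python sketch of the (mu + lambda) scheme that the name `eaMuPlusLambda` refers to. It is not Slice Finder's implementation and does not call any library: the toy rows, the boolean columns, the mutation-only variation, the minimum slice size of 10, and every function name are assumptions made for the example.

```python
import random

rng = random.Random(0)  # fixed seed so the run is repeatable

# Toy rows: the (hypothetical) model error is large only for far-away cash transactions.
rows = []
for _ in range(400):
    far, cash, weekend = rng.random() < 0.3, rng.random() < 0.4, rng.random() < 0.5
    rows.append({"far": far, "cash": cash, "weekend": weekend,
                 "abs_error": 1.0 if (far and cash) else 0.1})

# Candidate filters, expressed as (column, required value) pairs.
candidates = [(col, val) for col in ("far", "cash", "weekend") for val in (True, False)]


def fitness(individual):
    """Mean absolute error over the rows that match every filter of the individual."""
    matched = [r for r in rows if all(r[col] == val for col, val in individual)]
    if len(matched) < 10:  # assumption: ignore slices that are too small
        return float("-inf")
    return sum(r["abs_error"] for r in matched) / len(matched)


def mutate(individual):
    """Replace one randomly chosen filter with a random candidate filter."""
    genes = list(individual)
    genes[rng.randrange(len(genes))] = rng.choice(candidates)
    return tuple(genes)


def mu_plus_lambda(mu=6, lam=12, n_filters=2, generations=15):
    population = [tuple(rng.choice(candidates) for _ in range(n_filters))
                  for _ in range(mu)]
    history = []  # best fitness after each generation
    for _ in range(generations):
        offspring = [mutate(rng.choice(population)) for _ in range(lam)]
        # (mu + lambda): parents compete with their offspring for the mu survivor
        # slots, so the best individual found so far is never lost.
        population = sorted(population + offspring, key=fitness, reverse=True)[:mu]
        history.append(fitness(population[0]))
    return population[0], history


best, history = mu_plus_lambda()
print(best, history[-1])
```

Because parents survive alongside offspring, the best fitness in `history` can only stay the same or improve from one generation to the next. Uniform sampling, by contrast, evaluates each random candidate independently and does not build on earlier good candidates.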

More algorithms will be added based on demand.

You can create your custom slice finder by extending the base classes and implementing the necessary methods.

## Metrics
Metrics are passed as functions to the `find_extreme` method, allowing you to use any metric or implement your custom logic.
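
As a sketch of what a custom metric might look like, the function below computes a false-negative rate. It assumes, as in the Quick Start, that the metric receives the slice's data with `target` and `pred` columns. The function name, the 0/1 label encoding, and the handling of slices without positives are assumptions for illustration.

```python
def false_negative_rate(df, y_col="target", pred_col="pred"):
    """Share of actual positives (1) that were predicted as negative (0)."""
    pairs = list(zip(df[y_col], df[pred_col]))
    positives = [(t, p) for t, p in pairs if t == 1]
    if not positives:
        return 0.0  # assumption: a slice with no positives counts as having no misses
    return sum(p == 0 for _, p in positives) / len(positives)


# The function only indexes by column name, so a dict of lists can stand in
# for a pandas DataFrame in this small check.
sample = {"target": [1, 1, 0, 1, 0], "pred": [1, 0, 0, 0, 1]}
rate = false_negative_rate(sample)
print(round(rate, 3))
```

Following the Quick Start pattern, such a function would presumably be passed as `metric=false_negative_rate` together with `maximize=True` to look for the slice where the model misses the most positives.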

## Neat things to implement
* Calculation parallelism
* More search algorithms. Ant colony optimization?

## License
This project is licensed under the MIT License.

## Contributing
Contributions are welcome!
Clone the repo, run `poetry install` and start hacking.

## Support
For any questions, bug reports, or feature requests, please open an issue.
            

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/igaloly/slice_finder",
    "name": "slice-finder",
    "maintainer": "",
    "docs_url": null,
    "requires_python": ">=3.10,<4.0",
    "maintainer_email": "",
    "keywords": "slice discovery,slice detection,subgroup discovery,subgroup detection,slice finding",
    "author": "Igal Leikin",
    "author_email": "igaloly@gmail.com",
    "download_url": "https://files.pythonhosted.org/packages/20/88/f0ff53399ca11ae865a645d322efd450f9fa028b0ca299165bd605887d04/slice_finder-0.0.22.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "Slice Finder: A Framework for Slice Discovery",
    "version": "0.0.22",
    "split_keywords": [
        "slice discovery",
        "slice detection",
        "subgroup discovery",
        "subgroup detection",
        "slice finding"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "a8c893c5d813c2fc75047d63f614249c6879138fcdfc26c08b709c30add1594a",
                "md5": "a68a17aa200bff7c24b7881db697bf0e",
                "sha256": "bacec0c32266b368f4f81141d9594e03610c2637d7325148cb84d89d8b5c62ae"
            },
            "downloads": -1,
            "filename": "slice_finder-0.0.22-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "a68a17aa200bff7c24b7881db697bf0e",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.10,<4.0",
            "size": 14064,
            "upload_time": "2023-04-09T04:37:59",
            "upload_time_iso_8601": "2023-04-09T04:37:59.909168Z",
            "url": "https://files.pythonhosted.org/packages/a8/c8/93c5d813c2fc75047d63f614249c6879138fcdfc26c08b709c30add1594a/slice_finder-0.0.22-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "2088f0ff53399ca11ae865a645d322efd450f9fa028b0ca299165bd605887d04",
                "md5": "1fa975747665fb97d1b2842c79d859d9",
                "sha256": "1e2e8a99a5ba485f9ac8746562b2769ba73b2af262281f10a1e6c6e45a755512"
            },
            "downloads": -1,
            "filename": "slice_finder-0.0.22.tar.gz",
            "has_sig": false,
            "md5_digest": "1fa975747665fb97d1b2842c79d859d9",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.10,<4.0",
            "size": 12074,
            "upload_time": "2023-04-09T04:38:01",
            "upload_time_iso_8601": "2023-04-09T04:38:01.525018Z",
            "url": "https://files.pythonhosted.org/packages/20/88/f0ff53399ca11ae865a645d322efd450f9fa028b0ca299165bd605887d04/slice_finder-0.0.22.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2023-04-09 04:38:01",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "github_user": "igaloly",
    "github_project": "slice_finder",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "lcname": "slice-finder"
}
        