# Self-Supervised Learning for Outlier Detection
Detecting outliers can be very challenging, especially when the data contains features that carry no
information about the outlyingness of a point. For supervised problems, there are many methods for selecting
appropriate features. For unsupervised problems, it is much harder to select features that are meaningful
for outlier detection. We propose a method that transforms the unsupervised problem of outlier detection into a
supervised problem, which mitigates the effect of irrelevant features and keeps outliers from hiding in them.
We benchmark our model against common outlier detection models and find clear advantages
when many irrelevant features are present.
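As a rough illustration of this idea, and not the implementation shipped in this package, the sketch below turns outlier detection into a classification task: it draws synthetic noise, trains a classifier to separate real data from noise, and uses the predicted "noise" probability as an outlier score. The uniform noise over the bounding box of the data, the random forest, and the function name `noise_vs_data_scores` are assumptions made for this example; see the paper for the actual algorithm.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier


def noise_vs_data_scores(X, n_estimators=100, random_state=0):
    """Score each row of X by how noise-like a classifier finds it (hypothetical helper)."""
    rng = np.random.default_rng(random_state)
    # Synthetic second class: uniform noise over the bounding box of X (assumption).
    noise = rng.uniform(X.min(axis=0), X.max(axis=0), size=X.shape)
    features = np.vstack([X, noise])
    # Label 0 = real data, label 1 = synthetic noise.
    labels = np.concatenate([np.zeros(len(X), dtype=int), np.ones(len(noise), dtype=int)])
    clf = RandomForestClassifier(
        n_estimators=n_estimators,
        min_samples_leaf=5,  # keeps leaves from isolating single points
        random_state=random_state,
    )
    clf.fit(features, labels)
    # Probability of the noise class for the real rows serves as the outlier score.
    return clf.predict_proba(X)[:, 1]


rng = np.random.default_rng(42)
# 200 points from a standard normal cluster plus one planted outlier in the last row.
X = np.vstack([rng.normal(size=(200, 2)), [[8.0, 8.0]]])
scores = noise_vs_data_scores(X)
print(scores[-1], np.median(scores[:-1]))
```

Points in sparse regions share their tree leaves mostly with noise samples, so they receive a high noise probability, while points in dense regions do not.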
This repository contains the code used for the experiments, as well as instructions to reproduce our results.
For reproduction of our results, please switch to the "publication" branch
or click [here](https://github.com/JanDiers/self-supervised-outlier/tree/publication).
If you use this code in your publication, we ask you to cite our paper. Find the details below.
## Installation
The software can be installed with pip. We recommend installing it in a virtual environment, for example
venv. [See the official guide](https://docs.python.org/3/library/venv.html).
To install our software, run
``pip install noisy_outlier``
## Usage
For outlier detection, you can use the `NoisyOutlierDetector` as follows. The methods follow the scikit-learn syntax:
```python
import numpy as np
from noisy_outlier import NoisyOutlierDetector
X = np.random.randn(50, 2) # some sample data
model = NoisyOutlierDetector()
model.fit(X)
model.predict(X) # returns binary decisions, 1 for outlier, 0 for inlier
model.predict_outlier_probability(X) # predicts probability for being an outlier, this is the recommended way
```
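Since the probabilities are the recommended output, a typical next step is to rank points by them instead of relying on the binary decisions. The snippet below is a minimal sketch of that step: the array `outlier_probability` is a made-up stand-in for the result of `model.predict_outlier_probability(X)`, and the choice of `k = 2` is hypothetical.

```python
import numpy as np

# Stand-in for model.predict_outlier_probability(X); values are invented for illustration.
outlier_probability = np.array([0.05, 0.90, 0.10, 0.40, 0.75])

# Indices of the points, ordered from most to least suspicious.
ranking = np.argsort(outlier_probability)[::-1]

# Flag the k most suspicious points as outliers (k is a hypothetical choice).
k = 2
flags = np.zeros(len(outlier_probability), dtype=int)
flags[ranking[:k]] = 1

print(ranking)  # [1 4 3 2 0]
print(flags)    # [0 1 0 0 1]
```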
The `NoisyOutlierDetector` has several hyperparameters, such as the number of estimators for the classification
problem or the pruning parameter. In our experience, the default values for the `NoisyOutlierDetector` provide stable
results. However, you can also optimize the hyperparameters with a random search. Details
can be found in the paper. Use the `HyperparameterOptimizer` as follows:
````python
import numpy as np
from scipy.stats.distributions import uniform, randint
from sklearn import metrics
from noisy_outlier import HyperparameterOptimizer, PercentileScoring
from noisy_outlier import NoisyOutlierDetector
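X = np.random.randn(50, 5)  # some sample data
# Distributions the random search samples hyperparameters from
grid = dict(
    n_estimators=randint(50, 150),
    ccp_alpha=uniform(0.01, 0.3),
    min_samples_leaf=randint(5, 10),
)
optimizer = HyperparameterOptimizer(
    estimator=NoisyOutlierDetector(),
    param_distributions=grid,
    # Note: `needs_proba` is deprecated since scikit-learn 1.4 in favor of `response_method`
    scoring=metrics.make_scorer(PercentileScoring(0.05), needs_proba=True),
    n_jobs=None,
    n_iter=5,
    cv=3,
)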
optimizer.fit(X)
# The optimizer is itself a `NoisyOutlierDetector`, so you can use it in the same way:
outlier_probability = optimizer.predict_outlier_probability(X)
````
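When defining the search space, note that the SciPy distributions used above are parameterized by `loc` and `scale` or by a half-open integer range, not by lower and upper bounds. The short check below, which uses only SciPy and is independent of this package, shows the ranges that the example search space actually covers.

```python
from scipy.stats import randint, uniform

ccp_alpha = uniform(0.01, 0.3)     # loc=0.01, scale=0.3: values in [0.01, 0.31]
n_estimators = randint(50, 150)    # integers from 50 up to and including 149
min_samples_leaf = randint(5, 10)  # integers from 5 up to and including 9

# Bounds of the continuous distribution via its quantile function.
alpha_low, alpha_high = ccp_alpha.ppf([0.0, 1.0])

# Empirical range of the integer distributions from a fixed-seed sample.
estimator_draws = n_estimators.rvs(size=1000, random_state=0)
leaf_draws = min_samples_leaf.rvs(size=1000, random_state=0)

print(alpha_low, alpha_high)
print(estimator_draws.min(), estimator_draws.max())
print(leaf_draws.min(), leaf_draws.max())
```

To search `ccp_alpha` over [0.01, 0.3] instead, the scale would be the width of the interval, that is `uniform(0.01, 0.29)`.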
Details about the algorithms can be found in our publication. To reproduce our results,
please switch to the "publication" branch or click [here](https://github.com/JanDiers/self-supervised-outlier/tree/publication).
If you use this work in your publication, please cite it as follows:
````
Diers, J, Pigorsch, C. Self-supervised learning for outlier detection. Stat. 2021; 10:e322. https://doi.org/10.1002/sta4.322
````
BibTeX:
````
@article{
https://doi.org/10.1002/sta4.322,
author = {Diers, Jan and Pigorsch, Christian},
title = {Self-supervised learning for outlier detection},
journal = {Stat},
volume = {10},
number = {1},
pages = {e322},
keywords = {hyperparameter, machine learning, noisy signal, outlier detection, self-supervised learning},
doi = {https://doi.org/10.1002/sta4.322},
url = {https://onlinelibrary.wiley.com/doi/abs/10.1002/sta4.322},
eprint = {https://onlinelibrary.wiley.com/doi/pdf/10.1002/sta4.322},
note = {e322 sta4.322},
year = {2021}
}
````
Raw data
{
"_id": null,
"home_page": "https://github.com/jandiers/self-supervised-outlier",
"name": "noisy-outlier",
"maintainer": "",
"docs_url": null,
"requires_python": ">=3.8,<4.0",
"maintainer_email": "",
"keywords": "",
"author": "JanDiers",
"author_email": "jan.diers@uni-jena.de",
"download_url": "https://files.pythonhosted.org/packages/b6/a1/1bc15bbee51a562a375a904648e7a032fbbda553238d079fd8f1c78a38d9/noisy_outlier-0.1.6.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "",
"summary": "Self-Supervised Learning for Outlier Detection",
"version": "0.1.6",
"project_urls": {
"Documentation": "https://github.com/jandiers/self-supervised-outlier",
"Homepage": "https://github.com/jandiers/self-supervised-outlier"
},
"split_keywords": [],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "5e754498ca53b6f40642cd345ec944786b69a8b3031d6050fea9241d1bd6820c",
"md5": "76c1302ec915a1780e6ec544dce344d8",
"sha256": "e09d14a3efdf9554ef86f72bb347aea0c26af25d1ac5d6e061632f7ac6332df7"
},
"downloads": -1,
"filename": "noisy_outlier-0.1.6-py3-none-any.whl",
"has_sig": false,
"md5_digest": "76c1302ec915a1780e6ec544dce344d8",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.8,<4.0",
"size": 7527,
"upload_time": "2023-05-12T10:53:41",
"upload_time_iso_8601": "2023-05-12T10:53:41.837838Z",
"url": "https://files.pythonhosted.org/packages/5e/75/4498ca53b6f40642cd345ec944786b69a8b3031d6050fea9241d1bd6820c/noisy_outlier-0.1.6-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "b6a11bc15bbee51a562a375a904648e7a032fbbda553238d079fd8f1c78a38d9",
"md5": "486dda867108d4a0893ec5d951b55353",
"sha256": "92596ad5bc32215f59b9aa00b8d82ac10623c7db5e1d0cdc261e9581ef080fd1"
},
"downloads": -1,
"filename": "noisy_outlier-0.1.6.tar.gz",
"has_sig": false,
"md5_digest": "486dda867108d4a0893ec5d951b55353",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.8,<4.0",
"size": 6265,
"upload_time": "2023-05-12T10:53:44",
"upload_time_iso_8601": "2023-05-12T10:53:44.714318Z",
"url": "https://files.pythonhosted.org/packages/b6/a1/1bc15bbee51a562a375a904648e7a032fbbda553238d079fd8f1c78a38d9/noisy_outlier-0.1.6.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2023-05-12 10:53:44",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "jandiers",
"github_project": "self-supervised-outlier",
"travis_ci": false,
"coveralls": false,
"github_actions": false,
"lcname": "noisy-outlier"
}