# Coupled-Biased-Random-Walks
Outlier detection for categorical data
[![Build Status](https://travis-ci.com/dkaslovsky/Coupled-Biased-Random-Walks.svg?branch=master)](https://travis-ci.com/dkaslovsky/Coupled-Biased-Random-Walks)
[![Coverage Status](https://coveralls.io/repos/github/dkaslovsky/Coupled-Biased-Random-Walks/badge.svg?branch=master)](https://coveralls.io/github/dkaslovsky/Coupled-Biased-Random-Walks?branch=master)
![PyPI - Python Version](https://img.shields.io/pypi/pyversions/Coupled-Biased-Random-Walks)
### Overview
Python implementation of the Coupled Biased Random Walks (CBRW) outlier detection algorithm described by Pang, Cao, and Chen in https://www.ijcai.org/Proceedings/16/Papers/272.pdf.
__NOTE__: Only Python>=3.7 is supported as of version 2.0.0.
This implementation operates on Python dicts rather than Pandas DataFrames. This makes it trivial to update the model with new observations and is more efficient in certain respects. However, these advantages come at the cost of iterating over a (potentially large) dict of observed values more times than would be necessary with an underlying DataFrame implementation.
If one is working with data previously loaded into a DataFrame, simply use the result of `pandas.DataFrame.to_dict(orient='records')` instead of the DataFrame itself to add observations to the model. Note that because it is common for a DataFrame to fill missing values with `nan`, the detector will ignore features with value `nan` in any observation record. Therefore, there is no need to further preprocess the DataFrame before using its `to_dict` method to create records.
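For instance, a minimal sketch of this workflow (the DataFrame below is illustrative, and the import path follows the package name):
```
import numpy as np
import pandas as pd

from coupled_biased_random_walks import CBRW

# illustrative DataFrame of categorical features
df = pd.DataFrame({
    'Gender': ['male', 'female', 'male'],
    'Income': ['low', np.nan, 'high'],  # the nan-valued feature is ignored for that record
})

# one dict per row; no need to drop or fill the nan values first
observations = df.to_dict(orient='records')

detector = CBRW()
detector.add_observations(observations)
```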
### Installation
This package is hosted on PyPI and can be installed via `pip`:
```
$ pip install coupled-biased-random-walks
```
To instead install from source:
```
$ git clone git@github.com:dkaslovsky/Coupled-Biased-Random-Walks.git
$ cd Coupled-Biased-Random-Walks
$ python setup.py install
```
### Example
Let's run the CBRW detection algorithm on the authors' example data set from the paper:
<img src="./img/example_table.png" width="400">
This data is saved as a [CSV file](./data/CBRW_paper_example.csv) in this repository and is loaded into memory as a list of dicts by [example.py](./example.py). Note that we drop the `Cheat?` column when loading the data, as this is essentially the target variable indicating the anomalous activity to be detected. The detector is instantiated and observations are added as follows:
```
>>> from coupled_biased_random_walks import CBRW
>>> detector = CBRW()
>>> detector.add_observations(observations)
```
where `observations` is an iterable of dicts such as the list loaded from the example CSV file. Once all of the observations have been added, the detector is finalized for scoring by calling `fit()`, and observations can then be scored:
```
>>> detector.fit()
>>> scores = detector.score(observations)
```
Even after fitting and scoring, more observations can be added via `add_observations` and the detector can be fit again before further scoring. The key advantage of this implementation is the ability to incrementally update the model with new observations.
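For instance, a short sketch of this update cycle, where `new_observations` is a hypothetical second batch of observation dicts:
```
>>> detector.add_observations(new_observations)
>>> detector.fit()  # refits using all observations added so far
>>> new_scores = detector.score(new_observations)
```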
The results of scoring the example data are shown below. Note that the only observation (`ID=1`) where fraud was present (`Cheat? = yes`) received the largest anomaly score.
```
Scores:
Observation ID 1: 0.1055
Observation ID 2: 0.0797
Observation ID 3: 0.0741
Observation ID 4: 0.0805
Observation ID 5: 0.0992
Observation ID 6: 0.0752
Observation ID 7: 0.0741
Observation ID 8: 0.0815
Observation ID 9: 0.0728
Observation ID 10: 0.0979
Observation ID 11: 0.0812
Observation ID 12: 0.0887
```
The "value scores" (scores per attribute) for each observation can also be calculated
```
>>> detector.value_scores(observations)
```
and the results for the example data are shown below.
```
Value scores:
Observation ID 1: {'Gender': 0.0088, 'Education': 0.0195, 'Marriage': 0.0379, 'Income': 0.0393}
Observation ID 2: {'Gender': 0.0171, 'Education': 0.0195, 'Marriage': 0.0208, 'Income': 0.0223}
Observation ID 3: {'Gender': 0.0088, 'Education': 0.0195, 'Marriage': 0.0212, 'Income': 0.0247}
Observation ID 4: {'Gender': 0.0088, 'Education': 0.0286, 'Marriage': 0.0208, 'Income': 0.0223}
Observation ID 5: {'Gender': 0.0171, 'Education': 0.0195, 'Marriage': 0.0379, 'Income': 0.0247}
Observation ID 6: {'Gender': 0.0088, 'Education': 0.0209, 'Marriage': 0.0208, 'Income': 0.0247}
Observation ID 7: {'Gender': 0.0088, 'Education': 0.0195, 'Marriage': 0.0212, 'Income': 0.0247}
Observation ID 8: {'Gender': 0.0171, 'Education': 0.0209, 'Marriage': 0.0212, 'Income': 0.0223}
Observation ID 9: {'Gender': 0.0088, 'Education': 0.0209, 'Marriage': 0.0208, 'Income': 0.0223}
Observation ID 10: {'Gender': 0.0088, 'Education': 0.0286, 'Marriage': 0.0212, 'Income': 0.0393}
Observation ID 11: {'Gender': 0.0171, 'Education': 0.0209, 'Marriage': 0.0208, 'Income': 0.0223}
Observation ID 12: {'Gender': 0.0088, 'Education': 0.0195, 'Marriage': 0.0212, 'Income': 0.0393}
```
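Since each per-observation result is keyed by feature name, the attribute contributing most to an observation's score can be read off directly. A small sketch, assuming `value_scores` returns one dict per observation as the printed output above suggests:
```
>>> value_score_dicts = detector.value_scores(observations)
>>> [max(scores, key=scores.get) for scores in value_score_dicts]
```
For observation 1, for example, `Income` (0.0393) is the largest single contributor.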
The entire example can be reproduced by running:
```
$ python example.py
```
The CBRW algorithm can also be used to calculate feature weights. These weights are calculated when the detector is fit and are used during scoring, but can also be used by any other outlier detection algorithm. Thus, the CBRW algorithm can be used simply to calculate feature weights and need not score observations. Feature weights are stored as a property of the detector after the detector's `fit` method has been called:
```
>>> detector = CBRW()
>>> detector.add_observations(observations)
>>> detector.fit()
>>> detector.feature_weights
```
For the example data, the computed feature weights are
```
Feature weights:
{'Gender': 0.1608, 'Education': 0.2627, 'Marriage': 0.2826, 'Income': 0.2939}
```
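Since `feature_weights` is a dict keyed by feature name (as shown above), the features can be ranked by weight directly:
```
>>> weights = detector.feature_weights
>>> sorted(weights, key=weights.get, reverse=True)
['Income', 'Marriage', 'Education', 'Gender']
```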
### Implementation Notes
- For efficiency, the detector state is only (re)computed upon calling `.fit()`. Therefore, adding new observations via `.add_observations()` will not affect scoring until `.fit()` is called. Refitting overwrites the previous state but includes the contribution of all observations added so far.
- The `.add_observations()` and `.fit()` methods can be chained together if one-line training is desired: `detector.add_observations(observations).fit()`.
- An observation containing a feature name or feature value that has not been previously fit will be scored as `nan`. To instead ignore any such "new" features and score an observation based on known features only, initialize the detector with `ignore_unknown=True` (see the sketch below).
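A minimal sketch combining these notes, where `new_observations` stands in for a batch that may contain previously unseen values:
```
>>> detector = CBRW(ignore_unknown=True)
>>> detector.add_observations(observations).fit()
>>> scores = detector.score(new_observations)  # unseen feature values are ignored rather than scored as nan
```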
### Tests
To run unit tests:
```
$ python -m unittest discover -v
```