gower-multiprocessing


- Name: gower-multiprocessing
- Version: 0.2.2
- Summary: Python implementation of Gower's distance, pairwise between records in two data sets, with multiprocessing support
- Upload time: 2024-05-17 20:54:14
- Authors: Michael Yan, Dominic D
- Requires Python: >=2.7
- Keywords: gower, distance, matrix
- Requirements: numpy, scipy, pandas
- Homepage: https://github.com/sbobek/gower
<!-- badges: start -->
[![PyPI version](https://badge.fury.io/py/gower.svg)](https://pypi.org/project/gower-multiprocessing/)
[![Downloads](https://pepy.tech/badge/gower/month)](https://pepy.tech/project/gower-multiprocessing/month)
<!-- badges: end -->

# Introduction

Gower's distance calculation in Python. Gower distance is a distance measure for comparing two records whose attributes are a mix of categorical and numerical values. [Gower (1971) A general coefficient of similarity and some of its properties. Biometrics 27 857–874.](https://www.jstor.org/stable/2528823?seq=1)
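
To make the definition concrete, here is a minimal sketch of the distance between two records in plain Python. It assumes the common formulation: a numeric attribute contributes its absolute difference divided by that attribute's range over the data set, a categorical attribute contributes 0 on a match and 1 otherwise, and the contributions are averaged with equal weights. The helper `gower_pair`, its arguments, and the range values are hypothetical names and numbers chosen for illustration; they are not part of this package's API.

```python
def gower_pair(x, y, is_categorical, ranges):
    """Equal-weight Gower distance between two records (illustrative only)."""
    contributions = []
    for xi, yi, categorical, value_range in zip(x, y, is_categorical, ranges):
        if categorical:
            # Categorical attribute: 0 on an exact match, 1 otherwise.
            contributions.append(0.0 if xi == yi else 1.0)
        elif value_range:
            # Numeric attribute: absolute difference scaled by the attribute's range.
            contributions.append(abs(xi - yi) / value_range)
        else:
            # A constant numeric attribute cannot separate records.
            contributions.append(0.0)
    return sum(contributions) / len(contributions)


# Attributes: age, gender, civil_status, salary
first = (21, "M", "MARRIED", 3000.0)
second = (21, "M", "SINGLE", 1200.0)
is_categorical = (False, True, True, False)
# Ranges (max - min) of the numeric attributes; assumed values for this sketch.
ranges = (11.0, None, None, 30900.0)

distance = gower_pair(first, second, is_categorical, ranges)
print(round(distance, 4))
```

Only `civil_status` and `salary` differ here, so the result is the average of one full mismatch, one small scaled difference, and two zeros.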

More details and examples can be found on my personal website: https://www.thinkdatascience.com/post/2019-12-16-introducing-python-package-gower/

Core functions were written by [Marcelo Beckmann](https://sourceforge.net/projects/gower-distance-4python/files/).

Multiprocessing support was added by [Szymon Bobek](https://github.com/sbobek).

# Examples

## Installation

```
pip install gower-multiprocessing
```

## Generate some data

```python
import numpy as np
import pandas as pd
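# The distribution is named gower-multiprocessing, but a hyphen is not valid in a
# Python module name; the importable module is assumed to be gower_multiprocessing.
import gower_multiprocessing as gower

# Mixed numeric/categorical records; the last record is entirely missing.
Xd = pd.DataFrame({
    'age': [21, 21, 19, 30, 21, 21, 19, 30, None],
    'gender': ['M', 'M', 'N', 'M', 'F', 'F', 'F', 'F', None],
    'civil_status': ['MARRIED', 'SINGLE', 'SINGLE', 'SINGLE', 'MARRIED', 'SINGLE', 'WIDOW', 'DIVORCED', None],
    'salary': [3000.0, 1200.0, 32000.0, 1800.0, 2900.0, 1100.0, 10000.0, 1500.0, None],
    'has_children': [1, 0, 1, 1, 1, 0, 0, 1, None],
    'available_credit': [2200, 100, 22000, 1100, 2000, 100, 6000, 2200, None],
})
# A two-record subset (rows 1 and 2) of the same data.
Yd = Xd.iloc[1:3, :]
# Array versions of both data sets.
X = np.asarray(Xd)
Y = np.asarray(Yd)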
```

## Find the distance matrix

```python
gower.gower_matrix(X)
```




    array([[0.        , 0.3590238 , 0.6707398 , 0.31787416, 0.16872811,
            0.52622986, 0.59697855, 0.47778758,        nan],
           [0.3590238 , 0.        , 0.6964303 , 0.3138769 , 0.523629  ,
            0.16720603, 0.45600235, 0.6539635 ,        nan],
           [0.6707398 , 0.6964303 , 0.        , 0.6552807 , 0.6728013 ,
            0.6969697 , 0.740428  , 0.8151941 ,        nan],
           [0.31787416, 0.3138769 , 0.6552807 , 0.        , 0.4824794 ,
            0.48108295, 0.74818605, 0.34332284,        nan],
           [0.16872811, 0.523629  , 0.6728013 , 0.4824794 , 0.        ,
            0.35750175, 0.43237334, 0.3121036 ,        nan],
           [0.52622986, 0.16720603, 0.6969697 , 0.48108295, 0.35750175,
            0.        , 0.2898751 , 0.4878362 ,        nan],
           [0.59697855, 0.45600235, 0.740428  , 0.74818605, 0.43237334,
            0.2898751 , 0.        , 0.57476616,        nan],
           [0.47778758, 0.6539635 , 0.8151941 , 0.34332284, 0.3121036 ,
            0.4878362 , 0.57476616, 0.        ,        nan],
           [       nan,        nan,        nan,        nan,        nan,
                   nan,        nan,        nan,        nan]], dtype=float32)
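
The last row and column are `nan`, which matches the ninth record having every attribute missing. Before handing such a matrix to a clustering or nearest-neighbour routine, one option is to drop the records whose distances are all undefined. The following is a minimal NumPy sketch on a small made-up matrix; the values and variable names are illustrative and are not output of this package.

```python
import numpy as np

# Small made-up distance matrix; the third record has no usable attributes.
D = np.array([[0.0, 0.4, np.nan],
              [0.4, 0.0, np.nan],
              [np.nan, np.nan, np.nan]], dtype=np.float32)

# A record is kept if at least one of its distances is defined.
valid = ~np.isnan(D).all(axis=1)

# Select the same records on both axes so the matrix stays square.
D_clean = D[np.ix_(valid, valid)]
print(D_clean.shape)
```

Keeping the boolean mask `valid` around makes it possible to map positions in the reduced matrix back to the original record indices.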


## Find Top n results

```python
gower.gower_topn(Xd.iloc[0:2,:], Xd.iloc[:,], n = 5)
```




    {'index': array([4, 3, 1, 7, 5]),
     'values': array([0.16872811, 0.31787416, 0.3590238 , 0.47778758, 0.52622986],
           dtype=float32)}
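
The result can be cross-checked against the first row of the distance matrix above. The sketch below ranks that row in plain Python, under the assumption that the query record itself (index 0) and undefined (`nan`) distances are left out. `top_n` is a hypothetical helper written for this check, not the package's implementation.

```python
import math

# First row of the distance matrix shown earlier (distances from record 0).
row = [0.0, 0.3590238, 0.6707398, 0.31787416, 0.16872811,
       0.52622986, 0.59697855, 0.47778758, math.nan]


def top_n(distances, query_index, n):
    """Return the n closest records, skipping the query itself and nan entries."""
    candidates = [
        (d, i) for i, d in enumerate(distances)
        if i != query_index and not math.isnan(d)
    ]
    candidates.sort()  # ascending by distance, ties broken by index
    closest = candidates[:n]
    return {"index": [i for _, i in closest], "values": [d for d, _ in closest]}


result = top_n(row, query_index=0, n=5)
print(result["index"])
print(result["values"])
```

Under those assumptions the ranking gives the same indices and values as the output shown above.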


## Performance comparison with single-process version

```
Single process (DS-size: 10000, time:   15.58 sec.)	█
Multi process  (DS-size: 10000, time:    2.93 sec.)	

Single process (DS-size: 20000, time:   54.30 sec.)	█████
Multi process  (DS-size: 20000, time:   11.57 sec.)	█

Single process (DS-size: 30000, time:  119.80 sec.)	███████████
Multi process  (DS-size: 30000, time:   24.86 sec.)	██

Single process (DS-size: 40000, time:  202.65 sec.)	████████████████████
Multi process  (DS-size: 40000, time:   41.77 sec.)	████

Single process (DS-size: 50000, time:  318.64 sec.)	███████████████████████████████
Multi process  (DS-size: 50000, time:   68.36 sec.)	██████

Single process (DS-size: 60000, time:  469.64 sec.)	██████████████████████████████████████████████
Multi process  (DS-size: 60000, time:   96.24 sec.)	█████████

Single process (DS-size: 70000, time:  653.27 sec.)	█████████████████████████████████████████████████████████████████
Multi process  (DS-size: 70000, time:  143.31 sec.)	██████████████

Single process (DS-size: 80000, time:  857.04 sec.)	█████████████████████████████████████████████████████████████████████████████████████
Multi process  (DS-size: 80000, time:  181.60 sec.)	██████████████████

Single process (DS-size: 90000, time: 1129.21 sec.)	████████████████████████████████████████████████████████████████████████████████████████████████████████████████
Multi process  (DS-size: 90000, time:  252.36 sec.)	█████████████████████████
```
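
The speed-up implied by these timings can be computed directly. The snippet below only reuses the numbers from the listing above; no new measurements are involved.

```python
# (data set size, single-process seconds, multi-process seconds),
# copied from the listing above.
timings = [
    (10000, 15.58, 2.93),
    (20000, 54.30, 11.57),
    (30000, 119.80, 24.86),
    (40000, 202.65, 41.77),
    (50000, 318.64, 68.36),
    (60000, 469.64, 96.24),
    (70000, 653.27, 143.31),
    (80000, 857.04, 181.60),
    (90000, 1129.21, 252.36),
]

# Speed-up factor = single-process time / multi-process time.
speedups = {size: single / multi for size, single, multi in timings}

for size, factor in speedups.items():
    print(f"{size:>6}: {factor:.2f}x")
```

On these figures the multi-process version is roughly 4.5 to 5.3 times faster at every data set size.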


            
