multi-column-distribution-sampler


Name: multi-column-distribution-sampler
Version: 0.0.1
Summary: Distribution based sampling for multiple categorical columns
upload_time: 2024-06-17 02:19:53
requires_python: >=3.7
license: MIT License Copyright (c) 2024 Alan Abraham Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
keywords: dataframe, distribution, distribution sampler, multi column sampler, multiple, multiple column sampling, pandas, representative sampling, sampler, sampling
# MULTI COLUMN DISTRIBUTION SAMPLER

Provides a function to draw a sample from a given dataframe while preserving the distribution of the columns of interest.
Also includes a function to determine the minimum sample size needed to preserve the distribution of those columns.

## Overview

To draw a sample from a pandas `DataFrame` we can use its `sample` method, but this does not guarantee that the sample is representative of the distributions of the columns of interest. `train_test_split` from `sklearn.model_selection` supports stratified sampling through its `stratify` argument, but things get complex once the sample has to follow the exact distribution of multiple columns. This package abstracts away that logic and provides functions to determine the minimum sample size required for a representative sample and to draw the sample itself when multiple columns are involved.

Because the package ensures an exactly representative sample, the resulting sample will probably be somewhat larger than the desired sample size. How much larger depends on the distributions of the columns in the original dataframe and on the number of columns factored into the sampling.
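To make the idea concrete, here is a minimal pandas-only sketch of exact proportional sampling over the joint categories of several columns. It is not this package's implementation or API: the function names `min_exact_sample_size` and `proportional_sample`, the toy data, and the choice to preserve the joint distribution exactly are all assumptions made for illustration. Under that assumption, the smallest exact sample size is the row count divided by the greatest common divisor of the joint category counts.

```python
from functools import reduce
from math import gcd

import pandas as pd


def min_exact_sample_size(df, columns):
    """Smallest n for which every joint category count scales to a whole number.

    Hypothetical helper, not part of the package.
    """
    counts = df.groupby(columns).size()
    return len(df) // reduce(gcd, counts.tolist())


def proportional_sample(df, columns, n, random_state=None):
    """Draw n rows so each combination of `columns` keeps its original share.

    Hypothetical helper, not part of the package.
    """
    total = len(df)
    parts = []
    for _, group in df.groupby(columns):
        # Each group must contribute exactly len(group) * n / total rows.
        k, remainder = divmod(len(group) * n, total)
        if remainder:
            raise ValueError("n does not preserve the distribution exactly")
        parts.append(group.sample(n=k, random_state=random_state))
    return pd.concat(parts)


# Toy data: joint counts are (red, S)=4, (red, L)=2, (blue, S)=6.
df = pd.DataFrame({
    "color": ["red"] * 6 + ["blue"] * 6,
    "size": ["S"] * 4 + ["L"] * 2 + ["S"] * 6,
})

n_min = min_exact_sample_size(df, ["color", "size"])  # 12 rows / gcd(4, 2, 6) = 6
sample = proportional_sample(df, ["color", "size"], n_min, random_state=0)
print(n_min)
print(sample.groupby(["color", "size"]).size())
```

Any multiple of the minimum size also preserves the proportions exactly, which is one way to see why an exactly representative sample can end up larger than the size originally requested.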

## Features

* Multi column based representative sampling
* Handles both continuous and categorical features
* Uses the Gini index to measure the impurity of a partition
* Includes basic tree printing functionality for tree visualization

## Requirements

* Python 3.7 or later
* [Optional] Any text editor or IDE of your choice for editing the code

## Installation

multi_column_distribution_sampler can be installed using either of the following commands:

```
pip install multi_column_distribution_sampler
```
or
```
pip3 install multi_column_distribution_sampler
```
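After installing, one way to confirm that the distribution is visible to your interpreter is to query its metadata from Python. This is a small sketch that uses only the standard library (`importlib.metadata`, available from Python 3.8). The helper name `installed_version` is hypothetical, and depending on the Python version the distribution name may need to be spelled with hyphens rather than underscores.

```python
from importlib import metadata  # standard library, Python 3.8+


def installed_version(dist_name="multi_column_distribution_sampler"):
    """Return the installed version string, or None if the distribution is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# Prints the version string if the package is installed, otherwise None.
print(installed_version())
```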

## Dependencies

multi_column_distribution_sampler depends on the following packages:

* numpy
* pandas

## License

MIT License
            

Raw data

            {
    "_id": null,
    "home_page": null,
    "name": "multi-column-distribution-sampler",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.7",
    "maintainer_email": null,
    "keywords": "dataframe, distribution, distribution sampler, multi column sampler, multiple, multiple column sampling, pandas, representative sampling, sampler, sampling",
    "author": null,
    "author_email": "Alan Abraham <alan.abraham.tm@gmail.com>",
    "download_url": "https://files.pythonhosted.org/packages/c3/03/04aeba897a28d9f71ed553884aa819245f67ab393cf7ab8e4789ceef5b0c/multi_column_distribution_sampler-0.0.1.tar.gz",
    "platform": null,
    "description": "# MULTI COLUMN DISTRIBUTION SAMPLER\n\nFunction to draw a sample from a given dataframe while maintaining the same distribution of the columns of interest.\nAlso includes a function to determine the minimum sample size needed to ensure the distribution of the columns of interest in the given dataframe.\n\n## Overview\n\nIn order to draw a sample from a `pandas dataframe` we can use the `sample` function. But, this doesn't ensure that the sample drawn would be representative of the distributions of the columns of interest in the dataframe. Although we can use the `train_test_split` with `stratify` from `sklearn.model_selection` for stratified sampling, things get complex once we want the sample to follow the exact distribution for multiple columns. This package abstracts away all the logic and provides you with functions that can be used to determine the minimum sample size reqired for a distributive sample and also the sample itself when involving multiple columns. Since this package ensures a perfect representative sample, the resulting sample will probably be a bit larger compared to the sample size desired. 
This increase in sample size will be depending on the distributions of the columns in the actual dataframe and also the number of columns factored in while sampling.\n\n## Features\n\n* Multi column based representative sampling\n* Handles both continuous and categorical features\n* Uses Gini index to measure impurity of partion\n* Includes basic tree printing functionality for tree visualization\n\n# Requirements\n\nPython 3.x\n[Optional] Any text editor or IDE of your choice for editing the code.\n\n# Installation\n\nmulti_column_distribution_sampler can be installed using the following command:\n\n```\npip install multi_column_distribution_sampler\n```\nor\n```\npip3 install multi_column_distribution_sampler\n```\n\n# Dependencies\n\nmulti_column_distribution_sampler depends on the following packages:-\n\n* numpy\n* pandas\n\n## License\n\nMIT License",
    "bugtrack_url": null,
    "license": "MIT License  Copyright (c) 2024 Alan Abraham  Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the \"Software\"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:  The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.  THE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.",
    "summary": "Distribution based sampling for multiple categorical columms",
    "version": "0.0.1",
    "project_urls": {
        "Bug Tracker": "https://github.com/a-abrhm/multi_column_distribution_sampler/issues",
        "Homepage": "https://github.com/a-abrhm/multi_column_distribution_sampler"
    },
    "split_keywords": [
        "dataframe",
        " distribution",
        " distribution sampler",
        " multi column sampler",
        " multiple",
        " multiple column sampling",
        " pandas",
        " representative sampling",
        " sampler",
        " sampling"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "ea68b087f0fef34127dec308c9e19a45f933925382d3bf9f50b37afdf66e3589",
                "md5": "d4c5d9d82f4d0eccce6abb3a0dc21bd0",
                "sha256": "76a8db0333fb3e8e8e8d69a32b03f68ee1a7f1653aaff90b080f677ac48f6e74"
            },
            "downloads": -1,
            "filename": "multi_column_distribution_sampler-0.0.1-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "d4c5d9d82f4d0eccce6abb3a0dc21bd0",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.7",
            "size": 5358,
            "upload_time": "2024-06-17T02:19:52",
            "upload_time_iso_8601": "2024-06-17T02:19:52.107515Z",
            "url": "https://files.pythonhosted.org/packages/ea/68/b087f0fef34127dec308c9e19a45f933925382d3bf9f50b37afdf66e3589/multi_column_distribution_sampler-0.0.1-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "c30304aeba897a28d9f71ed553884aa819245f67ab393cf7ab8e4789ceef5b0c",
                "md5": "61c5c3357a25ef61948e852062c0ff3c",
                "sha256": "789713d49239424ad78f4d40ae56a86667c14db13c06645720d1794dacb6667c"
            },
            "downloads": -1,
            "filename": "multi_column_distribution_sampler-0.0.1.tar.gz",
            "has_sig": false,
            "md5_digest": "61c5c3357a25ef61948e852062c0ff3c",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.7",
            "size": 4755,
            "upload_time": "2024-06-17T02:19:53",
            "upload_time_iso_8601": "2024-06-17T02:19:53.512359Z",
            "url": "https://files.pythonhosted.org/packages/c3/03/04aeba897a28d9f71ed553884aa819245f67ab393cf7ab8e4789ceef5b0c/multi_column_distribution_sampler-0.0.1.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-06-17 02:19:53",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "a-abrhm",
    "github_project": "multi_column_distribution_sampler",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "requirements": [],
    "lcname": "multi-column-distribution-sampler"
}
        