absplit

* Name: absplit
* Version: 1.4.4
* Summary: Generates A/B/n test groups
* Upload time: 2024-02-10 19:07:34
* Requires Python: <=3.11
* Keywords: absplit, a/b test, ab test, ab split, split, set formation, group formation
* Requirements: pygad, scikit-learn, numpy, pandas, seaborn

<a name="readme-top"></a>

<div align="center">
<img src="https://raw.githubusercontent.com/cormac-rynne/absplit/main/images/logo.jpeg" width="460" height="140">
<h3><strong>ABSplit</strong></h3>
Split your data into matching A/B/n groups

![license](https://img.shields.io/badge/License-MIT-blue.svg)
![version](https://img.shields.io/badge/version-1.4.4-blue.svg)
![version](https://img.shields.io/badge/python-3-orange.svg)

</div>

<details open>
  <summary>Table of Contents</summary>
  <ol>
    <li>
      <a href="#about-the-project">About The Project</a>
      <ul>
        <li><a href="#calculation">Calculation</a></li>
      </ul>
    </li>
    <li>
      <a href="#getting-started">Getting Started</a>
      <ul>
        <li><a href="#installation">Installation</a></li>
      </ul>
    </li>
    <li>
      <a href="#tutorials">Tutorials</a>
      <ul>
        <li><a href="#do-it-yourself">Do it yourself</a></li>
      </ul>
    </li>
    <li><a href="#usage">Usage</a></li>
    <li><a href="#api-reference">API Reference</a></li>
    <li><a href="#contributing">Contributing</a></li>
    <li><a href="#license">License</a></li>
  </ol>
</details>

## About the project
ABSplit is a Python package that uses a genetic algorithm to generate A/B, A/B/C, or A/B/n test splits that are as equal as possible.

The project aims to provide a convenient and efficient way of splitting population data into distinct 
groups (ABSplit), as well as finding matching samples that closely resemble a given original sample (Match).


Whether you have static population data or time series data, this Python package simplifies the process and allows you to 
analyze and manipulate your population data.

This covers the following use cases:
1. **ABSplit class**: Splitting an entire population into n groups by given proportions
2. **Match class**: Finding a matching group in a population for a given sample

### Calculation

ABSplit standardises the population data (so each metric is weighted as equally as possible), then pivots it into a 
three-dimensional array, by metrics, individuals, and dates. 

The selection from the genetic algorithm, along with its inverse, is applied across this array with broadcasting to 
compute the dot products between the selection and the population data.

This yields the aggregated metrics for each group. The mean squared error (MSE) between the groups is calculated 
for each metric and then summed across metrics. The objective of the cost function is to minimize this 
overall MSE, so that the metrics of the groups track each other as closely as possible over time.
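
The steps above can be sketched in NumPy. This is a minimal illustration rather than the package's actual implementation: the function name `split_cost` is hypothetical, the input is assumed to be already standardised, and the sketch covers only two groups of comparable size (a 0/1 selection vector and its inverse).

```python
import numpy as np


def split_cost(data: np.ndarray, selection: np.ndarray) -> float:
    """Cost of one candidate split (hypothetical helper, not part of absplit).

    data:      array of shape (n_metrics, n_individuals, n_dates),
               assumed to be standardised beforehand.
    selection: 0/1 vector of length n_individuals; 1 = group A, 0 = group B.
    """
    selection = np.asarray(selection, dtype=float)
    inverse = 1.0 - selection

    # Dot product over the individuals axis gives each group's
    # aggregated metric per date: shape (n_metrics, n_dates).
    group_a = np.einsum("mid,i->md", data, selection)
    group_b = np.einsum("mid,i->md", data, inverse)

    # MSE between the two groups for each metric, then summed over metrics.
    mse_per_metric = ((group_a - group_b) ** 2).mean(axis=1)
    return float(mse_per_metric.sum())


# One metric, four individuals, two dates
data = np.array([[[1.0, 2.0], [1.0, 2.0], [3.0, 4.0], [3.0, 4.0]]])
balanced = split_cost(data, np.array([1, 0, 1, 0]))    # groups are identical
unbalanced = split_cost(data, np.array([1, 1, 0, 0]))  # groups differ by 4 on every date
```

A genetic algorithm would evaluate many candidate `selection` vectors and keep those with the lowest cost.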

<div align="center">
  <img src="https://raw.githubusercontent.com/cormac-rynne/absplit/main/images/calculation_diagram.png" width="80%">
</div>

<p align="right">(<a href="#readme-top">back to top</a>)</p>

## Getting Started
Use the package manager [pip](https://pip.pypa.io/en/stable/) to install ABSplit and its prerequisites.

ABSplit requires `pygad==3.0.1`.

### Installation

```bash
pip install absplit
```
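
After installing, one way to check which versions ended up in your environment is the standard-library `importlib.metadata` module. The helper name `installed_version` below is purely illustrative.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package: str):
    """Return the installed version string of a package, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Report the versions of ABSplit and its pinned dependency, if installed
for name in ("absplit", "pygad"):
    print(name, installed_version(name))
```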

<p align="right">(<a href="#readme-top">back to top</a>)</p>

## Tutorials
Please see [this colab](https://colab.research.google.com/drive/1gL7dxDJrtVoO5m1mSUWutdr7yas7sZwI?usp=sharing) for 
a range of examples on how to use ABSplit and Match.

### Do it yourself
See [this colab](https://colab.research.google.com/drive/1SlCNnOtN4WCDTSJHsFrZtI7gKcXEl8-C?usp=sharing) to learn how 
ABSplit works under the hood, and how to build your own group splitting tool using 
[PyGAD](https://pypi.org/project/pygad/).


<p align="right">(<a href="#readme-top">back to top</a>)</p>

## Usage

```python
from absplit import ABSplit
import pandas as pd
import datetime
import numpy as np

# Synthetic data
data_dct = {
    'date': [datetime.date(2030,4,1) + datetime.timedelta(days=x) for x in range(3)]*5,
    'country': ['UK'] * 15,
    'region': [item for sublist in [[x]*6 for x in ['z', 'y']] for item in sublist] + ['x']*3,
    'city': [item for sublist in [[x]*3 for x in ['a', 'b', 'c', 'd', 'e']] for item in sublist],
    'metric1': np.arange(0, 15, 1),
    'metric2': np.arange(0, 150, 10)
}
df = pd.DataFrame(data_dct)

# Identify which columns are metrics, which is the time period, and what to split on
kwargs = {
    'metrics': ['metric1', 'metric2'],
    'date_col': 'date',
    'splitting': 'city'
}

# Initialise
ab = ABSplit(
    df=df,
    splits=[.5, .5],  # Split into 2 groups of equal size (named `splits` in the API reference)
    **kwargs,
)

# Generate split
ab.run()

# Visualise generation fitness
ab.fitness()

# Visualise data
ab.visualise()

# Extract bin splits
df = ab.results

# Extract data aggregated by bins
df_agg = ab.aggregations

# Extract summary statistics
df_dist = ab.distributions    # Population counts between groups
df_rmse = ab.rmse             # RMSE between groups for each metric
df_mape = ab.mape             # MAPE between groups for each metric
df_totals = ab.totals         # Total sum of each metric for each group

```
<p align="right">(<a href="#readme-top">back to top</a>)</p>

## API Reference
### ABSplit 
`ABSplit(df, metrics, splitting, date_col=None, ga_params={}, metric_weights={}, splits=[0.5, 0.5], size_penalty=0)`

Splits the population into n groups that are mutually exclusive and collectively exhaustive.

Arguments:
* `df` (pd.DataFrame): Dataframe of population to be split
* `metrics` (str, list): Name of, or list of names of, metric columns in DataFrame to be considered in split
* `splitting` (str): Name of column that represents individuals in the population that is getting split. For example, if 
you wanted to split a dataframe of US counties, this would be the county name column
* `date_col` (str, optional): Name of column that represents time periods, if applicable. If left empty, it will
perform a static split, i.e. not across a time series (default: `None`)
* `ga_params` (dict, optional): Parameters passed to the genetic algorithm class `pygad.GA`, see 
[here](https://pygad.readthedocs.io/en/latest/README_pygad_ReadTheDocs.html#pygad-ga-class) for arguments you can pass
(default: `{}`)
* `splits` (list, optional): How many groups to split into, and relative size of the groups (default: `[0.5, 0.5]`,
2 groups of equal size)
* `size_penalty` (float, optional): Penalty weighting for differences in the population count between groups 
(default: `0`)
* `sum_penalty` (float, optional): Penalty weighting for the sum of metrics over time. If this is greater than zero,
it will add a penalty to the cost function that will try and make the sum of each metric the same for each group 
(default: `0`)
* `cutoff_date` (str, optional): Cutoff date between fitting and validation data. For example, if you have data between 
2023-01-01 and 2023-03-01, and the cutoff date is 2023-02-01, the algorithm will only perform the fit on data between 
2023-01-01 and 2023-02-01. If `None`, it will fit on all available data. If a cutoff date is provided, RMSE scores
(available via the `ab.rmse` attribute) cover only the validation period (i.e., from 2023-02-01 to the end of the 
time series)
* `missing_dates` (str, optional): How to deal with missing dates in time series data, options: `['drop_dates',
'drop_population', '0', 'median']` (default: `'median'`)
* `metric_weights` (dict, optional): Weights for each metric in the data. If you want the splitting to focus on 
one metric more than the others, you can prioritise it here (default: `{}`)
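
To illustrate what `cutoff_date` means for the data, here is a small standard-library sketch that partitions a date range into a fitting window and a validation window. The helper `split_by_cutoff` is hypothetical and not part of ABSplit. Which side the cutoff date itself falls on is not stated above, so assigning it to the validation window is an assumption.

```python
import datetime as dt


def split_by_cutoff(dates, cutoff):
    """Partition dates into (fit, validation) windows.

    Assumption: the cutoff date itself belongs to the validation window.
    """
    fit = [d for d in dates if d < cutoff]
    validation = [d for d in dates if d >= cutoff]
    return fit, validation


# Daily data from 2023-01-01 to 2023-03-01 inclusive (60 dates)
start = dt.date(2023, 1, 1)
dates = [start + dt.timedelta(days=i) for i in range(60)]

fit, validation = split_by_cutoff(dates, dt.date(2023, 2, 1))
print(len(fit), "fitting dates,", len(validation), "validation dates")
```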


### Match 
`Match(population, sample, metrics, splitting, date_col=None, ga_params={}, metric_weights={})`

Takes DataFrame `sample` and finds a comparable group in `population`.

Arguments:
* `population` (pd.DataFrame): Population to search  for comparable group (**Must exclude sample data**)
* `sample` (pd.DataFrame): Sample we are looking to find a match for.
* `metrics` (str, list): Name of, or list of names of, metric columns in DataFrame
* `splitting` (str): Name of column that represents individuals in the population that is getting split
* `date_col` (str, optional): Name of column that represents time periods, if applicable. If left empty, it will
perform a static split, i.e. not across a time series (default: `None`)
* `ga_params` (dict, optional): Parameters passed to the genetic algorithm class `pygad.GA`, see 
[here](https://pygad.readthedocs.io/en/latest/README_pygad_ReadTheDocs.html#pygad-ga-class) for arguments you can pass
(default: `{}`)
* `splits` (list, optional): How many groups to split into, and relative size of the groups (default: `[0.5, 0.5]`,
2 groups of equal size)
* `metric_weights` (dict, optional): Weights for each metric in the data. If you want the matching to focus on 
one metric more than the others, you can prioritise it here (default: `{}`)
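
Because `population` must exclude the sample data, the two DataFrames typically need to be separated before calling `Match`. The pandas sketch below shows one way to do that. The column names and the choice of sample are made up for illustration, and the `Match` call itself is left out.

```python
import pandas as pd

# Toy data: two rows per city (illustrative values only)
df = pd.DataFrame({
    "city": ["a", "a", "b", "b", "c", "c"],
    "metric1": [1, 2, 3, 4, 5, 6],
})

# Hypothetical choice: city 'a' is the sample we want a match for
sample_cities = ["a"]
is_sample = df["city"].isin(sample_cities)

sample = df[is_sample]
population = df[~is_sample]  # everything that is not in the sample

# Sanity check: no individual appears in both DataFrames
overlap = set(sample["city"]) & set(population["city"])
```

The resulting `population` and `sample` can then be passed to `Match` together with `metrics` and `splitting`.
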
<p align="right">(<a href="#readme-top">back to top</a>)</p>

## Contributing

I welcome contributions to ABSplit! For major changes, please open an issue first
to discuss what you would like to change.

Please make sure to update tests as appropriate.

<p align="right">(<a href="#readme-top">back to top</a>)</p>

## License

[MIT](https://choosealicense.com/licenses/mit/)

<p align="right">(<a href="#readme-top">back to top</a>)</p>
            

Raw data

            {
    "_id": null,
    "home_page": null,
    "name": "absplit",
    "maintainer": null,
    "docs_url": null,
    "requires_python": "<=3.11",
    "maintainer_email": null,
    "keywords": "absplit,a/b test,ab test,ab split,split,set formation,group formation",
    "author": null,
    "author_email": "Cormac Rynne <cormac.ry@gmail.com>",
    "download_url": "https://files.pythonhosted.org/packages/9f/02/1b86fb974a4ed9024063c9060aabd3beaf20c9d7997619786bf129b01ce3/absplit-1.4.4.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": null,
    "summary": "Generates A/B/n test groups",
    "version": "1.4.4",
    "project_urls": {
        "Home": "https://github.com/cormac-rynne/absplit"
    },
    "split_keywords": [
        "absplit",
        "a/b test",
        "ab test",
        "ab split",
        "split",
        "set formation",
        "group formation"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "579683749e5a5905f120af2f1e552b0dbd874417a439d5f960a401e83eca1d72",
                "md5": "3915de92d6018e45f03dabc744941397",
                "sha256": "937d0188a193549f5f20ebe29c3bbc1cb5091ed6cabc7940a05d97e7abae25b4"
            },
            "downloads": -1,
            "filename": "absplit-1.4.4-py2.py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "3915de92d6018e45f03dabc744941397",
            "packagetype": "bdist_wheel",
            "python_version": "py2.py3",
            "requires_python": "<=3.11",
            "size": 314565,
            "upload_time": "2024-02-10T19:07:32",
            "upload_time_iso_8601": "2024-02-10T19:07:32.084659Z",
            "url": "https://files.pythonhosted.org/packages/57/96/83749e5a5905f120af2f1e552b0dbd874417a439d5f960a401e83eca1d72/absplit-1.4.4-py2.py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "9f021b86fb974a4ed9024063c9060aabd3beaf20c9d7997619786bf129b01ce3",
                "md5": "e093e5cfa83c878c886e2499fc8b7ec0",
                "sha256": "56343e5ead5d2cc4be78b72eed78ed95dc11393b9a8c55fb3ea9929fc10daf8e"
            },
            "downloads": -1,
            "filename": "absplit-1.4.4.tar.gz",
            "has_sig": false,
            "md5_digest": "e093e5cfa83c878c886e2499fc8b7ec0",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": "<=3.11",
            "size": 320943,
            "upload_time": "2024-02-10T19:07:34",
            "upload_time_iso_8601": "2024-02-10T19:07:34.819617Z",
            "url": "https://files.pythonhosted.org/packages/9f/02/1b86fb974a4ed9024063c9060aabd3beaf20c9d7997619786bf129b01ce3/absplit-1.4.4.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-02-10 19:07:34",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "cormac-rynne",
    "github_project": "absplit",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "requirements": [
        {
            "name": "pygad",
            "specs": [
                [
                    "==",
                    "3.0.1"
                ]
            ]
        },
        {
            "name": "scikit-learn",
            "specs": []
        },
        {
            "name": "numpy",
            "specs": []
        },
        {
            "name": "pandas",
            "specs": []
        },
        {
            "name": "seaborn",
            "specs": []
        }
    ],
    "lcname": "absplit"
}
        