data-decomposition-platform

Name: data-decomposition-platform
Version: 0.0.1
Home page: https://github.com/google-marketing-solutions/data_decomposition_platform
Summary: Data Decomposition Platform library for data disaggregation.
Upload time: 2024-02-19 16:42:11
Author: Google gTech Ads EMEA Privacy Data Science Team
Requires Python: >=3.10
License: Apache 2.0
Keywords: 1pd, privacy, ai, ml, aggregated data
# gTech Ads - Data Decomposition Package
### Data Decomposition Package is an open-source Python package to decompose aggregated observations down to a single observation level.
##### This is not an official Google product.

[Introduction](#introduction) •
[Decomposition process](#decomposition-process-with-disaggregator-methods)

# Introduction

In a privacy-first world, an increasing volume of data is collected at an aggregated level, which means that certain data points no longer carry information at the single-observation level. Instead, these values are aggregated by a shared id, where the shared id could be a purchase date, a purchase time window, or similar.

This level of aggregation can be a major challenge for data pipelines and projects that expect data structured around individual-level observations. Decomposing aggregates into single observations that are statistically representative of the whole allows privacy-first data to flow through existing pipelines and methods that would otherwise break with aggregated upstream inputs.

gTech Ads Data Decomposition Platform is an open-source Python package to decompose aggregated observations down to a statistically representative single observation level.

The package implements two methods: 1) the Bayesian Disaggregator and 2) the Neural Networks Disaggregator. The former decomposes continuous values only, while the latter handles both continuous and categorical values. Both methods share a similar API.

# Decomposition process with Disaggregator methods

First, the input data needs to be either a CSV file or a pandas DataFrame.

```
import pandas as pd

dataframe = pd.read_csv("path/to/your/file.csv")
```
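
Alternatively, if the data is already in memory, you can construct the DataFrame directly (reusing the `pandas` import above). The columns below mirror the Bayesian example that follows (Example 1).

```
dataframe = pd.DataFrame({
    "shared_id": [0, 0, 1, 1],
    "feature_X": [1, 2, 5, 4],
    "aggregated_feature_X": [3, 3, 9, 9],
    "aggregated_value_Y": [10.5, 10.5, 16.8, 16.8],
})
```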

## Decomposition with Bayesian Disaggregator method

### Numerical aggregated data

The default Bayesian Disaggregator is compatible with only one input feature. It requires the dataframe to have a column with a shared id (e.g. a date, represented as an integer), a single feature X (e.g. items in a basket), feature X aggregated by the shared id, and a numerical aggregated value Y to be decomposed later in the process.

In essence, the method estimates weights for distributing the aggregated value Y across the individual observations, based on the relationship it learns between feature X and aggregated feature X. It assumes the same weighting applies to both X and Y, and that each group always contains the same number of items.
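
For intuition only: a naive version of this weighting is a proportional split of Y along feature X, as sketched below using the `dataframe` built above. The actual `BayesianModel` estimates the weights probabilistically, so its outputs (see Example 2) differ from this simple ratio.

```
# Naive proportional split, NOT the package's Bayesian estimate:
# weight each row by its share of the aggregated feature.
weights = dataframe["feature_X"] / dataframe["aggregated_feature_X"]
dataframe["naive_y_split"] = weights * dataframe["aggregated_value_Y"]
# shared_id 0: 10.5 -> 3.5 and 7.0 (weights 1/3 and 2/3).
```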

*Example 1. Sample dataframe compatible with the Bayesian Disaggregator (numerical aggregated data).*

|**shared_id**| **feature_X** | **aggregated_feature_X**| **aggregated_value_Y** |
|:-----------------:|:-----------:|:--------:|:-------------------------:|
| 0                 | 1      | 3     | 10.5                       |
| 0                 | 2      | 3    | 10.5                      |
| 1                 | 5      | 9    | 16.8                      |
| 1                 | 4      | 9    | 16.8                      |

The next step is to map `shared_id`, `x_aggregates`, `x_values` and `y_aggregates` to the respective column names in the input dataframe.

```
# Import path assumed from the package name; adjust to your installation:
# from data_decomposition_platform import methods, preprocessors

column_name_mapping = preprocessors.BayesianColumnNameMapping(
    shared_id="shared_id",
    x_aggregates="aggregated_feature_X",
    x_values="feature_X",
    y_aggregates="aggregated_value_Y",
)
```
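
Before fitting, a quick pre-flight check in plain pandas (not part of the package API) can confirm that the mapped columns actually exist in the input dataframe:

```
expected = ["shared_id", "aggregated_feature_X", "feature_X", "aggregated_value_Y"]
missing = [column for column in expected if column not in dataframe.columns]
assert not missing, f"Input dataframe is missing columns: {missing}"
```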

Next, you initialize the Bayesian model that you wish to use in the decomposition process. You can replace the default model with a customized one as long as it’s compatible with the `BayesianDisaggregator` class.

```
bayesian_model = methods.BayesianModel()
```

The final steps are to initialize the `BayesianDisaggregator` class and decompose the input dataframe.

```
bayesian_disaggregator = methods.BayesianDisaggregator(model=bayesian_model)
disaggregated_dataframe = bayesian_disaggregator.fit_transform(
    dataframe=dataframe, column_name_mapping=column_name_mapping
)
```

*Example 2. Sample output from the Bayesian Disaggregator (numerical aggregated data).*

|**shared_id**| **x_aggregates** | **x_values**| **y_aggregates** | **disaggregated_y_values** |
|:-----------------:|:-----------:|:--------:|:-------------------------:|:--------:|
| 0                 | 3      | 1     | 10.5                       | 5.0      |
| 0                 | 3      | 2    | 10.5                      |5.5      |
| 1                 | 9      | 5    | 16.8                      |6.8      |
| 1                 | 9      | 4    | 16.8                      |10.0  |
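
A useful sanity check, assuming the output column names shown in Example 2, is that the disaggregated values within each `shared_id` sum back to the aggregate:

```
sums = disaggregated_dataframe.groupby("shared_id")["disaggregated_y_values"].sum()
print(sums)  # expect 10.5 for shared_id 0 (5.0 + 5.5) and 16.8 for shared_id 1 (6.8 + 10.0)
```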


## Decomposition with Neural Networks Disaggregator method

### Numerical aggregated data

The default Neural Networks Disaggregator is compatible with multiple input features.
It requires the dataframe to have a column with a shared id (e.g. a date, represented as an integer), one or more features (e.g. items in a basket, browser type, device type) and a numerical aggregated value Y to be decomposed later in the process.

*Example 3. Sample dataframe compatible with the Neural Networks Disaggregator (numerical aggregated data).*

|**shared_id**| **feature_1** | **feature_2**| **aggregated_value_Y** |
|:-----------------:|:-----------:|:--------:|:-------------------------:|
| 0                 | 1.5      | 1.3     | 10.5                       |
| 0                 | 3.4      | 5.6    | 10.5                      |
| 1                 | 10.1      | 0.0    | 16.8                      |
| 1                 | 4.5      | 9.9    | 16.8                      |

The next step is to map `shared_id`, `features` and `y_aggregates` to the respective column names in the input dataframe.

```
column_name_mapping = preprocessors.NeuralNetworksColumnNameMapping(
    shared_id="shared_id",
    features=["feature_1", "feature_2"],
    y_aggregates="aggregated_value_Y",
)
```

Next, you initialize the Neural Networks `regression` model that you wish to use in the decomposition process. You can replace the default model with a customized [TensorFlow Model](https://www.tensorflow.org/api_docs/python/tf/keras/Model).

```
task_type = "regression"
model = methods.NeuralNetworksModel(task_type=task_type)
```
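
As a sketch of such a replacement, a minimal custom Keras regression model might look as follows. Whether the disaggregator accepts an arbitrary `tf.keras.Model` in place of the default is an assumption based on the note above, so verify against the package documentation.

```
import tensorflow as tf

# Hypothetical custom regression model with a single numerical output.
custom_model = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1),
])
```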

The final steps are to initialize the `NeuralNetworksDisaggregator` class and decompose the input dataframe. The `compile_kwargs` and `fit_kwargs` arguments accept exactly the same kwargs as the `compile` and `fit` methods of a [TensorFlow Model](https://www.tensorflow.org/api_docs/python/tf/keras/Model). Make sure that you use a loss compatible with regression problems when decomposing numerical data.

```
neural_network_disaggregator = methods.NeuralNetworksDisaggregator(
    model=model,
    task_type=task_type,
)
disaggregated_dataframe = neural_network_disaggregator.fit_transform(
    dataframe=input_dataframe,
    column_name_mapping=column_name_mapping,
    compile_kwargs=dict(loss="mean_absolute_error", optimizer="adam"),
    fit_kwargs=dict(epochs=30, verbose=False),
)
```

*Example 4. Sample output from the Neural Networks Disaggregator (numerical aggregated data).*

|**shared_id**| **feature_1** | **feature_2**| **y_aggregates** | **disaggregated_y_values** |
|:-----------------:|:-----------:|:--------:|:-------------------------:|:-------------:|
| 0                 | 1.5      | 1.3     | 10.5                       | 5.0 |
| 0                 | 3.4      | 5.6    | 10.5                      | 5.5 |
| 1                 | 10.1      | 0.0    | 16.8                      | 6.8 |
| 1                 | 4.5      | 9.9    | 16.8                      | 10.0 |

### Categorical aggregated data

The decomposition process for categorical aggregated data with the Neural Networks Disaggregator is almost identical. At this point the Neural Networks Disaggregator can only handle binary problems: for example, given an aggregated number of conversions per day, the model can only assign 0 to users who did not convert and 1 to those who did.

*Example 5. Sample dataframe compatible with the Neural Networks Disaggregator (categorical aggregated data).*

|**shared_id**| **feature_1** | **feature_2**| **aggregated_value_Y** |
|:-----------------:|:-----------:|:--------:|:-------------------------:|
| 0                 | 1.5      | 1.3     | 2                       |
| 0                 | 3.4      | 5.6    | 2                      |
| 1                 | 10.1      | 0.0    | 1                      |
| 1                 | 4.5      | 9.9    | 1                      |

As before, the next step is to map `shared_id`, `features` and `y_aggregates` to the respective column names in the input dataframe.

```
column_name_mapping = preprocessors.NeuralNetworksColumnNameMapping(
    shared_id="shared_id",
    features=["feature_1", "feature_2"],
    y_aggregates="aggregated_value_Y",
)
```

Next, you initialize the Neural Networks `classification` model that you wish to use in the decomposition process.

```
task_type = "classification"
model = methods.NeuralNetworksModel(task_type=task_type)
```
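
If you swap in a custom model at this step, a single sigmoid output unit is the natural head for a binary problem. As with the regression sketch earlier, compatibility with a custom `tf.keras.Model` is an assumption to verify against the package documentation.

```
import tensorflow as tf

# Hypothetical custom binary classifier: one sigmoid output per observation.
custom_classifier = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
```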

The final steps are to initialize the `NeuralNetworksDisaggregator` class and decompose the input dataframe. Make sure that you use a loss compatible with binary classification problems when decomposing categorical data.

```
neural_network_disaggregator = methods.NeuralNetworksDisaggregator(
    model=model,
    task_type=task_type,
)
disaggregated_dataframe = neural_network_disaggregator.fit_transform(
    dataframe=input_dataframe,
    column_name_mapping=column_name_mapping,
    compile_kwargs=dict(loss="bce", optimizer="adam"),
    fit_kwargs=dict(epochs=30, verbose=False),
)
```

*Example 6. Sample output from the Neural Networks Disaggregator (categorical aggregated data).*

|**shared_id**| **feature_1** | **feature_2**| **y_aggregates** | **disaggregated_y_values** |
|:-----------------:|:-----------:|:--------:|:-------------------------:|:-------------:|
| 0                 | 1.5      | 1.3     | 2                       | 1 |
| 0                 | 3.4      | 5.6    | 2                      | 1 |
| 1                 | 10.1      | 0.0    | 1                      | 0 |
| 1                 | 4.5      | 9.9    | 1                      | 1 |
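
As in the numerical case, a quick check (assuming the column names in Example 6) is that the assigned 0/1 labels within each `shared_id` sum to the aggregated conversion count:

```
counts = disaggregated_dataframe.groupby("shared_id").agg(
    aggregate=("y_aggregates", "first"),
    assigned=("disaggregated_y_values", "sum"),
)
print(counts)  # shared_id 0: aggregate 2, assigned 2; shared_id 1: aggregate 1, assigned 1
```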
