# package `privateAB`: two-sample testing under local differential privacy
The package `privateAB` and the code in this repository implement the private testing methods introduced in the paper *Minimax Optimal Two-Sample Testing under Local Differential Privacy* by Jongmin Mun, Seungwoo Kwak, and Ilmun Kim.
The full paper can be accessed at: [https://arxiv.org/abs/2411.09064](https://arxiv.org/abs/2411.09064).
The code is written and tested in the following environment:
- **Operating System**: CentOS Linux 7 (Core)
- **CPE OS Name**: `cpe:/o:centos:centos:7`
- **Kernel**: `Linux 3.10.0-1127.19.1.el7.x86_64`
- **Architecture**: `x86-64`
- **Python Version**: 3.7.12
The code has been verified to work with the following package versions:
- `numpy==1.21.6`
- `pandas==1.3.5`
- `torch==1.7.1`
### Data Requirements
The input data consists of 2D PyTorch tensors, except for the Chi statistic, which requires 1D integer tensors. For multinomial data with many categories, for continuous data whose dimensionality (`d`) and bin number (`κ`) make `κ^d` large, or when the sample size is very large (e.g., `k = κ^d > 1000` or `n > 100,000`), we recommend using a GPU.
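The thresholds above can be wrapped in a small helper for deciding when to move to a GPU (an illustrative sketch; `needs_gpu` is not a package function):

```python
def needs_gpu(n, kappa, d=1):
    """Heuristic from the guidance above: recommend a GPU when the
    effective number of categories k = kappa**d exceeds 1000 or the
    sample size n exceeds 100,000."""
    k = kappa ** d
    return k > 1000 or n > 100_000
```

For example, with `kappa = 4` bins in `d = 6` dimensions, `k = 4096` categories already exceeds the threshold.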
### Conda Environment Setup
We recommend importing the conda environment from the following files:
- **For Linux**: `LDPUtsEnvK40.yaml`
- **For Windows**: `LDPUtsEnvK40_windows.yaml`
## Basic usage
Two main objects are utilized in this package: `client`, which implements the privacy mechanism, and `server`, which conducts the test.
### Installation
```
pip install privateAB
```
### Privatization of multinomial data
`client` takes raw data in the form of a PyTorch tensor and releases its locally differentially private representation.
In this example, we use the `data_generator` function from our paper, which internally utilizes the `torch.multinomial` function. Therefore, when using your own data, ensure it follows the same format as the output of `torch.multinomial`.
To get started, first import the necessary packages:
```
from privateAB.client import client
from privateAB.data_generator import data_generator
```
Now, using our `data_generator` function, we generate two independent datasets of multinomial samples.
```
import torch
#set probability vectors
sample_size = 1000
d = 4 #number of categories of the multinomial data
param_dist = 0.04
p = torch.ones(d).div(d)
p2 = p.add(
    torch.remainder(
        torch.tensor(range(d)),
        2
    ).add(-1/2).mul(2).mul(param_dist) #alternate -param_dist / +param_dist across categories
)
p1_idx = torch.cat( ( torch.arange(1, d), torch.tensor([0])), 0)
p1 = p2[p1_idx]
#create the data_generator instance
data_gen = data_generator()
# generate raw data
raw_data_1 = data_gen.generate_multinomial_data(p1, sample_size)
raw_data_2 = data_gen.generate_multinomial_data(p2, sample_size)
```
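To make the tensor arithmetic above concrete, the same alternating perturbation can be written in plain Python (a sketch mirroring the snippet, not package code):

```python
d = 4
param_dist = 0.04

# uniform base distribution over d categories
p = [1.0 / d] * d

# subtract param_dist from even-indexed entries and add it to odd-indexed
# ones, matching torch.remainder(...).add(-1/2).mul(2).mul(param_dist)
p2 = [p[i] + (1 if i % 2 == 1 else -1) * param_dist for i in range(d)]

# p1 is a cyclic shift of p2: same entries, different category labels
p1 = p2[1:] + p2[:1]
```

Both `p1` and `p2` remain valid probability vectors; they differ only in which categories carry the extra mass, which is exactly the kind of alternative the tests are designed to detect.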
Next, we create an instance of the `client` class and use its `release_private` method to privatize the raw data.
The `release_private` method requires the following five inputs:
1. **Privacy mechanism**: A string specifying the mechanism to use ('bitflip', 'genrr', 'lapu', or 'disclapu').
2. **Raw data**: A `torch.tensor` object representing the input data.
3. **Number of categories**: The number of categories in the multinomial data.
4. **Privacy parameter**: The parameter controlling the level of local differential privacy.
5. **Device**: The computational device to be used (`'cpu'` or `'cuda'`) as supported by `torch`.
```
LDPclient = client() #create the client, which privatizes the data
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu') #specify gpu or cpu
priv_mech = 'bitflip' #choose among 'bitflip', 'genrr', 'lapu', 'disclapu'; 'bitflip' corresponds to RAPPOR in the paper
private_data_1 = LDPclient.release_private(
priv_mech,
raw_data_1,
d,
0.9,
device
)
private_data_2 = LDPclient.release_private(
priv_mech,
raw_data_2,
d,
0.9,
device
)
```
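For intuition, the 'genrr' mechanism (generalized randomized response) can be sketched in a few lines; this illustrates the sampling distribution of the privatized output, not the package's internal implementation:

```python
import math
import random

def genrr_sample(x, k, eps, rng=random):
    """Generalized randomized response: report the true category x with
    probability e^eps / (e^eps + k - 1); otherwise report one of the
    remaining k - 1 categories uniformly at random."""
    p_keep = math.exp(eps) / (math.exp(eps) + k - 1)
    if rng.random() < p_keep:
        return x
    other = rng.randrange(k - 1)  # uniform over the k - 1 other categories
    return other if other < x else other + 1
```

With `eps = 0.9` and `k = 4` categories, the true answer is kept with probability `e^0.9 / (e^0.9 + 3) ≈ 0.45`; smaller `eps` pushes the output toward uniform noise.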
### Testing of multinomial data
The test is conducted using one of the following server instances: `server_multinomial_bitflip`, `server_ell2`, or `server_multinomial_genrr`. These correspond to the ProjChi, ell2, and Chi statistics discussed in the paper.
- The first two servers (`server_multinomial_bitflip` and `server_ell2`) can process privatized views generated using the 'bitflip', 'lapu', or 'disclapu' mechanisms.
- The `server_multinomial_genrr` instance, however, exclusively supports privatized views generated by the 'genrr' mechanism.
To proceed, we first create a server instance, which requires the privacy parameter as input. Next, we load the privatized data using the `load_private_data_multinomial` method. This method takes the following five inputs:
1. **First private data object**: The first dataset's privatized representation.
2. **Second private data object**: The second dataset's privatized representation (for A/B testing).
3. **Number of categories**: The number of categories in the multinomial data.
4. **Device for the first private data**: The `torch` device (CPU or GPU) used to process the first dataset.
5. **Device for the second private data**: The `torch` device used to process the second dataset.
We allow two separate devices to accommodate large-scale settings in which GPU memory is limited and the calculations must be performed separately for each of the two datasets. If memory is not a concern, you can use the same device for both.
```
from privateAB.server import server_multinomial_bitflip
server_private = server_multinomial_bitflip(0.9) #create an instance with the privacy parameter
server_private.load_private_data_multinomial(
private_data_1, private_data_2 ,
d,
device,
device
)
```
Now we run the test. Any of the server instances (`server_ell2`, `server_multinomial_bitflip`, or `server_multinomial_genrr`) can calculate the permutation p-value using the `release_p_value_permutation` method.
This method takes a single input:
- **Number of permutations**: The number of permutations to perform.
It returns two outputs:
1. **p-value**: The permutation p-value of the test.
2. **Test statistic value**: The calculated value of the test statistic.
```
n_permutation = 300 #number of permutations used to approximate the p-value
p_value, statistic = server_private.release_p_value_permutation(n_permutation)
```
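The permutation p-value above follows the standard construction: pool the two privatized samples, shuffle, recompute the statistic, and count how often the permuted statistic is at least as large as the observed one. Below is a minimal generic sketch with a simple mean-difference statistic (the package uses its own ProjChi/ell2/Chi statistics internally):

```python
import random

def permutation_p_value(sample_1, sample_2, statistic, n_permutation, rng=random):
    """Monte Carlo permutation p-value with the +1 correction, which
    keeps the test valid for any finite number of permutations."""
    observed = statistic(sample_1, sample_2)
    pooled = list(sample_1) + list(sample_2)
    n1 = len(sample_1)
    count = 0
    for _ in range(n_permutation):
        rng.shuffle(pooled)  # random relabeling of the pooled observations
        if statistic(pooled[:n1], pooled[n1:]) >= observed:
            count += 1
    return (1 + count) / (1 + n_permutation)

def mean_diff(a, b):
    # absolute difference of sample means, a simple two-sample statistic
    return abs(sum(a) / len(a) - sum(b) / len(b))
```

Under the null (identical distributions), all relabelings are equally likely, which is what makes the resulting p-value valid.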
`server_multinomial_bitflip` and `server_multinomial_genrr` can also compute the p-value based on the asymptotic chi-square null distribution using the `release_p_value` method.
This method does not require any input arguments. It returns two outputs:
1. **p-value**: The p-value based on the asymptotic chi-square null distribution.
2. **Test statistic value**: The calculated value of the test statistic.
```
p_value, statistic = server_private.release_p_value()
```
### Privatization of continuous data
As discussed in our paper, the privatization of continuous data uses a binning method. We support data in the form of a 2D PyTorch tensor of $d$-dimensional observations, where each coordinate falls within the interval $[0,1]$. If your data lies outside this range, first apply an appropriate transformation, such as the CDF transformation mentioned in our paper.
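For the CDF transformation mentioned above, the empirical CDF (rank) transform of each coordinate is a simple option; the helper below is an illustrative plain-Python sketch for a single coordinate, not a package function:

```python
def ecdf_transform(values):
    """Map each value to rank/n, its empirical CDF value, so the
    transformed coordinate lies in (0, 1]. Ties share the largest rank."""
    n = len(values)
    rank = {v: i + 1 for i, v in enumerate(sorted(values))}
    return [rank[v] / n for v in values]
```

Applying this coordinate-wise maps arbitrary real-valued data into $[0,1]^d$, after which the binning-based privatization below applies.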
For convenience, we use our `data_generator` function to create two sets of multivariate continuous data. This function ensures the generated data adheres to the required format and simplifies the process of preparing data for privatization.
```
import torch
from privateAB.data_generator import data_generator
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu') #specify gpu or cpu
d = 3
sample_size = 1000 #sample size for each of the two datasets
copula_mean_1 = -0.5 * torch.ones(d).to(device)
copula_mean_2 = -copula_mean_1
copula_sigma = (0.5 * torch.ones(d,d) + 0.5 * torch.eye(d)).to(device)
data_gen = data_generator()
raw_data_1 = data_gen.generate_copula_gaussian_data(sample_size, copula_mean_1, copula_sigma)
raw_data_2 = data_gen.generate_copula_gaussian_data(sample_size, copula_mean_2, copula_sigma)
```
Now we privatize the multivariate continuous data using the `release_private_conti` method. This method is similar to `release_private` but automatically detects the data's dimensionality. Instead of specifying the number of categories, you provide the number of bins for discretizing the data.
The `release_private_conti` method requires the following five inputs:
1. **Privacy mechanism**: A string specifying the mechanism to use ('bitflip', 'genrr', 'lapu', or 'disclapu').
2. **Raw data**: A `torch.tensor` object representing the input multivariate continuous data.
3. **Privacy parameter**: The parameter controlling the level of local differential privacy.
4. **Number of bins**: The number of bins used to discretize each dimension of the data.
5. **Device**: The computational device to be used (`'cpu'` or `'cuda'`) as supported by `torch`.
```
privacy_level=0.9
n_bin=4
data_y_priv = LDPclient.release_private_conti(
priv_mech,
data_gen.generate_copula_gaussian_data(sample_size, copula_mean_1, copula_sigma),
privacy_level,
n_bin,
device
)
data_z_priv = LDPclient.release_private_conti(
priv_mech,
data_gen.generate_copula_gaussian_data(sample_size, copula_mean_2, copula_sigma),
privacy_level,
n_bin,
device
)
```
### Testing of continuous data
After privatization, the data format aligns with that of multinomial data, allowing the same testing procedures to be applied.
One important note is that the **number of categories** should equal the bin number raised to the power of the data dimension. You don’t need to calculate this manually, as it is automatically stored in `LDPclient.alphabet_size_binned`. This ensures consistency and simplifies the setup for testing.
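For reference, the mapping from a binned point in $[0,1]^d$ to one of the `κ^d` categories can be viewed as a mixed-radix index. The sketch below illustrates the idea; the package's internal ordering of categories may differ:

```python
def bin_index(x, n_bin):
    """Map a point x in [0,1]^d to a single category index in
    {0, ..., n_bin**d - 1} by cutting each coordinate into n_bin
    equal-width bins and combining the bin labels as mixed-radix digits."""
    idx = 0
    for coord in x:
        b = min(int(coord * n_bin), n_bin - 1)  # coord == 1.0 falls in the last bin
        idx = idx * n_bin + b
    return idx
```

With `d = 3` and `n_bin = 4`, the indices range over `4**3 = 64` categories, matching `LDPclient.alphabet_size_binned = n_bin ** d`.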
## Reproducing Simulation Results
To replicate the simulation results in the paper, run the following Python files. Adjust the sample size, data dimension, and privacy parameters as specified in each file:
- **Figure 2**: `Figure2_type_I.py` or `Figure2_type_I.ipynb`
- **Figure 3**: `Figure3_multinomial.py` or `Figure3_multinomial.ipynb`
- **Figure 4**: `Figure4_density_location.py` or `Figure4_density_location.ipynb`
- **Figure 5**: `Figure5_rappor_elltwo_vs_projchi.py` or `Figure5_rappor_elltwo_vs_projchi.ipynb`
- **Figure 6**: `Figure6_density_scale.py` or `Figure6_density_scale.ipynb`