### Learned NDV estimator
A learned model that estimates the number of distinct values (NDV) of a population from a small sample. The model approximates the maximum likelihood estimate of NDV, which is difficult to derive analytically.
See our VLDB 2022 paper [Learning to be a Statistician: Learned Estimator for Number of Distinct Values](https://vldb.org/pvldb/vol15/p272-wu.pdf) for more details.
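
To illustrate why an estimator is needed at all, here is a minimal sketch (not part of the package) that draws a small sample from a synthetic population and compares the distinct count seen in the sample with the true one. The population, its value range, and the sample size are all assumptions chosen for illustration.

```python
import random

random.seed(0)  # fixed seed so the illustration is repeatable

# Hypothetical population: 100,000 rows drawn from 5,000 possible values.
population = [random.randint(1, 5000) for _ in range(100_000)]

# A 1% uniform sample without replacement.
sample = random.sample(population, 1000)

true_ndv = len(set(population))   # what we want to know
sample_ndv = len(set(sample))     # what we can observe directly

# The sample can never contain more distinct values than the population,
# and it usually contains far fewer, so it must be scaled up somehow.
print(f"distinct values in sample: {sample_ndv}")
print(f"distinct values in population: {true_ndv}")
```

The distinct count of the sample is only a lower bound on the population NDV, which is the gap the learned estimator is meant to close.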
### How to use
1. Install the package
`pip install estndv`
2. Import and create an instance
```python
from estndv import ndvEstimator
estimator = ndvEstimator()
```
3. Assume your sample is S=[1,1,1,3,5,5,12] and the population size is N=100000. You can estimate the population NDV with:
`ndv = estimator.sample_predict(S=[1,1,1,3,5,5,12], N=100000)`
4. If you already have the sample profile, e.g. f=[2,1,1] (f[j-1] is the number of distinct values that appear exactly j times in the sample), you can estimate the population NDV with:
`ndv = estimator.profile_predict(f=[2,1,1], N=100000)`
5. If you have multiple samples or profiles from multiple populations, you can estimate the population NDV for all of them in one batch with `estimator.sample_predict_batch()` or `estimator.profile_predict_batch()`.
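
The profile f=[2,1,1] passed to `profile_predict` corresponds to the sample S=[1,1,1,3,5,5,12] used with `sample_predict`. As a minimal sketch, assuming the profile is the frequency-of-frequencies vector of the sample, it can be computed with the standard library alone. The helper name `sample_profile` is hypothetical and not part of `estndv`.

```python
from collections import Counter


def sample_profile(sample):
    """Return f where f[j-1] is the number of distinct values seen exactly j times.

    Hypothetical helper for illustration; it is not provided by estndv.
    """
    value_counts = Counter(sample)                  # value -> occurrences in the sample
    freq_of_freq = Counter(value_counts.values())   # occurrences -> number of such values
    max_count = max(freq_of_freq, default=0)        # 0 for an empty sample
    return [freq_of_freq.get(j, 0) for j in range(1, max_count + 1)]


S = [1, 1, 1, 3, 5, 5, 12]
f = sample_profile(S)
print(f)  # [2, 1, 1]: two values seen once (3, 12), one seen twice (5), one seen three times (1)
```

Working from the profile can be convenient when the raw sample is large or when value frequencies are already available from another source.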
### How to train the ndv estimator
You can use our package from PyPI directly on your datasets, because the pre-trained model is workload-agnostic. If you still want to train the model from scratch, do the following:
1. Go to the model_training folder
`cd model_training`
2. Install requirements
`pip install -r requirements.txt`
3. Generate training data. (This uses a lot of memory.)
`python training_data_generation.py`
4. Train model
`python model_training.py`
5. Save the trained PyTorch model parameters in NumPy format. This generates the file model_paras.npy.
`python torch2npy.py`
6. Test with your own model parameters by passing the path to your model_paras.npy:
`estimator = ndvEstimator(para_path="path/to/model_paras.npy")`
### Citation
If you use our work or find it useful, please cite our paper:
```
@article{wu2022learning,
author = {Wu, Renzhi and Ding, Bolin and Chu, Xu and Wei, Zhewei and Dai, Xiening and Guan, Tao and Zhou, Jingren},
title = {Learning to Be a Statistician: Learned Estimator for Number of Distinct Values},
year = {2021},
issue_date = {October 2021},
publisher = {VLDB Endowment},
volume = {15},
number = {2},
issn = {2150-8097},
url = {https://doi.org/10.14778/3489496.3489508},
doi = {10.14778/3489496.3489508},
journal = {Proc. VLDB Endow.},
month = {oct},
pages = {272–284},
numpages = {13}
}
```
Raw data
{
"_id": null,
"home_page": "https://github.com/wurenzhi/learned_ndv_estimator",
"name": "estndv",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.6",
"maintainer_email": null,
"keywords": null,
"author": "Renzhi Wu",
"author_email": "renzhiwu@gatech.edu",
"download_url": "https://files.pythonhosted.org/packages/6b/dc/e6cf8628c0bba89a2dc7887dccf27cf6ae8a0cb0217905b7671cdfb8f152/estndv-0.0.3.tar.gz",
"platform": null,
"description": "### Learned NDV estimator\nLearned model to estimate number of distinct values (NDV) of a population using a small sample. The model approximates the maximum likelihood estimation of NDV, which is difficult to obtain analytically.\nSee our VLDB 2022 paper [Learning to be a Statistician: Learned Estimator for Number of Distinct Values](https://vldb.org/pvldb/vol15/p272-wu.pdf) for more details.\n\n### How to use\n1. Install the package\n \n `pip install estndv`\n\n2. Import and create an instance\n\n```python\n from estndv import ndvEstimator\n estimator = ndvEstimator()\n```\n\n4. Assume your sample is S=[1,1,1,3,5,5,12] and the population size is N=100000. You can estimate population ndv by:\n\n `ndv = estimator.sample_predict(S=[1,1,1,3,5,5,12], N=100000)`\n \n5. If you have the sample profile e.g. f=[2,1,1], you can estimate population NDV by:\n \n `ndv = estimator.profile_predict(f=[2,1,1], N=100000)`\n\n6. If you have multiple samples/profiles from multiple populations, you can estimate population NDV for all of them in a batch by method `estimator.sample_predict_batch()` or `estimator.profile_predict_batch()`.\n\n\n\n### How to train the ndv estimator\nYou can directly use our package on PyPI for your datasets, as the pre-trained model is agnostic to any workloads. However, if you want to train the model from scratch anyway, do the following:\n1. Go to the model_training folder\n `cd model_training`\n\n2. Install requirements\n \n `pip install requirements.txt`\n \n3. Generate training data. (This uses a lot of memory.)\n \n `python training_data_generation.py`\n \n4. Train model\n \n `python model_training.py`\n5. Save trained pytorch model parameters to numpy, this generates a file model_paras.npy\n\n `python torch2npy.py`\n\n6. 
Test with your model parameters by specifying a path to your model_paras.npy\n\n `estimator = ndvEstimator(para_path=your path to model_paras.npy)`\n\n### Citation\nIf you use our work or found it useful, please cite our paper:\n```\n@article{wu2022learning,\n author = {Wu, Renzhi and Ding, Bolin and Chu, Xu and Wei, Zhewei and Dai, Xiening and Guan, Tao and Zhou, Jingren},\n title = {Learning to Be a Statistician: Learned Estimator for Number of Distinct Values},\n year = {2021},\n issue_date = {October 2021},\n publisher = {VLDB Endowment},\n volume = {15},\n number = {2},\n issn = {2150-8097},\n url = {https://doi.org/10.14778/3489496.3489508},\n doi = {10.14778/3489496.3489508},\n journal = {Proc. VLDB Endow.},\n month = {oct},\n pages = {272\u2013284},\n numpages = {13}\n}\n```\n",
"bugtrack_url": null,
"license": null,
"summary": "Learned sample-based estimator for number of distinct values.",
"version": "0.0.3",
"project_urls": {
"Bug Tracker": "https://github.com/wurenzhi/learned_ndv_estimator/issues",
"Homepage": "https://github.com/wurenzhi/learned_ndv_estimator"
},
"split_keywords": [],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "de9f2a0599b9a085eb0f527993a504037931754f320e62b60446670c60769b46",
"md5": "8660e691888aafda1dd7f7a01f110ba7",
"sha256": "44f00b89844b55a03eaff6eeac04a38812e71bc76ff041d5a55964c174d2b2d6"
},
"downloads": -1,
"filename": "estndv-0.0.3-py3-none-any.whl",
"has_sig": false,
"md5_digest": "8660e691888aafda1dd7f7a01f110ba7",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.6",
"size": 187974,
"upload_time": "2024-09-07T20:43:04",
"upload_time_iso_8601": "2024-09-07T20:43:04.234891Z",
"url": "https://files.pythonhosted.org/packages/de/9f/2a0599b9a085eb0f527993a504037931754f320e62b60446670c60769b46/estndv-0.0.3-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "6bdce6cf8628c0bba89a2dc7887dccf27cf6ae8a0cb0217905b7671cdfb8f152",
"md5": "8845d56764c313c5607e8d34ea45997e",
"sha256": "72305925ac8516f227971bd312760ce84de9a2853ccc6a89b727cd2e28fed0c8"
},
"downloads": -1,
"filename": "estndv-0.0.3.tar.gz",
"has_sig": false,
"md5_digest": "8845d56764c313c5607e8d34ea45997e",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.6",
"size": 188649,
"upload_time": "2024-09-07T20:43:06",
"upload_time_iso_8601": "2024-09-07T20:43:06.423173Z",
"url": "https://files.pythonhosted.org/packages/6b/dc/e6cf8628c0bba89a2dc7887dccf27cf6ae8a0cb0217905b7671cdfb8f152/estndv-0.0.3.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-09-07 20:43:06",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "wurenzhi",
"github_project": "learned_ndv_estimator",
"travis_ci": false,
"coveralls": false,
"github_actions": false,
"requirements": [],
"lcname": "estndv"
}