estndv


Name: estndv
Version: 0.0.3
Home page: https://github.com/wurenzhi/learned_ndv_estimator
Summary: Learned sample-based estimator for number of distinct values.
Upload time: 2024-09-07 20:43:06
Author: Renzhi Wu
Requires Python: >=3.6
Requirements: No requirements were recorded.

### Learned NDV estimator
A learned model that estimates the number of distinct values (NDV) of a population from a small sample. The model approximates the maximum likelihood estimate of the NDV, which is difficult to obtain analytically.
See our VLDB 2022 paper [Learning to be a Statistician: Learned Estimator for Number of Distinct Values](https://vldb.org/pvldb/vol15/p272-wu.pdf) for more details.
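
To make the problem concrete, the sketch below builds a synthetic population, draws a uniform sample from it, and compares the number of distinct values in each. It uses only the standard library and does not call `estndv`; the population, the sample size, and the seed are illustrative assumptions, not values from the paper.

```python
import random

random.seed(0)  # fixed seed so the illustration is reproducible

# Hypothetical population: 100,000 rows drawn from 5,000 possible values.
N = 100_000
population = [random.randrange(5_000) for _ in range(N)]

# Uniform sample without replacement, 1% of the population.
n = 1_000
sample = random.sample(population, n)

true_ndv = len(set(population))
sample_ndv = len(set(sample))

# The population contains every value seen in the sample, and it can add
# at most one new distinct value per unsampled row.
lower_bound = sample_ndv
upper_bound = sample_ndv + (N - n)

print(f"distinct values in sample:     {sample_ndv}")
print(f"distinct values in population: {true_ndv}")
print(f"trivial bounds on the NDV:     [{lower_bound}, {upper_bound}]")
```

The distinct count of the sample only gives a lower bound on the population NDV, and the trivial upper bound is usually far too loose to be useful. Choosing a value inside that range from the sample alone is the estimation problem the model addresses.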

### How to use
1. Install the package
   
    `pip install estndv`

2. Import and create an instance

```python
from estndv import ndvEstimator
estimator = ndvEstimator()
```

3. Assume your sample is S=[1,1,1,3,5,5,12] and the population size is N=100000. You can estimate the population NDV with:

    `ndv = estimator.sample_predict(S=[1,1,1,3,5,5,12], N=100000)`
   
4. If you already have the sample profile, e.g. f=[2,1,1], you can estimate the population NDV with:
   
    `ndv = estimator.profile_predict(f=[2,1,1], N=100000)`

5. If you have multiple samples or profiles from multiple populations, you can estimate the NDV of all of them in one batch with `estimator.sample_predict_batch()` or `estimator.profile_predict_batch()`.
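
The sample S=[1,1,1,3,5,5,12] and the profile f=[2,1,1] in the steps above appear to describe the same data. The sketch below assumes the profile is the frequency-of-frequencies vector, i.e. the j-th entry counts the distinct values that occur exactly j times in the sample, which is consistent with that example. The helper `sample_profile` is a hypothetical name for illustration and is not part of the `estndv` API.

```python
from collections import Counter


def sample_profile(sample):
    """Return f where f[j-1] is the number of distinct values seen exactly j times."""
    value_counts = Counter(sample)                  # value -> number of occurrences
    freq_of_freq = Counter(value_counts.values())   # occurrences -> number of values
    max_count = max(freq_of_freq, default=0)
    # Fill in zeros for occurrence counts that no value has.
    return [freq_of_freq.get(j, 0) for j in range(1, max_count + 1)]


S = [1, 1, 1, 3, 5, 5, 12]
f = sample_profile(S)
print(f)  # [2, 1, 1]: two values occur once, one occurs twice, one occurs three times

# Sanity checks that hold for any profile built this way.
assert sum(f) == len(set(S))                                  # distinct values in the sample
assert sum(j * fj for j, fj in enumerate(f, start=1)) == len(S)  # sample size
```

Working with the profile instead of the raw sample can be convenient when the frequency counts are already available, for example from a `GROUP BY` over the sampled rows.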



### How to train the NDV estimator
You can use our package from PyPI directly on your datasets, because the pre-trained model is workload-agnostic. If you still want to train the model from scratch, do the following:
1. Go to the model_training folder

    `cd model_training`

2. Install requirements
   
    `pip install -r requirements.txt`
   
3. Generate training data. (This uses a lot of memory.)
   
    `python training_data_generation.py`
   
4. Train model
   
    `python model_training.py`

5. Save the trained PyTorch model parameters to NumPy format. This generates a file named model_paras.npy.

    `python torch2npy.py`

6. Test with your model parameters by specifying the path to your model_paras.npy

   `estimator = ndvEstimator(para_path="<path to model_paras.npy>")`

### Citation
If you use our work or find it useful, please cite our paper:
```
@article{wu2022learning,
   author = {Wu, Renzhi and Ding, Bolin and Chu, Xu and Wei, Zhewei and Dai, Xiening and Guan, Tao and Zhou, Jingren},
   title = {Learning to Be a Statistician: Learned Estimator for Number of Distinct Values},
   year = {2021},
   issue_date = {October 2021},
   publisher = {VLDB Endowment},
   volume = {15},
   number = {2},
   issn = {2150-8097},
   url = {https://doi.org/10.14778/3489496.3489508},
   doi = {10.14778/3489496.3489508},
   journal = {Proc. VLDB Endow.},
   month = {oct},
   pages = {272–284},
   numpages = {13}
}
```

            

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/wurenzhi/learned_ndv_estimator",
    "name": "estndv",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.6",
    "maintainer_email": null,
    "keywords": null,
    "author": "Renzhi Wu",
    "author_email": "renzhiwu@gatech.edu",
    "download_url": "https://files.pythonhosted.org/packages/6b/dc/e6cf8628c0bba89a2dc7887dccf27cf6ae8a0cb0217905b7671cdfb8f152/estndv-0.0.3.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": null,
    "summary": "Learned sample-based estimator for number of distinct values.",
    "version": "0.0.3",
    "project_urls": {
        "Bug Tracker": "https://github.com/wurenzhi/learned_ndv_estimator/issues",
        "Homepage": "https://github.com/wurenzhi/learned_ndv_estimator"
    },
    "split_keywords": [],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "de9f2a0599b9a085eb0f527993a504037931754f320e62b60446670c60769b46",
                "md5": "8660e691888aafda1dd7f7a01f110ba7",
                "sha256": "44f00b89844b55a03eaff6eeac04a38812e71bc76ff041d5a55964c174d2b2d6"
            },
            "downloads": -1,
            "filename": "estndv-0.0.3-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "8660e691888aafda1dd7f7a01f110ba7",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.6",
            "size": 187974,
            "upload_time": "2024-09-07T20:43:04",
            "upload_time_iso_8601": "2024-09-07T20:43:04.234891Z",
            "url": "https://files.pythonhosted.org/packages/de/9f/2a0599b9a085eb0f527993a504037931754f320e62b60446670c60769b46/estndv-0.0.3-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "6bdce6cf8628c0bba89a2dc7887dccf27cf6ae8a0cb0217905b7671cdfb8f152",
                "md5": "8845d56764c313c5607e8d34ea45997e",
                "sha256": "72305925ac8516f227971bd312760ce84de9a2853ccc6a89b727cd2e28fed0c8"
            },
            "downloads": -1,
            "filename": "estndv-0.0.3.tar.gz",
            "has_sig": false,
            "md5_digest": "8845d56764c313c5607e8d34ea45997e",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.6",
            "size": 188649,
            "upload_time": "2024-09-07T20:43:06",
            "upload_time_iso_8601": "2024-09-07T20:43:06.423173Z",
            "url": "https://files.pythonhosted.org/packages/6b/dc/e6cf8628c0bba89a2dc7887dccf27cf6ae8a0cb0217905b7671cdfb8f152/estndv-0.0.3.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-09-07 20:43:06",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "wurenzhi",
    "github_project": "learned_ndv_estimator",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "requirements": [],
    "lcname": "estndv"
}
        