# `duckreg` : very fast out-of-memory regressions with `duckdb`
A Python package for running stratified/saturated regressions out-of-memory with `duckdb`. It wraps the `duckdb` package and provides a simple interface for regressions on datasets too large to fit in memory: the data are reduced to a set of summary statistics, and weighted least squares with frequency weights is then run on the compressed data. Robust standard errors are computed from sufficient statistics, while clustered standard errors are computed using the cluster bootstrap.
See examples in `notebooks/introduction.ipynb`.
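As a sketch of why compression loses nothing for least squares (the notation below is introduced here for illustration, and the exact statistics `duckreg` stores are an assumption): suppose the data collapse to strata $g = 1, \dots, G$, each with a distinct regressor vector $x\_g$, a count $n\_g$, and a mean outcome $\bar{y}\_g$. The frequency-weighted estimator on the compressed data equals OLS on the full data:

$$
\hat{\beta} = \Big( \sum\_{g} n\_g x\_g x\_g^\top \Big)^{-1} \sum\_{g} n\_g x\_g \bar{y}\_g
$$

The residual sum of squares in each stratum needs only the within-stratum sums of $y$ and $y^2$:

$$
\text{RSS}\_g = \sum\_{i \in g} y\_i^2 - 2 \, x\_g^\top \hat{\beta} \sum\_{i \in g} y\_i + n\_g (x\_g^\top \hat{\beta})^2
$$

so a heteroskedasticity-robust (HC0) variance can be assembled without revisiting individual rows:

$$
\hat{V} = B^{-1} \Big( \sum\_{g} x\_g x\_g^\top \, \text{RSS}\_g \Big) B^{-1}, \qquad B = \sum\_{g} n\_g x\_g x\_g^\top
$$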
<p align="center">
<img src="https://static.independent.co.uk/s3fs-public/thumbnails/image/2016/02/14/12/duck-rabbit.png" width="350">
</p>
- install
```
pip install duckreg
```
- dev install (preferably in a `venv`) with
```
(uv) pip install git+https://github.com/apoorvalal/duckreg.git
```
or git clone this repository and install in editable mode.
---
Currently supports the following regression specifications:
1. `DuckRegression`: general linear regression, which compresses the data to y averages stratified by all unique values of the x variables
2. `DuckMundlak`: One- or Two-Way Mundlak regression, which compresses the data to the following RHS and avoids the need to incorporate unit (and time) FEs
$$
y \sim 1, w, \bar{w}\_{i, .}, \bar{w}\_{., t}
$$
3. `DuckDoubleDemeaning`: Double demeaning regression, which compresses the data to y averages by all values of $w$ after demeaning. This also eliminates unit and time FEs
$$
y \sim (w\_{it} - \bar{w}\_{i, .} - \bar{w}\_{., t} + \bar{w}\_{., .})
$$
4. `DuckMundlakEventStudy`: Two-way Mundlak with dynamic treatment effects. This incorporates treatment-cohort FEs ($\psi\_i$), time-period FEs ($\gamma\_t$) and dynamic treatment effects $\tau\_k$ given by cohort × time interactions.
$$
y \sim \psi\_i + \gamma\_t + \sum\_{k=1}^{T} \tau\_{k} D\_i 1(t = k)
$$
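To illustrate why the one-way Mundlak specification can stand in for unit fixed effects, here is a minimal sketch in plain Python. It does not use `duckreg`; the toy panel and every name in it (`panel`, `unit_means`, `solve`) are hypothetical. It checks that the coefficient on $w$ in a regression of $y$ on $(1, w, \bar{w}\_{i, .})$ matches the within (fixed-effects) estimator exactly.

```python
from collections import defaultdict
from fractions import Fraction

# Hypothetical unbalanced panel of (unit, w, y) rows; not duckreg data.
panel = [("a", 0, 1), ("a", 1, 3), ("a", 1, 2),
         ("b", 0, 4), ("b", 1, 7),
         ("c", 1, 5), ("c", 1, 6), ("c", 0, 2)]


def unit_means(col):
    """Map each unit to the mean of column `col` (1 = w, 2 = y)."""
    sums = defaultdict(lambda: [Fraction(0), 0])
    for row in panel:
        sums[row[0]][0] += row[col]
        sums[row[0]][1] += 1
    return {unit: total / count for unit, (total, count) in sums.items()}


def solve(a, b):
    """Solve the square system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for i in range(n):
        # Swap in a row with a non-zero pivot, then normalise it.
        pivot = next(r for r in range(i, n) if m[r][i] != 0)
        m[i], m[pivot] = m[pivot], m[i]
        m[i] = [v / m[i][i] for v in m[i]]
        for r in range(n):
            if r != i:
                factor = m[r][i]
                m[r] = [vr - factor * vi for vr, vi in zip(m[r], m[i])]
    return [m[i][n] for i in range(n)]


w_bar, y_bar = unit_means(1), unit_means(2)

# Mundlak regression: y on (1, w, unit mean of w), via the normal equations.
X = [[Fraction(1), Fraction(w), w_bar[u]] for u, w, _ in panel]
y = [Fraction(yv) for _, _, yv in panel]
xtx = [[sum(r[i] * r[j] for r in X) for j in range(3)] for i in range(3)]
xty = [sum(r[i] * yv for r, yv in zip(X, y)) for i in range(3)]
beta = solve(xtx, xty)

# Within estimator: demean w and y by unit, then regress through the origin.
within = (sum((w - w_bar[u]) * (yv - y_bar[u]) for u, w, yv in panel)
          / sum((w - w_bar[u]) ** 2 for u, w, _ in panel))

print(beta[1], within)
```

Exact rational arithmetic is used only so that the two estimates can be compared for equality; with floating point they would agree up to rounding.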
All the above regressions are run in compressed fashion with `duckdb`.
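The compression idea behind `DuckRegression` can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the package's implementation: a dictionary stands in for the `duckdb` aggregation, the data are made up, and the names `rows`, `wls`, and `compressed` are hypothetical. The sketch collapses the data to one row per distinct value of $x$ (mean of $y$ plus a count) and checks that frequency-weighted least squares on the collapsed rows reproduces OLS on the full data.

```python
from collections import defaultdict
from fractions import Fraction

# Hypothetical toy data: (x, y) pairs with a discrete regressor x.
rows = [(0, 1), (0, 3), (1, 2), (1, 4), (1, 6), (2, 5), (2, 9)]


def wls(points, weights):
    """Weighted least squares of y on (1, x); returns (intercept, slope)."""
    n = Fraction(sum(weights))
    x_bar = sum(w * x for (x, _), w in zip(points, weights)) / n
    y_bar = sum(w * y for (_, y), w in zip(points, weights)) / n
    sxy = sum(w * (x - x_bar) * (y - y_bar)
              for (x, y), w in zip(points, weights))
    sxx = sum(w * (x - x_bar) ** 2 for (x, _), w in zip(points, weights))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


# Compress: one record per distinct x holding [count, sum of y].
groups = defaultdict(lambda: [0, 0])
for x, y in rows:
    groups[x][0] += 1
    groups[x][1] += y

compressed = [(x, Fraction(total, count))
              for x, (count, total) in sorted(groups.items())]
counts = [groups[x][0] for x, _ in compressed]

full_fit = wls(rows, [1] * len(rows))      # OLS on all 7 rows
compressed_fit = wls(compressed, counts)   # WLS on 3 rows, frequency weights

print(full_fit, compressed_fit)
```

The regression on three compressed rows returns the same intercept and slope as the regression on all seven original rows; the saving grows with the ratio of observations to distinct regressor values.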
---
references:
methods:
+ [Arkhangelsky and Imbens (2023)](https://arxiv.org/abs/1807.02099)
+ [Wooldridge (2021)](https://www.researchgate.net/publication/353938385_Two-Way_Fixed_Effects_the_Two-Way_Mundlak_Regression_and_Difference-in-Differences_Estimators)
+ [Wong et al. (2021)](https://arxiv.org/abs/2102.11297)
libraries:
+ [Grant McDermott's duckdb lecture](https://grantmcdermott.com/duckdb-polars/)