Name | causalem |
Version | 0.6.1 |
home_page | None |
Summary | Causal Inference using Ensemble Matching |
upload_time | 2025-08-08 20:48:39 |
maintainer | None |
docs_url | None |
author | None |
requires_python | >=3.9 |
license | MIT License
Copyright (c) 2025-Present Alireza S. Mahani, Mansour T.A. Sharabiani
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
|
keywords | causal-inference, matching, ensemble learning |
requirements | No requirements were recorded. |
# CausalEM – Ensemble Matching for Causal Inference
> **CausalEM** is a toolbox for multi-arm treatment‑effect estimation using stochastic matching and a stacked ensemble of heterogeneous ML models. It supports continuous, binary, and survival outcomes.
---
## Key Features
1. **Stochastic nearest-neighbor (NN) matching** -> Larger effective sample size (ESS) and improved TE estimation accuracy compared to standard (deterministic) NN matching.
1. **G-computation using a two-stage, stacked ensemble of heterogeneous learners** -> Generalization of the standard G-computation framework to ensemble learning; cross-fitting of propensity-score and outcome models, similar to DoubleML.
1. **Support for multi-arm treatments** -> Improved multi-arm ESS via stochastic matching.
1. **Support for survival outcomes** -> Use of data simulation from survival outcome models to implement stacked-ensemble for TE estimation in right-censored, time-to-event data.
1. **Bootstrapped confidence interval (CI) estimation** -> Honest estimation of CI by including entire (matching + TE estimation) pipeline in bootstrap loop.
1. **Compatible with `scikit-learn`** -> Maximum flexibility in using ML models by providing access to `scikit-learn` (and `scikit-survival` for survival) for propensity-score, outcome and meta-learner stages.
1. **Full reproducibility of results** -> Careful implementation of random number generation (RNG) seeding, including in `scikit-learn` models.
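
The list above does not state the sampling rule behind stochastic matching. As rough intuition only, the sketch below assumes that each treated unit picks a not-yet-used control with probability proportional to `exp(-distance / scale)`, so that a very small `scale` approaches deterministic nearest-neighbor matching. The function name `toy_stochastic_match` and this weighting are illustrative assumptions, not CausalEM's implementation.

```python
import math
import random


def toy_stochastic_match(treated, controls, scale=1.0, seed=0):
    """Match each treated score to one unused control score (1:1, no replacement).

    Hypothetical rule: sample a control with weight exp(-distance / scale).
    """
    rng = random.Random(seed)
    available = list(range(len(controls)))
    pairs = []
    for i, s in enumerate(treated):
        if not available:
            break  # ran out of controls
        dists = [abs(s - controls[j]) for j in available]
        dmin = min(dists)
        # Subtract the minimum distance for numerical stability.
        weights = [math.exp(-(d - dmin) / scale) for d in dists]
        j = rng.choices(available, weights=weights, k=1)[0]
        pairs.append((i, j))
        available.remove(j)
    return pairs


treated = [0.12, 0.5, 0.9]
controls = [0.0, 0.2, 0.45, 0.6, 1.0]

# A tiny scale makes the nearest control overwhelmingly likely.
print(toy_stochastic_match(treated, controls, scale=1e-3))
# Larger scales can produce different matched sets across seeds.
for seed in range(3):
    print(toy_stochastic_match(treated, controls, scale=1.0, seed=seed))
```

Because different draws can select different controls, pooling several draws can use more of the control group than a single deterministic match, which is one way to read the ESS claim in the first feature.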
<!-- 1. **Available in Python and R** -> Identical - function-centric - API in both languages using `reticulate`; combined with RNG management, leads to identical, reproducible results across the two platforms. -->
---
## API
| Function | Brief description |
| ------------------------ | --------------------------------------------------------- |
| `estimate_te` | Main pipeline – ensemble matching + meta‑learner |
| `StochasticMatcher` | 1:1 nearest‑neighbor matcher (deterministic ↔ stochastic) |
| `summarize_matching` | Diagnostics: ESS, ASMD, variance ratios, overlap plots |
| `load_data_lalonde` | Copy of Lalonde job‑training dataset |
| `load_data_tof` | Simulated TOF dataset (survival or binary outcome) |
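
The diagnostics named in the table have common textbook definitions. The sketch below assumes Kish's formula for ESS and a pooled-standard-deviation ASMD; `summarize_matching` may use different conventions, and the helper names `kish_ess` and `asmd` are hypothetical.

```python
import math
import statistics


def kish_ess(weights):
    """Kish effective sample size: (sum w)^2 / sum(w^2)."""
    total = sum(weights)
    return total * total / sum(w * w for w in weights)


def asmd(x_treated, x_control):
    """Absolute standardized mean difference with a pooled standard deviation."""
    diff = statistics.fmean(x_treated) - statistics.fmean(x_control)
    pooled_var = (statistics.variance(x_treated) + statistics.variance(x_control)) / 2
    return abs(diff) / math.sqrt(pooled_var)


# Equal weights give an ESS equal to the number of units.
print(kish_ess([1, 1, 1, 1]))  # 4.0
# Unequal weights (e.g. a unit reused more often than others) shrink the ESS.
print(kish_ess([3, 1, 1, 1]))  # 3.0

# Means differ by 1 and both sample variances are 1, so the ASMD is 1.
print(asmd([1.0, 2.0, 3.0], [2.0, 3.0, 4.0]))  # 1.0
```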
---
## ⚙️ Installation <!--- install -->
```bash
pip install causalem
```
Optional dev extras:
```bash
pip install "causalem[dev]"
```
Minimum Python 3.9. Tested on macOS and Windows.
---
## Package Vignette
For a more detailed introduction to `CausalEM`, including the underlying math, see the _package vignette_ [insert link later], available on arXiv.
---
## 🚀 Quick Start <!--- quickstart -->
### Two-arm Analysis
Load the necessary packages:
```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

from causalem import (
    estimate_te,
    load_data_tof,
    stochastic_match,
    summarize_matching,
)
```
Load the ToF data with two treatment levels and a binarized outcome:
```python
X, t, y = load_data_tof(
    raw=False,
    treat_levels=["PrP", "SPS"],
    binarize_outcome=True,
)
```
Stochastic matching using propensity scores:
```python
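# Fit a propensity model and convert the predicted probabilities to logits.
lr = LogisticRegression(solver="newton-cg", max_iter=1000)
lr.fit(X, t)
score = lr.predict_proba(X)[:, 1]  # probability of the second class in lr.classes_
logit_score = np.log(score / (1 - score))

# Stochastic matching on the logit scale.
cluster = stochastic_match(
    treatment=t,
    score=logit_score,
    nsmp=10,
    scale=1.0,
    random_state=0,
)

# Matching diagnostics (no plots).
diag = summarize_matching(
    cluster, X,
    treatment=t, plot=False
)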
print("Combined Effective Sample Size (ESS):", diag.ess["combined"])
print("Absolute standardized mean difference (ASMD) by covariate:\n")
print(diag.summary)
```
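
The logit transform in the block above becomes infinite when a fitted propensity is exactly 0 or 1. A minimal guard, assuming a small clipping constant (the name `safe_logit` and the default `eps` are hypothetical, not part of CausalEM):

```python
import math


def safe_logit(p, eps=1e-6):
    """Logit of a probability, clipped to [eps, 1 - eps] to avoid infinities."""
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))


print(safe_logit(0.5))  # 0.0
print(safe_logit(1.0))  # large but finite
print(safe_logit(0.0))  # large negative but finite
```

With NumPy arrays the same idea is to apply `np.clip(score, eps, 1 - eps)` before taking the log.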
TE estimation:
```python
res = estimate_te(
    X,
    t,
    y,
    outcome_type="binary",
    niter=5,
    matching_scale=1.0,
    matching_is_stochastic=True,
    random_state_master=1,
)
print("Two-arm TE:", res["te"])
```
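
Key Features describes TE estimation as a form of G-computation. The toy example below assumes a single discrete covariate and uses stratum means as the outcome model; it shows only the "predict under each arm, then average" step, not matching or stacking. The function name and data are hypothetical.

```python
from collections import defaultdict
from statistics import fmean


def g_computation_ate(x, t, y):
    """Average treatment effect via G-computation with a stratum-mean outcome model."""
    # Outcome model: mean of y within each (covariate stratum, treatment arm) cell.
    cells = defaultdict(list)
    for xi, ti, yi in zip(x, t, y):
        cells[(xi, ti)].append(yi)
    model = {cell: fmean(vals) for cell, vals in cells.items()}

    # Predict both potential outcomes for every unit, then average the difference.
    # This requires every stratum to contain both arms (positivity).
    diffs = [model[(xi, 1)] - model[(xi, 0)] for xi in x]
    return fmean(diffs)


# Toy data: stratum "a" has effect 0.5, stratum "b" has effect 0.0.
x = ["a", "a", "a", "a", "b", "b", "b", "b"]
t = [1, 1, 0, 0, 1, 0, 0, 0]
y = [1, 1, 1, 0, 1, 1, 1, 1]
print(g_computation_ate(x, t, y))  # 0.25
```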
### Multi-arm Analysis
Load data for multi-arm analysis:
```python
df = load_data_tof(
    raw=True,
    binarize_outcome=True,
)
t_all = df["treatment"].to_numpy()
X_all = df[["age", "zscore"]].to_numpy()
y_all = df["outcome"].to_numpy()
```
Constructing propensity scores using multinomial logistic regression:
```python
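# With three or more arms, the default lbfgs solver fits a multinomial model;
# the explicit multi_class argument is deprecated in recent scikit-learn releases.
lr_multi = LogisticRegression(max_iter=1000)
lr_multi.fit(X_all, t_all)
proba = lr_multi.predict_proba(X_all)
ref = "PrP"
# Keep the probability columns of the non-reference arms only.
cols = [i for i, c in enumerate(lr_multi.classes_) if c != ref]
logit_multi = np.log(proba[:, cols] / (1 - proba[:, cols]))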
```
Multi-arm stochastic matching:
```python
cluster_multi = stochastic_match(
    treatment=t_all,
    score=logit_multi,
    nsmp=5,
    scale=1.0,
    ref_group=ref,
    random_state=0,
)
diag_multi = summarize_matching(
    cluster_multi, X_all, treatment=t_all, ref_group=ref, plot=False
)
print("Multi-arm ESS per draw:\n", diag_multi.ess["per_draw"])
```
Multi-arm TE estimation:
```python
res_multi = estimate_te(
    X_all,
    t_all,
    y_all,
    outcome_type="binary",
    ref_group=ref,
    niter=5,
    matching_scale=1.0,
    matching_is_stochastic=True,
    random_state_master=1,
)
print("Multi-arm pairwise effects:\n", res_multi["pairwise"])
```
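
The structure of `res_multi["pairwise"]` is not documented here. As a conceptual sketch only, pairwise effects can be thought of as contrasts between per-arm mean potential outcomes; the arm labels, values, and the helper `pairwise_differences` below are made up for illustration.

```python
from itertools import combinations


def pairwise_differences(arm_means):
    """Return {(arm_a, arm_b): mean_b - mean_a} for every unordered pair of arms."""
    return {
        (a, b): arm_means[b] - arm_means[a]
        for a, b in combinations(arm_means, 2)
    }


# Illustrative (made-up) mean potential outcomes for three arms.
arm_means = {"ref": 0.25, "arm1": 0.5, "arm2": 0.75}
for pair, diff in pairwise_differences(arm_means).items():
    print(pair, diff)
```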
### Confidence-Interval Calculation
Adding bootstrap CI to the two-arm analysis:
```python
res_boot = estimate_te(
    X,
    t,
    y,
    outcome_type="binary",
    niter=5,
    nboot=200,
    matching_scale=1.0,
    matching_is_stochastic=True,
    random_state_master=1,
    random_state_boot=7,
)
print("Bootstrap CI:", res_boot["ci"])
```
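
Key Features notes that the whole pipeline sits inside the bootstrap loop. The sketch below illustrates that idea with a percentile bootstrap, assuming a stand-in "pipeline" that is just a difference in means; `bootstrap_ci` and `mean_difference` are hypothetical helpers, and CausalEM's interval construction may differ.

```python
import random
from statistics import fmean


def bootstrap_ci(rows, estimator, nboot=200, alpha=0.05, seed=7):
    """Percentile bootstrap CI; `estimator` reruns the whole pipeline on each resample."""
    rng = random.Random(seed)
    n = len(rows)
    stats = []
    for _ in range(nboot):
        resample = [rows[rng.randrange(n)] for _ in range(n)]
        stats.append(estimator(resample))
    stats.sort()
    lo = stats[int((alpha / 2) * (nboot - 1))]
    hi = stats[int((1 - alpha / 2) * (nboot - 1))]
    return lo, hi


def mean_difference(rows):
    """Stand-in pipeline: difference in mean outcome between arms (rows are (t, y))."""
    treated = [y for t, y in rows if t == 1]
    control = [y for t, y in rows if t == 0]
    return fmean(treated) - fmean(control)


# Simulated data: 50 treated units centered at 1.0, 50 controls centered at 0.0.
data_rng = random.Random(0)
rows = [(1, data_rng.gauss(1.0, 1.0)) for _ in range(50)]
rows += [(0, data_rng.gauss(0.0, 1.0)) for _ in range(50)]
print("Point estimate:", mean_difference(rows))
print("95% CI:", bootstrap_ci(rows, mean_difference))
```

Passing the estimator as a function keeps every data-dependent step (here trivial, in practice matching plus model fitting) inside each resample, which is what makes the interval reflect the variability of the full procedure.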
### Heterogeneous Ensemble
```python
learners = [
    LogisticRegression(max_iter=1000),
    RandomForestClassifier(n_estimators=200, max_depth=3),
]
res_ensemble = estimate_te(
    X,
    t,
    y,
    outcome_type="binary",
    model_outcome=learners,
    niter=len(learners),
    do_stacking=True,
    matching_scale=1.0,
    matching_is_stochastic=True,
    random_state_master=42,
)
print("Ensemble TE:", res_ensemble["te"])
```
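
Stacking combines base-learner predictions through a meta-learner. The sketch below assumes two sets of out-of-fold predicted probabilities and a deliberately simple meta-learner: a single convex weight chosen by grid search on the Brier score. This is a conceptual illustration with made-up numbers, not the meta-learner CausalEM uses.

```python
def brier(preds, y):
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    return sum((p - yi) ** 2 for p, yi in zip(preds, y)) / len(y)


def fit_convex_stack(pred_a, pred_b, y, steps=100):
    """Pick w in [0, 1] minimizing the Brier score of w * pred_a + (1 - w) * pred_b."""
    best_w, best_loss = 0.0, float("inf")
    for k in range(steps + 1):
        w = k / steps
        blended = [w * a + (1 - w) * b for a, b in zip(pred_a, pred_b)]
        loss = brier(blended, y)
        if loss < best_loss:
            best_w, best_loss = w, loss
    return best_w, best_loss


# Made-up out-of-fold predictions from two hypothetical base learners.
y = [1, 0, 1, 0]
pred_a = [0.9, 0.4, 0.6, 0.1]  # weaker on the second and third units
pred_b = [0.6, 0.1, 0.9, 0.4]  # weaker on the first and fourth units
w, loss = fit_convex_stack(pred_a, pred_b, y)
print("weight on learner A:", w)
print("stacked Brier:", loss, "vs", brier(pred_a, y), "and", brier(pred_b, y))
```

In this toy case the two learners make complementary errors, so the blended prediction scores better than either one alone.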
### TE Estimation for Survival Outcomes
```python
X_surv, t_surv, y_surv = load_data_tof(
    raw=False,
    treat_levels=["SPS", "PrP"],
)
res_surv = estimate_te(
    X_surv,
    t_surv,
    y_surv,
    outcome_type="survival",
    niter=5,
    matching_scale=1.0,
    matching_is_stochastic=True,
    random_state_master=0,
)
print("Survival HR:", res_surv["te"])
```
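
For intuition about the hazard ratio printed above: under a constant-hazard (exponential) assumption, the hazard in each arm is the number of observed events divided by total follow-up time, and the ratio of the two is a crude HR. The sketch below uses made-up data and a hypothetical helper; it is not how CausalEM estimates the effect.

```python
def crude_hazard_ratio(times, events, arms, treated_label, control_label):
    """Ratio of events per unit follow-up time between two arms (constant hazards assumed)."""
    def rate(label):
        n_events = sum(e for e, a in zip(events, arms) if a == label)
        follow_up = sum(t for t, a in zip(times, arms) if a == label)
        return n_events / follow_up

    return rate(treated_label) / rate(control_label)


# Made-up right-censored data: event=1 means observed, event=0 means censored.
times = [2.0, 2.0, 4.0, 4.0, 4.0, 8.0]
events = [1, 1, 0, 1, 0, 0]
arms = ["T", "T", "T", "C", "C", "C"]
# Arm T: 2 events over 8 time units; arm C: 1 event over 16 time units.
print(crude_hazard_ratio(times, events, arms, "T", "C"))  # 4.0
```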
<!-- ## `CausalEM` in `R`
After installing the Python package, install the R wrapper:
```R
install.packages('CausalEM')
```
-->
## License
This project is licensed under the terms of the MIT License.
## Release Notes
### 0.6.1
- Corrected the version number in the `pyproject.toml` file.
### 0.6.0
- Improved consistency of return data structure when `do_stacking=False` in multi-arm TE estimation.
### 0.5.4
- Added GitHub Action for publishing to PyPI
### 0.5.3
- First public release
### 0.5.1
- Edits to readme
- Added GitHub Action for publishing to (test) PyPI
### 0.5.0
- First test release
Raw data
{
"_id": null,
"home_page": null,
"name": "causalem",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.9",
"maintainer_email": null,
"keywords": "causal-inference, matching, ensemble learning",
"author": null,
"author_email": "\"Alireza S. Mahani, Mansour T.A. Sharabiani\" <alireza.s.mahani@gmail.com>",
"download_url": "https://files.pythonhosted.org/packages/f5/cd/2828fb0a7d7826eb131c6c090e0745da44f45a51484c8722db779bb714ba/causalem-0.6.1.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "MIT License\n \n Copyright (c) 2025-Present Alireza S. Mahani, Mansour T.A. Sharabiani\n \n Permission is hereby granted, free of charge, to any person obtaining a copy\n of this software and associated documentation files (the \"Software\"), to deal\n in the Software without restriction, including without limitation the rights\n to use, copy, modify, merge, publish, distribute, sublicense, and/or sell\n copies of the Software, and to permit persons to whom the Software is\n furnished to do so, subject to the following conditions:\n \n The above copyright notice and this permission notice shall be included in all\n copies or substantial portions of the Software.\n \n THE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR\n IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,\n FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE\n AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER\n LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,\n OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE\n SOFTWARE.\n ",
"summary": "Causal Inference using Ensemble Matching",
"version": "0.6.1",
"project_urls": null,
"split_keywords": [
"causal-inference",
" matching",
" ensemble learning"
],
"urls": [
{
"comment_text": null,
"digests": {
"blake2b_256": "f261915ea0c736ec3fc586857f8a45db49b3d8f57306542e09fcd598db722a3f",
"md5": "7c35f735b6af150a3a6a62bf0effbe88",
"sha256": "5239d376d7f7dc691c73918253353eea55b6fefd2d59f5da737fb82d53221e11"
},
"downloads": -1,
"filename": "causalem-0.6.1-py3-none-any.whl",
"has_sig": false,
"md5_digest": "7c35f735b6af150a3a6a62bf0effbe88",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.9",
"size": 69209,
"upload_time": "2025-08-08T20:48:38",
"upload_time_iso_8601": "2025-08-08T20:48:38.301823Z",
"url": "https://files.pythonhosted.org/packages/f2/61/915ea0c736ec3fc586857f8a45db49b3d8f57306542e09fcd598db722a3f/causalem-0.6.1-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "f5cd2828fb0a7d7826eb131c6c090e0745da44f45a51484c8722db779bb714ba",
"md5": "d43044ca11734e3feb8c3a99ade2ec6e",
"sha256": "bd21959e8150718dbbd0fb244d62e51a9ddeee2b6eda5eb753656770b4b00790"
},
"downloads": -1,
"filename": "causalem-0.6.1.tar.gz",
"has_sig": false,
"md5_digest": "d43044ca11734e3feb8c3a99ade2ec6e",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.9",
"size": 76664,
"upload_time": "2025-08-08T20:48:39",
"upload_time_iso_8601": "2025-08-08T20:48:39.546316Z",
"url": "https://files.pythonhosted.org/packages/f5/cd/2828fb0a7d7826eb131c6c090e0745da44f45a51484c8722db779bb714ba/causalem-0.6.1.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2025-08-08 20:48:39",
"github": false,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"lcname": "causalem"
}