# PyAutoCausal
**Automated causal inference pipelines for data scientists**
## Why Causal Inference Matters in Tech
As data scientists, we're often asked to go beyond correlation and answer causal questions:
- "Did our new recommendation algorithm actually increase user engagement, or was it just seasonal trends?"
- "What's the true impact of our premium subscription tier on customer retention?"
- "How much did our marketing campaign increase conversions versus organic growth?"
- "Did our product redesign cause the drop in user activity, or was it market conditions?"
These questions can't be answered with standard predictive models or A/B tests alone. Real-world constraints often prevent randomized experiments:
- **Ethical concerns**: Can't randomly deny users important features
- **Business constraints**: Can't risk revenue on large-scale experiments
- **Natural experiments**: Sometimes changes happen organically (competitor exits, policy changes)
- **Historical analysis**: Need to evaluate past decisions without experimental data
## The Challenge of Observational Data
When working with observational data (logs, user behavior, historical metrics), we face fundamental challenges:
1. **Confounding**: Users who adopt premium features might be inherently more engaged
2. **Selection bias**: Treatment assignment isn't random
3. **Time-varying effects**: Impact changes over time
4. **Heterogeneous effects**: Different user segments respond differently
Traditional ML models are built for prediction, not causal inference. They'll happily exploit confounders and selection bias to maximize accuracy, giving you precisely wrong answers to causal questions.
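To make the confounding problem concrete, here is a minimal simulation (independent of PyAutoCausal) in which a hidden engagement propensity drives both feature adoption and the outcome. The naive treated-vs-control comparison badly overstates the true effect:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Hidden confounder: intrinsically engaged users both adopt the
# feature more often AND have higher engagement regardless.
propensity = rng.normal(size=n)
treated = (propensity + rng.normal(size=n) > 0).astype(float)

true_effect = 2.0
y = true_effect * treated + 3.0 * propensity + rng.normal(size=n)

# Naive "treated minus control" difference in means
naive = y[treated == 1].mean() - y[treated == 0].mean()
print(f"true effect: {true_effect}, naive estimate: {naive:.2f}")
# The naive estimate lands well above the true effect of 2.0,
# because treated users were more engaged to begin with.
```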
## PyAutoCausal: Causal Inference Made Practical
PyAutoCausal automates the complex decision tree of modern causal inference methods. Instead of manually implementing and choosing between dozens of estimators, PyAutoCausal:
1. **Analyzes your data structure** to understand treatment timing, units, and available controls
2. **Selects appropriate methods** based on your data characteristics
3. **Validates assumptions** and warns about potential violations
4. **Executes analysis** with proper statistical inference
5. **Exports results** in formats ready for stakeholder communication
## Quick Example: Measuring Feature Impact
```python
from pyautocausal.pipelines.example_graph import causal_pipeline
import pandas as pd
# Your product data with treatment (feature rollout) and outcome (engagement)
data = pd.DataFrame({
    'id_unit': [...],  # User identifier
    't': [...],        # Time periods
    'treat': [...],    # 1 if user has feature, 0 otherwise
    'y': [...],        # Your KPI (DAU, sessions, revenue, etc.)
    'x1': [...],       # User characteristics
    'x2': [...]        # Additional controls
})
# PyAutoCausal automatically:
# - Detects this is panel data with staggered treatment
# - Chooses modern DiD methods (e.g., Callaway-Sant'Anna)
# - Handles heterogeneous treatment effects
# - Produces event study plots
pipeline = causal_pipeline(output_path="./feature_impact_analysis")
pipeline.fit(df=data)
# Results include:
# - Average treatment effect with confidence intervals
# - Dynamic effects over time since treatment
# - Heterogeneity analysis across user segments
# - Diagnostic plots and assumption checks
```
## Real Tech Applications
### Product & Feature Analysis
- **Feature rollout impact**: Measure true lift from new features beyond selection effects
- **UI/UX changes**: Isolate design impact from user self-selection
- **Pricing changes**: Estimate elasticity when users choose their plans
- **Platform migrations**: Quantify the causal effect of moving users to new systems
### Marketing & Growth
- **Campaign effectiveness**: Separate campaign impact from organic trends
- **Channel attribution**: Understand true incremental value of marketing channels
- **Retention interventions**: Measure causal impact of win-back campaigns
- **Geographic expansions**: Estimate market entry effects using synthetic controls
### Business Operations
- **Policy changes**: Evaluate impact of new policies on user behavior
- **Competitive effects**: Measure how competitor actions affect your metrics
- **Seasonal adjustments**: Separate true treatment effects from seasonality
- **Long-term impacts**: Track how effects evolve over months/years
## Why Automation Matters
Modern causal inference has seen an explosion of methods in recent years, and choosing correctly among them requires deep knowledge of:
- Parallel trends assumptions
- Staggered treatment timing
- Heterogeneous treatment effects
- Two-way fixed effects bias
- Synthetic control construction
PyAutoCausal encodes this expertise, automatically routing your analysis through the appropriate methods while maintaining transparency about assumptions and limitations.
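As one illustration of the expertise involved: the parallel-trends assumption behind difference-in-differences can be sanity-checked by comparing pre-treatment outcome trends across groups. A toy sketch with pandas (the column names and check here are illustrative, not PyAutoCausal's internals):

```python
import pandas as pd

# Toy panel: treatment begins at t=3 for the treated group
panel = pd.DataFrame({
    'id_unit': [1, 1, 1, 2, 2, 2],
    't':       [1, 2, 3, 1, 2, 3],
    'group':   ['treated'] * 3 + ['control'] * 3,
    'y':       [1.0, 2.0, 5.0, 1.1, 2.1, 2.2],
})

# Average period-over-period change in y before treatment (t < 3)
pre = panel[panel['t'] < 3].sort_values('t')
pre_slopes = pre.groupby('group')['y'].agg(lambda s: s.diff().mean())
print(pre_slopes)
# Similar pre-treatment slopes support (but never prove) parallel trends.
```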
## Installation
```bash
pip install pyautocausal
```
Or for development:
```bash
git clone https://github.com/NickyTops23/pyautocausal.git
cd pyautocausal
poetry install
```
## Core Concepts
### Graph-Based Pipeline Architecture
PyAutoCausal organizes causal analysis as directed graphs of computational nodes:
```python
from pyautocausal.orchestration.graph import ExecutableGraph
from pyautocausal.persistence.output_config import OutputConfig, OutputType
import pandas as pd
# Build custom pipelines using the graph API
graph = (ExecutableGraph()
    .configure_runtime(output_path="./outputs")
    .create_input_node("data", input_dtype=pd.DataFrame)
    .create_decision_node("has_multiple_periods",
                          condition=lambda df: df['t'].nunique() > 1,
                          predecessors=["data"])
    .create_node("cross_sectional_analysis",
                 cross_sectional_estimator,
                 predecessors=["has_multiple_periods"])
    .create_node("panel_analysis",
                 panel_estimator,
                 predecessors=["has_multiple_periods"])
    .when_false("has_multiple_periods", "cross_sectional_analysis")
    .when_true("has_multiple_periods", "panel_analysis")
)
graph.fit(data=your_dataframe)
```
### Automated Method Selection
The framework automatically routes your data through appropriate causal inference methods:
- **Cross-sectional** (single time period) → OLS with robust inference
- **Panel with single treated unit** → Synthetic control methods
- **Panel with multiple treated units** → Modern DiD estimators
- **Staggered treatment adoption** → Callaway-Sant'Anna, Goodman-Bacon decomposition
- **Large datasets** → Double/debiased machine learning approaches
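The routing rules above can be pictured as a plain decision function. This is only a sketch of the idea — the function name and logic here are illustrative, not the library's actual implementation:

```python
import pandas as pd

def sketch_method_choice(df: pd.DataFrame) -> str:
    """Hypothetical router mirroring the selection rules above."""
    if df['t'].nunique() == 1:
        return "ols_robust"                # cross-sectional data
    treated_units = df.loc[df['treat'] == 1, 'id_unit'].nunique()
    if treated_units == 1:
        return "synthetic_control"         # single treated unit
    # First treated period per unit; multiple distinct start
    # dates indicate staggered adoption.
    starts = df[df['treat'] == 1].groupby('id_unit')['t'].min()
    if starts.nunique() > 1:
        return "callaway_santanna"         # staggered DiD
    return "did"                           # common treatment date

demo = pd.DataFrame({
    'id_unit': [1, 1, 2, 2],
    't':       [1, 2, 1, 2],
    'treat':   [0, 1, 0, 0],
})
print(sketch_method_choice(demo))  # → synthetic_control
```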
### Built-in Validation
Every analysis includes:
- **Data quality checks**: Missing values, duplicates, proper formatting
- **Assumption testing**: Parallel trends, common support, balance
- **Robustness checks**: Alternative specifications and estimators
- **Diagnostic plots**: Visual assumption validation
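A rough sketch of what the data quality layer looks for, as plain pandas (illustrative only; the function name is hypothetical and the library's actual checks are richer):

```python
import pandas as pd

def basic_quality_checks(df: pd.DataFrame) -> list[str]:
    """Flag common panel-data problems before any estimation."""
    issues = []
    if df[['id_unit', 't', 'treat', 'y']].isna().any().any():
        issues.append("missing values in key columns")
    if df.duplicated(subset=['id_unit', 't']).any():
        issues.append("duplicate unit-period rows")
    if not set(df['treat'].unique()) <= {0, 1}:
        issues.append("treatment indicator is not binary")
    return issues

df = pd.DataFrame({'id_unit': [1, 1], 't': [1, 1],
                   'treat': [0, 2], 'y': [1.0, 2.0]})
print(basic_quality_checks(df))
# → flags the duplicated unit-period row and the non-binary treatment
```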
## Project Structure
```
pyautocausal/
├── orchestration/            # Core graph execution framework
│   ├── graph.py              # ExecutableGraph class and execution logic
│   ├── nodes.py              # Node types (standard, decision, input)
│   └── ...
├── pipelines/                # Pre-built causal inference workflows
│   ├── library/              # Reusable causal analysis components
│   │   ├── specifications.py # Treatment/outcome specifications
│   │   ├── estimators.py     # Statistical estimators
│   │   ├── conditions.py     # Data characteristic detectors
│   │   ├── plots.py          # Visualization functions
│   │   └── ...
│   └── example_graph.py      # Main causal inference pipeline
├── causal_methods/           # Core statistical methods
│   └── double_ml.py          # DoubleML implementation
├── persistence/              # Output handling and export
│   ├── notebook_export.py    # Jupyter notebook generation
│   ├── output_config.py      # Output format configuration
│   └── ...
└── utils/                    # Utility functions
```
## Next Steps
- **📖 [Getting Started Guide](docs/getting-started.md)** - Step-by-step tutorial
- **📊 [Causal Methods Reference](docs/causal-methods.md)** - All available estimators
- **🔧 [Pipeline Development](docs/pipeline-guide.md)** - Building custom workflows
- **📋 [Data Requirements](docs/data-requirements.md)** - Input formats and validation
- **💡 [Examples](docs/examples/)** - Real-world case studies
## Contributing
We welcome contributions! Please see our [contributing guidelines](CONTRIBUTING.md) for details.
## License
[MIT License](LICENSE)
## Citation
If you use PyAutoCausal in your research, please cite:
```bibtex
@software{pyautocausal,
  title={PyAutoCausal: Automated Causal Inference Pipelines},
  author={Topousis, Nicholas},
  year={2024},
  url={https://github.com/NickyTops23/pyautocausal}
}
```