# DataProbe
**DataProbe** is a comprehensive Python toolkit for debugging, profiling, and optimizing data pipelines. It provides powerful tools to track data lineage, identify bottlenecks, monitor memory usage, and visualize pipeline execution flow with **enterprise-grade visualizations**.
## 🎨 **NEW: Enterprise-Grade Visualizations**
DataProbe v1.0.0 introduces comprehensive pipeline debugging capabilities with professional-quality visualizations, intelligent optimization recommendations, advanced memory profiling, data lineage tracking, and enterprise-grade reporting.
### **Dashboard Features**
#### 🏢 **Enterprise Dashboard**
- **KPI Panels**: Real-time success rates, duration, memory usage
- **Pipeline Flowchart**: Interactive operation flow with status indicators
- **Performance Analytics**: Memory usage timelines with peak detection
- **Data Insights**: Comprehensive lineage and transformation tracking
```python
# Generate enterprise dashboard
debugger.visualize_pipeline()
```
#### 🌐 **3D Pipeline Network**
- **3D Visualization**: Interactive network showing operation relationships
- **Performance Mapping**: Z-axis represents operation duration
- **Status Color-coding**: Visual error and bottleneck identification
```python
# Create 3D network visualization
debugger.create_3d_pipeline_visualization()
```
#### 📊 **Executive Reports**
- **Multi-page Reports**: Professional stakeholder-ready documentation
- **Performance Trends**: Dual-axis charts showing duration and memory patterns
- **Optimization Recommendations**: AI-powered suggestions for improvements
- **Data Quality Metrics**: Comprehensive pipeline health scoring
```python
# Generate executive report
debugger.generate_executive_report()
```
### **Color-Coded Status System**
- 🟢 **Success**: Operations completed without issues
- 🟡 **Warning**: Performance bottlenecks detected
- 🔴 **Error**: Failed operations requiring attention
- 🟦 **Info**: Data flow and transformation indicators
## 🚀 Features
### PipelineDebugger
* **🔍 Operation Tracking**: Automatically track execution time, memory usage, and data shapes for each operation
* **📊 Enterprise-Grade Visualizations**: Professional dashboards, 3D networks, and executive reports
* **💾 Memory Profiling**: Monitor memory usage and identify memory-intensive operations
* **🔗 Data Lineage**: Track data transformations and column changes throughout the pipeline
* **⚠️ Bottleneck Detection**: Automatically identify slow operations and memory peaks
* **📈 Performance Reports**: Generate comprehensive debugging reports with optimization suggestions
* **🎯 Error Tracking**: Capture and track errors with full traceback information
* **🌳 Nested Operations**: Support for tracking nested function calls and their relationships (see the sketch below)
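Error tracking and nested operations use the same `track_operation` decorator shown in the Quick Start below. A minimal sketch, assuming nested decorated calls are recorded as parent/child operations and that exceptions are captured with their tracebacks (the input file is hypothetical):

```python
from dataprobe import PipelineDebugger
import pandas as pd

debugger = PipelineDebugger(name="Nested_Demo")

@debugger.track_operation("Clean Data")
def clean_data(df):
    return df.dropna()

@debugger.track_operation("Load and Clean")
def load_and_clean(path):
    df = pd.read_csv(path)   # a failure here is captured with its traceback
    return clean_data(df)    # nested call, tracked as a child operation

try:
    df = load_and_clean("data.csv")  # hypothetical input file
except FileNotFoundError:
    pass  # the error is still recorded against "Load and Clean"

debugger.print_summary()
```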
## 📦 Installation
```bash
pip install dataprobe
```
For development installation:
```bash
git clone https://github.com/santhoshkrishnan30/dataprobe.git
cd dataprobe
pip install -e ".[dev]"
```
## 🎯 Quick Start
### Basic Usage with Enhanced Visualizations
```python
from dataprobe import PipelineDebugger
import pandas as pd
# Initialize the debugger with enhanced features
debugger = PipelineDebugger(
    name="My_ETL_Pipeline",
    track_memory=True,
    track_lineage=True
)
# Use decorators to track operations
@debugger.track_operation("Load Data")
def load_data(file_path):
    return pd.read_csv(file_path)

@debugger.track_operation("Transform Data")
def transform_data(df):
    df['new_column'] = df['value'] * 2
    return df
# Run your pipeline
df = load_data("data.csv")
df = transform_data(df)
# Generate enterprise-grade visualizations
debugger.visualize_pipeline() # Enterprise dashboard
debugger.create_3d_pipeline_visualization() # 3D network view
debugger.generate_executive_report() # Executive report
# Get AI-powered optimization suggestions
suggestions = debugger.suggest_optimizations()
for suggestion in suggestions:
    print(f"💡 {suggestion['suggestion']}")
# Print summary and reports
debugger.print_summary()
report = debugger.generate_report()
```
### Memory Profiling
```python
import numpy as np

@debugger.profile_memory
def memory_intensive_operation():
    large_df = pd.DataFrame(np.random.randn(1000000, 50))
    result = large_df.groupby(large_df.index % 1000).mean()
    return result
```
### DataFrame Analysis
```python
# Analyze DataFrames for potential issues
debugger.analyze_dataframe(df, name="Sales Data")
```
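As a hypothetical example, a frame with a duplicated row and a missing value (what exactly gets flagged depends on the analyzer):

```python
import pandas as pd

sales = pd.DataFrame({
    "order_id": [1, 1, 2, 3],           # duplicated row
    "value": [10.0, 10.0, None, 25.5],  # missing value
})

debugger.analyze_dataframe(sales, name="Sales Data")
```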
## 📊 Example Output
### Enterprise Dashboard
Professional KPI dashboard with real-time metrics, pipeline flowchart, memory analytics, and performance insights.
### Pipeline Summary
```
Pipeline Summary: My_ETL_Pipeline
├── Execution Statistics
│   ├── Total Operations: 5
│   ├── Total Duration: 2.34s
│   └── Total Memory Used: 125.6MB
├── Bottlenecks (1)
│   └── Transform Data: 1.52s
└── Memory Peaks (1)
    └── Load Large Dataset: +85.3MB
```
### Optimization Suggestions
```
💡 OPTIMIZATION RECOMMENDATIONS:

1. [PERFORMANCE] Transform Data
   Issue: Operation took 1.52s
   💡 Consider optimizing this operation or parallelizing if possible

2. [MEMORY] Load Large Dataset
   Issue: High memory usage: +85.3MB
   💡 Consider processing data in chunks or optimizing memory usage
```
## 🔧 Advanced Features
### Multiple Visualization Options
```python
# Enterprise dashboard - Professional KPI dashboard
debugger.visualize_pipeline()
# 3D network visualization - Interactive operation relationships
debugger.create_3d_pipeline_visualization()
# Executive report - Multi-page stakeholder documentation
debugger.generate_executive_report()
```
### Data Lineage Tracking
```python
# Export data lineage information
lineage_json = debugger.export_lineage(format="json")
# Track column changes automatically
@debugger.track_operation("Add Features")
def add_features(df):
    df['feature_1'] = df['value'].rolling(7).mean()
    df['feature_2'] = df['value'].shift(1)
    return df
```
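Assuming `export_lineage(format="json")` returns a JSON string, as the variable name above suggests, the lineage can be persisted alongside other pipeline artifacts (the path is hypothetical):

```python
from pathlib import Path

out = Path("artifacts/lineage.json")  # hypothetical location
out.parent.mkdir(parents=True, exist_ok=True)
out.write_text(lineage_json)
```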
### Custom Metadata
```python
@debugger.track_operation("Process Batch", batch_id=123, source="api")
def process_batch(data):
    # batch_id and source are stored as operation metadata
    # and included in generated reports
    return data  # replace with your processing logic
```
### Checkpoint Saving
```python
# Auto-save is enabled by default
debugger = PipelineDebugger(name="Pipeline", auto_save=True)
# Manual checkpoint
debugger.save_checkpoint()
```
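With auto-save disabled, a long-running pipeline might checkpoint between stages instead; a sketch reusing the Quick Start functions:

```python
debugger = PipelineDebugger(name="Pipeline", auto_save=False)

df = load_data("data.csv")   # from the Quick Start example
debugger.save_checkpoint()   # persist tracking state after the load

df = transform_data(df)
debugger.save_checkpoint()   # ...and again after the transform
```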
## 📈 Performance Tips
1. **Use with Context**: The debugger adds minimal overhead, but for production pipelines you can disable tracking:
```python
debugger = PipelineDebugger(name="Pipeline", track_memory=False, track_lineage=False)
```
2. **Batch Operations**: Group small operations together to reduce tracking overhead
3. **Memory Monitoring**: Set appropriate memory thresholds to catch issues early:
```python
debugger = PipelineDebugger(name="Pipeline", memory_threshold_mb=500)
```
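Putting these tips together, a hypothetical environment-driven setup keeps full tracking in development while minimizing overhead in production:

```python
import os

is_prod = os.getenv("ENV") == "production"  # hypothetical convention

debugger = PipelineDebugger(
    name="Pipeline",
    track_memory=not is_prod,    # skip memory sampling in production
    track_lineage=not is_prod,   # skip lineage bookkeeping in production
    memory_threshold_mb=500,     # flag operations using more than 500 MB
)
```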
## 💼 **Enterprise Features**
✅ **Professional Styling**: Modern design matching enterprise standards
✅ **Executive Ready**: Suitable for stakeholder presentations
✅ **Performance Insights**: AI-powered optimization recommendations
✅ **Export Options**: High-resolution PNG outputs
✅ **Responsive Design**: Scales from detailed debugging to executive overview
✅ **Real-time Metrics**: Live performance and memory tracking
## 🤝 Contributing
Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.
1. Fork the repository
2. Create your feature branch (`git checkout -b feature/AmazingFeature`)
3. Commit your changes (`git commit -m 'Add some AmazingFeature'`)
4. Push to the branch (`git push origin feature/AmazingFeature`)
5. Open a Pull Request
## 📄 License
This project is licensed under the MIT License - see the [LICENSE](https://github.com/santhoshkrishnan30/dataprobe/blob/main/LICENSE) file for details.
## 🙏 Acknowledgments
* Built with [Rich](https://github.com/Textualize/rich) for beautiful terminal output
* Uses [NetworkX](https://networkx.org/) for pipeline visualization
* Enhanced with [Matplotlib](https://matplotlib.org/) and [Seaborn](https://seaborn.pydata.org/) for enterprise-grade visualizations
* Inspired by the need for better data pipeline debugging tools
## 📞 Support
* 📧 Email: [santhoshkrishnan3006@gmail.com](mailto:santhoshkrishnan3006@gmail.com)
* 🐛 Issues: [GitHub Issues](https://github.com/santhoshkrishnan30/dataprobe/issues)
* 📖 Documentation: [Read the Docs](https://dataprobe.readthedocs.io/)
## 🗺️ Roadmap
* [X] Enterprise-grade dashboard visualizations
* [X] 3D pipeline network views
* [X] Executive-level reporting capabilities
* [ ] Support for distributed pipeline debugging
* [ ] Integration with popular orchestration tools (Airflow, Prefect, Dagster)
* [ ] Real-time pipeline monitoring dashboard
* [ ] Advanced anomaly detection in data flow
* [ ] Support for streaming data pipelines
---
Made with ❤️ by Santhosh Krishnan R