# DataProbe
**DataProbe** is a comprehensive Python toolkit for debugging, profiling, and optimizing data pipelines. It provides powerful tools to track data lineage, identify bottlenecks, monitor memory usage, and visualize pipeline execution flow.
## 🚀 Features
### PipelineDebugger
* **🔍 Operation Tracking**: Automatically track execution time, memory usage, and data shapes for each operation
* **📊 Visual Pipeline Flow**: Generate interactive visualizations of your pipeline execution
* **💾 Memory Profiling**: Monitor memory usage and identify memory-intensive operations
* **🔗 Data Lineage**: Track data transformations and column changes throughout the pipeline
* **⚠️ Bottleneck Detection**: Automatically identify slow operations and memory peaks
* **📈 Performance Reports**: Generate comprehensive debugging reports with optimization suggestions
* **🎯 Error Tracking**: Capture and track errors with full traceback information
* **🌳 Nested Operations**: Support for tracking nested function calls and their relationships (see the sketch below)
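To illustrate the last two features together, here is a minimal sketch (the function and operation names are hypothetical, and it assumes the decorator re-raises captured errors; only the decorator API shown in the Quick Start below is used):

```
from dataprobe import PipelineDebugger

debugger = PipelineDebugger(name="Demo_Pipeline")

@debugger.track_operation("Parent Step")
def parent_step():
    # A tracked function calling another tracked function records
    # the parent/child relationship between the two operations
    return child_step()

@debugger.track_operation("Child Step")
def child_step():
    # Errors raised here are captured with full traceback information
    raise ValueError("bad input")

try:
    parent_step()
except ValueError:
    pass  # assumed: the decorator records the error, then re-raises it

debugger.print_summary()
```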
## 📦 Installation
```
pip install dataprobe
```
For development installation:
```
git clone https://github.com/santhoshkrishnan30/dataprobe.git
cd dataprobe
pip install -e ".[dev]"
```
## 🎯 Quick Start
### Basic Usage
```
from dataprobe import PipelineDebugger
import pandas as pd

# Initialize the debugger
debugger = PipelineDebugger(
    name="My_ETL_Pipeline",
    track_memory=True,
    track_lineage=True
)

# Use decorators to track operations
@debugger.track_operation("Load Data")
def load_data(file_path):
    return pd.read_csv(file_path)

@debugger.track_operation("Transform Data")
def transform_data(df):
    df['new_column'] = df['value'] * 2
    return df

# Run your pipeline
df = load_data("data.csv")
df = transform_data(df)

# Generate reports and visualizations
debugger.print_summary()
debugger.visualize_pipeline()
report = debugger.generate_report()
```
### Memory Profiling
```
import numpy as np
import pandas as pd

@debugger.profile_memory
def memory_intensive_operation():
    large_df = pd.DataFrame(np.random.randn(1000000, 50))
    result = large_df.groupby(large_df.index % 1000).mean()
    return result
```
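The decorated function is then called as usual; a brief usage sketch (assuming profiled calls feed the same reporting methods shown above):

```
# Runs the function normally while its memory usage is recorded
result = memory_intensive_operation()

# The recorded usage is then available via the debugger's reports
debugger.print_summary()
```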
### DataFrame Analysis
```
# Analyze DataFrames for potential issues
debugger.analyze_dataframe(df, name="Sales Data")
```
## 📊 Example Output
### Pipeline Summary
```
Pipeline Summary: My_ETL_Pipeline
├── Execution Statistics
│   ├── Total Operations: 5
│   ├── Total Duration: 2.34s
│   └── Total Memory Used: 125.6MB
├── Bottlenecks (1)
│   └── Transform Data: 1.52s
└── Memory Peaks (1)
    └── Load Large Dataset: +85.3MB
```
### Optimization Suggestions
```
- [PERFORMANCE] Transform Data: Operation took 1.52s
Suggestion: Consider optimizing this operation or parallelizing if possible
- [MEMORY] Load Large Dataset: High memory usage: +85.3MB
Suggestion: Consider processing data in chunks or optimizing memory usage
```
## 🔧 Advanced Features
### Data Lineage Tracking
```
# Export data lineage information
lineage_json = debugger.export_lineage(format="json")

# Track column changes automatically
@debugger.track_operation("Add Features")
def add_features(df):
    df['feature_1'] = df['value'].rolling(7).mean()
    df['feature_2'] = df['value'].shift(1)
    return df
```
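To keep the exported lineage alongside your pipeline artifacts, a small follow-up sketch (assuming `export_lineage(format="json")` returns a JSON string; the file name is illustrative):

```
from pathlib import Path

# Persist the lineage export for later inspection
Path("lineage.json").write_text(lineage_json)
```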
### Custom Metadata
```
@debugger.track_operation("Process Batch", batch_id=123, source="api")
def process_batch(data):
    # Operation metadata is stored and included in reports
    processed_data = data  # placeholder transformation (illustrative)
    return processed_data
```
### Checkpoint Saving
```
# Auto-save is enabled by default
debugger = PipelineDebugger(name="Pipeline", auto_save=True)
# Manual checkpoint
debugger.save_checkpoint()
```
## 📈 Performance Tips
1. **Disable Tracking in Production**: The debugger adds minimal overhead, but for production pipelines you can disable tracking:

```
debugger = PipelineDebugger(name="Pipeline", track_memory=False, track_lineage=False)
```

2. **Batch Operations**: Group small operations together to reduce tracking overhead.
3. **Memory Monitoring**: Set appropriate memory thresholds to catch issues early:

```
debugger = PipelineDebugger(name="Pipeline", memory_threshold_mb=500)
```
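Putting these tips together, a configuration sketch that combines only the options shown in this README (the pipeline name is illustrative, and the threshold unit is assumed to be megabytes):

```
debugger = PipelineDebugger(
    name="Nightly_ETL",
    track_memory=True,        # keep memory tracking while tuning
    track_lineage=False,      # disable lineage to reduce overhead
    memory_threshold_mb=500,  # flag operations that cross ~500MB
    auto_save=True,           # auto-save checkpoints (default, per above)
)
```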
## 🤝 Contributing
Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.
1. Fork the repository
2. Create your feature branch (`git checkout -b feature/AmazingFeature`)
3. Commit your changes (`git commit -m 'Add some AmazingFeature'`)
4. Push to the branch (`git push origin feature/AmazingFeature`)
5. Open a Pull Request
## 📄 License
This project is licensed under the MIT License - see the [LICENSE](https://github.com/santhoshkrishnan30/dataprobe/blob/main/LICENSE) file for details.
## 🙏 Acknowledgments
* Built with [Rich](https://github.com/Textualize/rich) for beautiful terminal output
* Uses [NetworkX](https://networkx.org/) for pipeline visualization
* Inspired by the need for better data pipeline debugging tools
## 📞 Support
* 📧 Email: [santhoshkrishnan3006@gmail.com](mailto:santhoshkrishnan3006@gmail.com)
* 🐛 Issues: [GitHub Issues](https://github.com/santhoshkrishnan30/dataprobe/issues)
* 📖 Documentation: [Read the Docs](https://dataprobe.readthedocs.io/)
## 🗺️ Roadmap
* [ ] Support for distributed pipeline debugging
* [ ] Integration with popular orchestration tools (Airflow, Prefect, Dagster)
* [ ] Real-time pipeline monitoring dashboard
* [ ] Advanced anomaly detection in data flow
* [ ] Support for streaming data pipelines
---
Made with ❤️ by Santhosh Krishnan R