dataprobe 0.1.5

Name: dataprobe
Version: 0.1.5
Home page: https://github.com/santhoshkrishnan30/dataprobe
Summary: Advanced data pipeline debugging and profiling tools for Python
Upload time: 2025-07-20 09:58:20
Author: SANTHOSH KRISHNAN R
Requires Python: >=3.8
Keywords: data-pipeline, debugging, profiling, data-engineering, etl, data-lineage, memory-profiling, pandas, polars

# DataProbe

**DataProbe** is a comprehensive Python toolkit for debugging, profiling, and optimizing data pipelines. It provides powerful tools to track data lineage, identify bottlenecks, monitor memory usage, and visualize pipeline execution flow.

## πŸš€ Features

### PipelineDebugger

* **πŸ” Operation Tracking** : Automatically track execution time, memory usage, and data shapes for each operation
* **πŸ“Š Visual Pipeline Flow** : Generate interactive visualizations of your pipeline execution
* **πŸ’Ύ Memory Profiling** : Monitor memory usage and identify memory-intensive operations
* **πŸ”— Data Lineage** : Track data transformations and column changes throughout the pipeline
* **⚠️ Bottleneck Detection** : Automatically identify slow operations and memory peaks
* **πŸ“ˆ Performance Reports** : Generate comprehensive debugging reports with optimization suggestions
* **🎯 Error Tracking** : Capture and track errors with full traceback information
* **🌳 Nested Operations** : Support for tracking nested function calls and their relationships
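
As an illustration of the error-tracking feature, here is a minimal sketch. It assumes that a function decorated with `track_operation` re-raises its exception after recording the traceback, and that the captured error then shows up in the generated report; both are assumptions about behavior rather than confirmed API details.

```
from dataprobe import PipelineDebugger
import pandas as pd

debugger = PipelineDebugger(name="Error_Demo")

@debugger.track_operation("Parse Dates")
def parse_dates(df):
    # Raises KeyError because the 'date' column is missing; the debugger
    # is assumed to record the full traceback before re-raising.
    return pd.to_datetime(df["date"])

try:
    parse_dates(pd.DataFrame({"value": [1, 2, 3]}))
except KeyError:
    pass  # the failure should now be visible in the report

report = debugger.generate_report()
```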

## πŸ“¦ Installation

```
pip install dataprobe
```

For development installation:

```
git clone https://github.com/santhoshkrishnan30/dataprobe.git
cd dataprobe
pip install -e ".[dev]"
```

## 🎯 Quick Start

### Basic Usage

```
from dataprobe import PipelineDebugger
import pandas as pd

# Initialize the debugger
debugger = PipelineDebugger(
    name="My_ETL_Pipeline",
    track_memory=True,
    track_lineage=True
)

# Use decorators to track operations
@debugger.track_operation("Load Data")
def load_data(file_path):
    return pd.read_csv(file_path)

@debugger.track_operation("Transform Data")
def transform_data(df):
    df['new_column'] = df['value'] * 2
    return df

# Run your pipeline
df = load_data("data.csv")
df = transform_data(df)

# Generate reports and visualizations
debugger.print_summary()
debugger.visualize_pipeline()
report = debugger.generate_report()
```
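
To try the quick start end-to-end, you can generate a small `data.csv` first. The single `value` column below is illustrative, chosen only to match the `transform_data` step above:

```
import numpy as np
import pandas as pd

# Write a small sample file with the 'value' column the pipeline expects
rng = np.random.default_rng(0)
pd.DataFrame({"value": rng.integers(0, 100, size=1000)}).to_csv("data.csv", index=False)
```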

### Memory Profiling

```
import numpy as np

@debugger.profile_memory
def memory_intensive_operation():
    # Allocate a large frame, then aggregate it down to 1,000 groups
    large_df = pd.DataFrame(np.random.randn(1_000_000, 50))
    result = large_df.groupby(large_df.index % 1000).mean()
    return result
```

### DataFrame Analysis

```
# Analyze DataFrames for potential issues
debugger.analyze_dataframe(df, name="Sales Data")
```
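
For example, you can point it at a frame with deliberately messy data. Which issues `analyze_dataframe` actually flags (nulls, duplicate rows, memory footprint, and so on) depends on the library; the frame below is just an illustrative input:

```
import numpy as np
import pandas as pd

messy = pd.DataFrame({
    "id": [1, 2, 2, 4],                    # duplicated key
    "amount": [10.0, np.nan, 30.0, 40.0],  # missing value
    "category": ["a", "b", "b", None],     # nullable object column
})

debugger.analyze_dataframe(messy, name="Sales Data")
```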

## πŸ“Š Example Output

### Pipeline Summary

```
Pipeline Summary: My_ETL_Pipeline
β”œβ”€β”€ Execution Statistics
β”‚   β”œβ”€β”€ Total Operations: 5
β”‚   β”œβ”€β”€ Total Duration: 2.34s
β”‚   └── Total Memory Used: 125.6MB
β”œβ”€β”€ Bottlenecks (1)
β”‚   └── Transform Data: 1.52s
└── Memory Peaks (1)
    └── Load Large Dataset: +85.3MB
```

### Optimization Suggestions

```
- [PERFORMANCE] Transform Data: Operation took 1.52s
  Suggestion: Consider optimizing this operation or parallelizing if possible

- [MEMORY] Load Large Dataset: High memory usage: +85.3MB
  Suggestion: Consider processing data in chunks or optimizing memory usage
```

## πŸ”§ Advanced Features

### Data Lineage Tracking

```
# Export data lineage information
lineage_json = debugger.export_lineage(format="json")

# Track column changes automatically
@debugger.track_operation("Add Features")
def add_features(df):
    df['feature_1'] = df['value'].rolling(7).mean()
    df['feature_2'] = df['value'].shift(1)
    return df
```
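
Assuming `export_lineage(format="json")` returns a JSON string (an assumption; it may return an already-parsed object), the result can be inspected with the standard library:

```
import json

lineage = json.loads(lineage_json)          # assumes a JSON string is returned
print(json.dumps(lineage, indent=2)[:500])  # preview the first part
```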

### Custom Metadata

```
@debugger.track_operation("Process Batch", batch_id=123, source="api")
defprocess_batch(data):
    # Operation metadata is stored and included in reports
    return processed_data
```

### Checkpoint Saving

```
# Auto-save is enabled by default
debugger = PipelineDebugger(name="Pipeline", auto_save=True)

# Manual checkpoint
debugger.save_checkpoint()
```

## πŸ“ˆ Performance Tips

1. **Disable Tracking in Production**: The debugger adds minimal overhead, but for production pipelines you can turn tracking off:

```
debugger = PipelineDebugger(name="Pipeline", track_memory=False, track_lineage=False)
```

2. **Batch Operations**: Group small operations together to reduce tracking overhead.
3. **Memory Monitoring**: Set appropriate memory thresholds to catch issues early:

```
debugger = PipelineDebugger(name="Pipeline", memory_threshold_mb=500)
```
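
One way to apply these tips together is to toggle tracking from the environment, so the same pipeline code runs fully instrumented in development and lightweight in production. This is a minimal sketch: the `DATAPROBE_DEBUG` variable is a hypothetical convention, not part of the library, while the three keyword arguments are the documented ones shown above.

```
import os

from dataprobe import PipelineDebugger

debug = os.environ.get("DATAPROBE_DEBUG", "0") == "1"  # hypothetical toggle

debugger = PipelineDebugger(
    name="Pipeline",
    track_memory=debug,       # documented flag (tip 1)
    track_lineage=debug,      # documented flag (tip 1)
    memory_threshold_mb=500,  # documented flag (tip 3)
)
```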

## 🀝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.

1. Fork the repository
2. Create your feature branch (`git checkout -b feature/AmazingFeature`)
3. Commit your changes (`git commit -m 'Add some AmazingFeature'`)
4. Push to the branch (`git push origin feature/AmazingFeature`)
5. Open a Pull Request

## πŸ“„ License

This project is licensed under the MIT License - see the [LICENSE](https://github.com/santhoshkrishnan30/dataprobe/blob/main/LICENSE) file for details.

## πŸ™ Acknowledgments

* Built with [Rich](https://github.com/Textualize/rich) for beautiful terminal output
* Uses [NetworkX](https://networkx.org/) for pipeline visualization
* Inspired by the need for better data pipeline debugging tools

## πŸ“ž Support

* πŸ“§ Email: [santhoshkrishnan3006@gmail.com](mailto:santhoshkrishnan3006@gmail.com)
* πŸ› Issues: [GitHub Issues](https://github.com/santhoshkrishnan30/dataprobe/issues)
* πŸ“– Documentation: [Read the Docs](https://dataprobe.readthedocs.io/)

## πŸ—ΊοΈ Roadmap

* [ ] Support for distributed pipeline debugging
* [ ] Integration with popular orchestration tools (Airflow, Prefect, Dagster)
* [ ] Real-time pipeline monitoring dashboard
* [ ] Advanced anomaly detection in data flow
* [ ] Support for streaming data pipelines

---

Made with ❀️ by Santhosh Krishnan R

            
