# DataBeak
[CI](https://github.com/jonpspri/databeak/actions/workflows/test.yml)
[Coverage](https://codecov.io/gh/jonpspri/databeak)
[Python](https://www.python.org/downloads/)
[License: Apache-2.0](https://opensource.org/licenses/Apache-2.0)
[Ruff](https://github.com/astral-sh/ruff)
## AI-Powered CSV Processing via Model Context Protocol
Transform how AI assistants work with CSV data. DataBeak provides 40+
specialized tools for data manipulation, analysis, and validation through the
Model Context Protocol (MCP).
## Features
- 🔄 **Complete Data Operations** - Load, transform, analyze, and export CSV data
- 📊 **Advanced Analytics** - Statistics, correlations, outlier detection, data
profiling
- ✅ **Data Validation** - Schema validation, quality scoring, anomaly detection
- 🎯 **Stateless Design** - Clean MCP architecture with external context
management
- ⚡ **High Performance** - Handles large datasets with streaming and chunking
- 🔒 **Session Management** - Multi-user support with isolated sessions
- 🌟 **Code Quality** - Zero ruff violations, 100% mypy compliance, fully
  documented MCP tools, comprehensive test coverage
## Getting Started
The fastest way to use DataBeak is with `uvx` (no installation required):
### For Claude Desktop
Add this to your MCP Settings file:
```json
{
"mcpServers": {
"databeak": {
"command": "uvx",
"args": [
"--from",
"git+https://github.com/jonpspri/databeak.git",
"databeak"
]
}
}
}
```
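As a quick smoke test, you can run the same invocation by hand before wiring it
into your client; it fetches DataBeak from GitHub and starts the stdio server
(interrupt with Ctrl-C once it launches):

```bash
# Same command and arguments as the config above
uvx --from git+https://github.com/jonpspri/databeak.git databeak
```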
### For Other AI Clients
DataBeak works with Continue, Cline, Windsurf, and Zed. See the
[installation guide](https://jonpspri.github.io/databeak/installation) for
specific configuration examples.
### HTTP Mode (Advanced)
For HTTP-based AI clients or custom deployments:
```bash
# Run in HTTP mode
uv run databeak --transport http --host 0.0.0.0 --port 8000
# Access server at http://localhost:8000/mcp
# Health check at http://localhost:8000/health
```
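To verify the server is up, probe the health endpoint shown above (a minimal
check assuming the default host and port; the exact response body may vary):

```bash
# Liveness probe against the health endpoint
curl http://localhost:8000/health
```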
### Quick Test
Once configured, ask your AI assistant:
```text
"Load a CSV file and show me basic statistics"
"Remove duplicate rows and export as Excel"
"Find outliers in the price column"
```
## Documentation
📚 **[Complete Documentation](https://jonpspri.github.io/databeak/)**
- [Installation Guide](https://jonpspri.github.io/databeak/installation) - Setup
for all AI clients
- [Quick Start Tutorial](https://jonpspri.github.io/databeak/tutorials/quickstart) - Learn in 10 minutes
- [API Reference](https://jonpspri.github.io/databeak/api/overview) - All 40+
tools documented
- [Architecture](https://jonpspri.github.io/databeak/architecture) - Technical
details
## Environment Variables
| Variable                    | Default | Description               |
| --------------------------- | ------- | ------------------------- |
| `DATABEAK_MAX_FILE_SIZE_MB` | 1024    | Maximum file size (MB)    |
| `DATABEAK_CSV_HISTORY_DIR`  | "."     | History storage location  |
| `DATABEAK_SESSION_TIMEOUT`  | 3600    | Session timeout (seconds) |
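These variables are read from the server's environment, so they can be exported
before launch. A minimal sketch with illustrative values:

```bash
# Raise the file-size cap to 2 GB and shorten sessions to 30 minutes (example values)
export DATABEAK_MAX_FILE_SIZE_MB=2048
export DATABEAK_SESSION_TIMEOUT=1800
uv run databeak
```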
## Known Limitations
DataBeak is designed for interactive CSV processing with AI assistants. Be aware
of these constraints:
- **File Size**: Maximum 1024MB per file (configurable via
`DATABEAK_MAX_FILE_SIZE_MB`)
- **Session Management**: Maximum 100 concurrent sessions, 1-hour timeout
(configurable)
- **Memory**: Large datasets may require significant memory; monitor with
`system_info` tool
- **CSV Dialects**: Assumes standard CSV format; complex dialects may require
pre-processing
- **Concurrency**: Single-threaded processing per session; parallel sessions
supported
- **Data Types**: Automatic type inference; complex types may need explicit
conversion
- **URL Loading**: HTTPS only; blocks private networks (127.0.0.1, 192.168.x.x,
10.x.x.x) for security
For production deployments with larger datasets, consider adjusting environment
variables and monitoring resource usage.
## Contributing
We welcome contributions! Please:
1. Fork the repository
1. Create a feature branch (`git checkout -b feature/amazing-feature`)
1. Make your changes with tests
1. Run tests and quality checks: `uv run -m pytest`, `uv run ruff check`,
   `uv run mypy src/databeak/`
1. Submit a pull request
**Note**: All changes must go through pull requests. Direct commits to `main`
are blocked by pre-commit hooks.
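To run the same guard locally (assuming the repository ships a pre-commit
configuration), install the hooks once in your clone:

```bash
# Install the repository's pre-commit hooks
uv run pre-commit install
```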
## Development
```bash
# Setup development environment
git clone https://github.com/jonpspri/databeak.git
cd databeak
uv sync
# Run the server locally
uv run databeak
# Run tests
uv run -m pytest tests/unit/ # Unit tests (primary)
uv run -m pytest # All tests
# Run quality checks
uv run ruff check
uv run mypy src/databeak/
```
### Testing Structure
DataBeak implements comprehensive unit and integration testing:
- **Unit Tests** (`tests/unit/`) - 940+ fast, isolated module tests
- **Integration Tests** (`tests/integration/`) - 43 FastMCP Client-based
protocol tests across 7 test files
- **E2E Tests** (`tests/e2e/`) - Planned: Complete workflow validation
**Test Execution:**
```bash
uv run pytest -n auto tests/unit/ # Run unit tests (940+ tests)
uv run pytest -n auto tests/integration/ # Run integration tests (43 tests)
uv run pytest -n auto --cov=src/databeak # Run with coverage analysis
```
See [Testing Guide](tests/README.md) for comprehensive testing details.
## License
Apache 2.0 - see [LICENSE](LICENSE) file.
## Support
- **Issues**: [GitHub Issues](https://github.com/jonpspri/databeak/issues)
- **Discussions**:
[GitHub Discussions](https://github.com/jonpspri/databeak/discussions)
- **Documentation**:
[jonpspri.github.io/databeak](https://jonpspri.github.io/databeak/)