# Arc + Apache Superset Integration
This directory contains the custom SQLAlchemy dialect and configuration for integrating Apache Superset with the Arc time-series data warehouse.
## Features
- ✅ **Custom SQLAlchemy dialect** for Arc API
- ✅ **Authentication** via API keys (configured once in the connection string)
- ✅ **Full SQL support** for dashboard creation
- ✅ **Auto-discovery** of Arc tables and measurements
- ✅ **Docker Compose integration**
## Quick Start
### Option 1: Install in Existing Superset
```bash
# Install the Arc dialect package
pip install arc-superset-dialect

# Restart Superset so it picks up the dialect (dev server shown)
superset run -h 0.0.0.0 -p 8088
```
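If the install succeeded, the `arc://` URL scheme should resolve without touching the network (SQLAlchemy connects lazily). A quick smoke test, assuming the package registers the dialect through the standard entry-point mechanism:

```python
# Smoke test: confirm SQLAlchemy can resolve the "arc" dialect.
# No connection is made until the engine is first used.
from sqlalchemy import create_engine

engine = create_engine("arc://dummy-key@localhost:8000/default")
print(engine.dialect.name)  # expected: "arc"
```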
### Option 2: Docker Image (Recommended for Production)
Use the included Dockerfile to build a Superset image with Arc pre-installed:
```bash
# Clone the repository
git clone https://github.com/basekick-labs/arc-superset-dialect.git
cd arc-superset-dialect

# Build the image
docker build -t superset-arc:latest .

# Run Superset with Arc support
docker run -d \
  -p 8088:8088 \
  -v superset_home:/app/superset_home \
  --name superset-arc \
  superset-arc:latest

# Check logs
docker logs -f superset-arc
```
The Dockerfile includes:
- Arc dialect pre-installed
- Custom Superset configuration
- Automatic database initialization
- Default admin user (admin/admin - **change in production!**)
### Connect to Arc
1. **Access Superset**: http://localhost:8088 (admin/admin)
2. **Create Arc Connection**:
   - Go to **Settings** → **Database Connections**
   - Click **+ Database**
   - Select **Other** as database type
   - Use this connection string:
     ```
     arc://YOUR_API_KEY@arc-api:8000/default
     ```
3. **Replace `YOUR_API_KEY`** with your Arc token (see below)
## Getting Your API Key
Arc creates an initial admin token on first startup. Check the logs:
```bash
# Docker
docker logs arc-api | grep "Initial admin token"

# Native/systemd
journalctl -u arc | grep "Initial admin token"
```
Or create a new token using the Arc CLI or Python:
```bash
# Using Python directly
DB_PATH="./data/arc.db" python3 -c "
from api.auth import AuthManager
auth = AuthManager(db_path='./data/arc.db')
token = auth.create_token('superset-integration', description='Superset dashboard access')
print(f'Token: {token}')
"
```
Save the returned token; it's shown only once!
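Before configuring Superset, you can sanity-check the token against Arc's health endpoint (the same `curl` check appears under Troubleshooting below). A minimal sketch using `requests`; host and port are placeholders for your deployment:

```python
# Verify an Arc API token against the /health endpoint.
import requests

resp = requests.get(
    "http://localhost:8000/health",  # adjust host/port for your setup
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    timeout=5,
)
print(resp.status_code, resp.text)
```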
## Multi-Database Support
Arc supports multiple databases (namespaces) within a single instance. In Superset, databases are exposed as **schemas**:
### Database Structure
```
Schema: default
  ├── cpu (CPU metrics)
  ├── mem (Memory metrics)
  └── disk (Disk metrics)

Schema: production
  ├── cpu
  ├── mem
  └── disk

Schema: staging
  ├── cpu
  ├── mem
  └── disk
```
### Using Databases in Superset
1. **View All Databases (Schemas)**:
   - When creating a dataset, select the schema (database) from the dropdown
   - Each database appears as a separate schema in Superset

2. **Query Specific Database**:
   ```sql
   -- Query default database
   SELECT * FROM cpu WHERE timestamp > NOW() - INTERVAL 1 HOUR

   -- Query specific database
   SELECT * FROM production.cpu WHERE timestamp > NOW() - INTERVAL 1 HOUR

   -- Cross-database joins
   SELECT
       p.timestamp,
       p.usage_idle as prod_cpu,
       s.usage_idle as staging_cpu
   FROM production.cpu p
   JOIN staging.cpu s ON p.timestamp = s.timestamp
   WHERE p.timestamp > NOW() - INTERVAL 1 HOUR
   ```
### Available Commands
```sql
-- List all databases
SHOW DATABASES;

-- List all tables in current database
SHOW TABLES;
```
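Because databases surface as schemas, SQLAlchemy's standard reflection API can enumerate them programmatically, assuming the dialect implements the usual inspector hooks (the auto-discovery described below suggests it does). A sketch with a placeholder key:

```python
# List Arc databases (schemas) and their tables via SQLAlchemy reflection.
from sqlalchemy import create_engine, inspect

engine = create_engine("arc://your-api-key@localhost:8000/default")
inspector = inspect(engine)

print(inspector.get_schema_names())                    # e.g. ['default', 'production', 'staging']
print(inspector.get_table_names(schema="production"))  # e.g. ['cpu', 'mem', 'disk']
```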
## Available Tables
The dialect auto-discovers all tables using `SHOW TABLES`. Common examples:
- `cpu` - CPU metrics
- `mem` - Memory metrics
- `disk` - Disk metrics
- Any custom measurements you've ingested
## Example Queries
### Basic Queries
```sql
-- Recent CPU metrics (use 'timestamp' column)
SELECT timestamp, host, usage_idle, usage_user
FROM cpu
WHERE timestamp > NOW() - INTERVAL 1 HOUR
ORDER BY timestamp DESC
LIMIT 100;

-- Average CPU usage by host
SELECT host, AVG(usage_idle) as avg_idle
FROM cpu
WHERE timestamp > NOW() - INTERVAL 24 HOUR
GROUP BY host
ORDER BY avg_idle DESC;
```
### Time-Series Aggregation
Arc supports DuckDB's powerful time functions:
```sql
-- Time bucket aggregation (5-minute intervals)
SELECT
    time_bucket(INTERVAL '5 minutes', timestamp) as bucket,
    host,
    AVG(usage_idle) as avg_idle,
    MAX(usage_user) as max_user
FROM cpu
WHERE timestamp > NOW() - INTERVAL 6 HOUR
GROUP BY bucket, host
ORDER BY bucket DESC;

-- Daily aggregation with DATE_TRUNC
SELECT
    DATE_TRUNC('day', timestamp) as day,
    host,
    AVG(usage_idle) as avg_cpu_idle,
    COUNT(*) as samples
FROM cpu
WHERE timestamp > NOW() - INTERVAL 7 DAY
GROUP BY day, host
ORDER BY day DESC;
```
### Join Queries
Join multiple measurements for correlated analysis:
```sql
-- Correlate CPU and Memory usage
SELECT
    c.timestamp,
    c.host,
    c.usage_idle as cpu_idle,
    m.used_percent as mem_used
FROM cpu c
JOIN mem m ON c.timestamp = m.timestamp AND c.host = m.host
WHERE c.timestamp > NOW() - INTERVAL 10 MINUTE
ORDER BY c.timestamp DESC
LIMIT 1000;
```
### Window Functions
```sql
-- Moving average over last 6 data points
SELECT
    timestamp,
    host,
    usage_idle,
    AVG(usage_idle) OVER (
        PARTITION BY host
        ORDER BY timestamp
        ROWS BETWEEN 5 PRECEDING AND CURRENT ROW
    ) as moving_avg
FROM cpu
ORDER BY timestamp DESC
LIMIT 100;
```
## Architecture
```
Superset → Custom Dialect → HTTP Requests → Arc API → DuckDB → Parquet Files → MinIO/S3
   ↓             ↓                ↓             ↓         ↓            ↓            ↓
Dashboard    SQL Query      API Key Auth    Query     Columnar    Compacted     Object
                                            Engine    Storage       Files       Storage
```
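In short, the dialect turns each SQL statement into an authenticated HTTP call to the Arc API. The sketch below illustrates that flow only; the endpoint path and response shape are assumptions for illustration, not Arc's documented API:

```python
# Illustrative only: how a SQL statement travels from the dialect to Arc.
# The /query path and JSON payload shape are hypothetical.
import requests

def run_arc_query(sql: str, api_key: str, host: str = "localhost", port: int = 8000) -> dict:
    resp = requests.post(
        f"http://{host}:{port}/query",                   # hypothetical endpoint
        headers={"Authorization": f"Bearer {api_key}"},  # same auth scheme as /health
        json={"query": sql},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()  # assumed JSON result payload
```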
## Connection String Format
```
arc://API_KEY@HOST:PORT/DATABASE
```
**Examples:**
```
# Local development
arc://your-api-key@localhost:8000/default

# Docker Compose
arc://your-api-key@arc-api:8000/default

# Remote server
arc://your-api-key@arc.example.com:8000/default
```
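The same URL works anywhere SQLAlchemy does, which makes it easy to test a connection outside Superset. A minimal sketch with a placeholder key and host:

```python
# Query Arc directly through SQLAlchemy using the arc:// connection string.
from sqlalchemy import create_engine, text

engine = create_engine("arc://your-api-key@localhost:8000/default")

with engine.connect() as conn:
    for row in conn.execute(text("SELECT * FROM cpu LIMIT 5")):
        print(row)
```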
## Troubleshooting
### Connection Issues
- **Verify your API key is correct**: Test with `curl -H "Authorization: Bearer YOUR_KEY" http://host:8000/health`
- **Check that Arc API is running**: `docker logs arc-api` or check systemd logs
- **Ensure network connectivity**: Superset and Arc must be on the same network or accessible via hostname
### Query Issues
- **Check Arc API logs**: `docker logs arc-api` to see query execution
- **Verify table names exist**: Use `SHOW TABLES` in Arc to list available measurements
- **Use `LIMIT` clauses**: Always limit large queries for performance
- **Add time filters**: Time-series queries should filter by time range
### Performance Tips
- **Always use `LIMIT`** for exploratory queries to avoid loading millions of rows
- **Add time range filters** for time-series data: `WHERE timestamp > NOW() - INTERVAL 1 HOUR`
- **Use column names correctly**: Arc stores timestamps in the `timestamp` column (not `time`)
- **Leverage query cache**: Arc caches query results for 60 seconds by default (see the timing sketch after this list)
- **Use compacted partitions**: Arc automatically merges small files for 10-50x faster queries
- **Optimize aggregations**: Use `time_bucket()` for time-series bucketing instead of `DATE_TRUNC` when possible
- **Partition filtering**: Include `host` or other tag filters to reduce data scanned
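To see the 60-second result cache in action, run the same statement twice and compare wall-clock times; the second run should come back noticeably faster. A rough sketch (placeholder key and host):

```python
# Rough cache demonstration: identical queries within 60 s should hit
# Arc's result cache on the second execution.
import time
from sqlalchemy import create_engine, text

engine = create_engine("arc://your-api-key@localhost:8000/default")
sql = text("SELECT host, AVG(usage_idle) FROM cpu "
           "WHERE timestamp > NOW() - INTERVAL 1 HOUR GROUP BY host")

with engine.connect() as conn:
    t0 = time.perf_counter()
    conn.execute(sql).fetchall()   # cold: scans Parquet
    t1 = time.perf_counter()
    conn.execute(sql).fetchall()   # warm: served from cache
    t2 = time.perf_counter()

print(f"cold: {t1 - t0:.3f}s, warm: {t2 - t1:.3f}s")
```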
## Package Contents
- `arc_dialect.py` - Custom SQLAlchemy dialect for Arc API
- `setup.py` - PyPI package configuration
- `MANIFEST.in` - Package file manifest
- `README.md` - This documentation
- `PUBLISHING.md` - Publishing guide for PyPI
## Security Notes
- API keys are stored encrypted in Superset's database
- All queries go through Arc's authentication system
- **Change the default Superset admin password in production**
- **Set a strong `SUPERSET_SECRET_KEY` in production** (a generation sketch follows this list)
- Use HTTPS for production deployments
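One way to generate a suitably random secret (a sketch; any cryptographically secure generator works, including the `openssl rand -base64 42` one-liner the Superset docs suggest):

```python
# Generate a random value suitable for SUPERSET_SECRET_KEY.
import secrets

print(secrets.token_urlsafe(42))
```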
## Advanced Features
### Schema Support
Arc supports multiple schemas/databases:
```sql
-- Query specific database
SELECT * FROM my_database.cpu LIMIT 10;

-- List all databases
SHOW DATABASES;
```
### Advanced DuckDB Features
Arc leverages DuckDB's full SQL capabilities:
```sql
-- List all measurements (tables)
SHOW TABLES;

-- Get table schema
DESCRIBE cpu;

-- Count records by measurement
SELECT COUNT(*) as total_records FROM cpu;

-- Complex aggregations with FILTER
SELECT
    host,
    COUNT(*) FILTER (WHERE usage_idle > 90) as idle_count,
    COUNT(*) FILTER (WHERE usage_idle < 50) as busy_count,
    AVG(usage_idle) as avg_idle
FROM cpu
WHERE timestamp > NOW() - INTERVAL 1 HOUR
GROUP BY host;
```
## Development
To modify or extend the dialect:
1. Clone the repository:
   ```bash
   git clone https://github.com/basekick-labs/arc-superset-dialect.git
   cd arc-superset-dialect
   ```

2. Edit `arc_dialect.py` (a bare-bones skeleton is sketched after this list)

3. Install locally for testing:
   ```bash
   pip install -e .
   ```

4. Test with Superset:
   ```bash
   superset run -h 0.0.0.0 -p 8088
   ```

5. Submit a PR if you add improvements!
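For orientation, a bare-bones SQLAlchemy dialect registration looks roughly like the sketch below. All names here are illustrative; the published `arc_dialect.py` is the authoritative implementation:

```python
# Skeleton of a custom SQLAlchemy dialect registration (illustrative).
from sqlalchemy.dialects import registry
from sqlalchemy.engine import default

class ArcDialect(default.DefaultDialect):
    """Bare skeleton: the real arc_dialect.py implements query execution,
    reflection (SHOW DATABASES / SHOW TABLES), and API-key auth."""
    name = "arc"
    driver = "http"
    supports_statement_cache = True

    @classmethod
    def dbapi(cls):
        # A real dialect returns a DB-API 2.0 compatible module here,
        # typically one that wraps HTTP calls to the Arc API.
        raise NotImplementedError

# Make create_engine("arc://...") resolve to this class when the package
# isn't installed via entry points (module path here is hypothetical).
registry.register("arc", "my_arc_dialect", "ArcDialect")
```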
## License
Same license as Arc Core (AGPL-3.0)