| Field | Value |
|-------|-------|
| Name | flow-acc |
| Version | 2.0.0 |
| Summary | Simplified SDK for Foundry Compute Platform - GPU compute made simple |
| upload_time | 2025-07-20 13:37:52 |
| requires_python | >=3.8 |
| license | Apache-2.0 |
| keywords | ai, cloud, compute, foundry, gpu, ml |
# Flow SDK
[PyPI](https://pypi.org/project/flow-sdk/) · [License](LICENSE.txt) · [Documentation](https://docs.mithril.ai/flow)
**GPU compute in seconds, not hours** — Flow SDK provides seamless access to GPU infrastructure with a single, simple API. The Flow system is tailored for launching batch tasks and experiments.
## Table of Contents
- [Quick Start](#quick-start)
- [Why Flow](#why-flow)
- [Overview](#overview)
- [Installation](#installation)
- [Authentication](#authentication)
- [Basic Usage](#basic-usage)
- [Guide for SLURM Users](#guide-for-slurm-users)
- [Instance Types](#instance-types)
- [Task Management](#task-management)
- [Persistent Storage](#persistent-storage)
- [Zero-Import Remote Execution](#zero-import-remote-execution)
- [Decorator Pattern](#decorator-pattern)
- [Data Mounting](#data-mounting)
- [Distributed Training](#distributed-training)
- [Advanced Features](#advanced-features)
- [Error Handling](#error-handling)
- [Common Patterns](#common-patterns)
- [Architecture](#architecture)
- [Performance](#performance)
- [Troubleshooting](#troubleshooting)
- [Support](#support)
- [License](#license)
## Quick Start
**Prerequisites:** Get your API key at [app.mlfoundry.com](https://app.mlfoundry.com/account/apikeys)
### 1. Install and Configure
```bash
pip install flow-sdk
flow init # One-time setup wizard
```
### 2. Run on GPU
```python
import flow
# Your code launches on GPU in minutes
task = flow.run("python train.py", instance_type="a100")
```
That's it. Your local `train.py` file and project files are automatically uploaded and running on an A100 GPU.
**Note:** Your code is uploaded, but you need to install dependencies. See [Handling Dependencies](#handling-dependencies).
## Why Flow
Long-standing AI research labs have invested in building sophisticated infrastructure abstractions that let researchers focus on research rather than DevOps. DeepMind's XManager handles experiments from single GPUs to hundreds of hosts. Meta's submitit brings Python-native patterns to cluster computing. OpenAI's internal platform was designed to seamlessly scale from interactive notebooks to thousand-GPU training runs.
Flow brings these same capabilities to every AI developer. Like these internal tools, Flow provides:
- **Progressive disclosure** - Simple tasks stay simple, complex workflows remain possible
- **Unified abstraction** - One interface whether running locally or across cloud hardware
- **Fail-fast validation** - Catch configuration errors before expensive compute starts
- **Experiment tracking** - Built-in task history and reproducibility
The goal: democratize the infrastructure abstractions that enable breakthrough AI research.
## Overview
Flow SDK provides a high-level interface for GPU workload submission across heterogeneous infrastructure. Our design philosophy emphasizes explicit behavior, progressive disclosure, and fail-fast validation.
```
┌─────────────┐      ┌──────────────┐      ┌─────────────────┐
│  Your Code  │ -->  │   Flow SDK   │ -->  │   Cloud Infra   │
│   train.py  │      │ Unified API  │      │ FCP, ... others │
└─────────────┘      └──────────────┘      └─────────────────┘
     Local            client-side          cloud accelerators
```
### Core Capabilities
- **Unified API**: Single interface across cloud providers (FCP, AWS, GCP, Azure)
- **Zero DevOps**: Automatic instance provisioning, driver setup, and environment configuration
- **Cost Control**: Built-in safeguards with max price and runtime limits
- **Persistent Storage**: Volumes that persist across task lifecycles
- **Multi-Node**: Native support for distributed training
- **Real-Time Monitoring**: Log streaming, SSH access, and status tracking
- **Notebook Integration**: Google Colab and Jupyter notebook support
## Installation
```bash
pip install flow-sdk
```
Requirements:
- Python 3.11+
- Linux, macOS, or Windows
- API key from [ML Foundry](https://app.mlfoundry.com)
## Authentication
Run the interactive setup wizard:
```bash
uv run flow init
```
This will:
1. Prompt for your API key (get one at [app.mlfoundry.com](https://app.mlfoundry.com))
2. Help you select a project
3. Configure SSH keys (optional)
4. Save settings for all Flow tools
Alternative methods:
**Environment Variables**
```bash
export FCP_API_KEY="fcp-..."
export FCP_PROJECT="my-project"
```
**Manual Config File**
```yaml
# ~/.flow/config.yaml
api_key: fcp-...
project: my-project
region: us-central1-a
```
**Verify Setup**
```bash
uv run flow status
# Should show "No tasks found" if authenticated
```
## Basic Usage
### Python API
```python
import flow
from flow import TaskConfig
# Simple GPU job - automatically uploads your local code
task = flow.run("python train.py", instance_type="a100")
# Wait for completion
task.wait()
print(task.logs())
# Full configuration
config = TaskConfig(
    name="distributed-training",
    instance_type="8xa100",  # 8x A100 GPUs
    command=["python", "-m", "torch.distributed.launch",
             "--nproc_per_node=8", "train.py"],
    volumes=[{"size_gb": 100, "mount_path": "/data"}],
    max_price_per_hour=25.0,  # Cost protection
    max_run_time_hours=24.0   # Time limit
)
task = flow.run(config)
# Monitor execution
print(f"Status: {task.status}")
print(f"SSH: {task.ssh_command}")
print(f"Cost: {task.cost_per_hour}")
```
### Code Upload
By default, Flow automatically uploads your current directory to the GPU instance:
```python
# This uploads your local files and runs them on GPU
task = flow.run("python train.py", instance_type="a100")
# Disable code upload (use pre-built Docker image)
task = flow.run(
    "python /app/train.py",
    instance_type="a100",
    image="mycompany/training:latest",
    upload_code=False
)
```
Use a `.flowignore` file to exclude files from upload (same syntax as `.gitignore`).
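A minimal `.flowignore` might look like this (the patterns below are illustrative, not required):

```
# .flowignore - gitignore-style patterns excluded from upload
.git/
__pycache__/
*.pt
data/
wandb/
```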
### Handling Dependencies
Your code is uploaded, but dependencies need to be installed:
```python
# Install from requirements.txt
task = flow.run(
    "pip install -r requirements.txt && python train.py",
    instance_type="a100"
)

# Using uv (recommended for speed)
task = flow.run(
    "uv pip install . && uv run python train.py",
    instance_type="a100"
)

# Pre-installed in Docker image (fastest)
task = flow.run(
    "python train.py",
    instance_type="a100",
    image="pytorch/pytorch:2.0.0-cuda11.8-cudnn8"  # PyTorch pre-installed
)
```
### Command Line
```bash
# Submit tasks
uv run flow run "python train.py" --instance-type a100
uv run flow run config.yaml
# Monitor tasks
uv run flow status # List all tasks
uv run flow logs task-abc123 -f # Stream logs
uv run flow ssh task-abc123 # SSH access
# Manage tasks
uv run flow cancel task-abc123 # Stop execution
```
### YAML Configuration
```yaml
# config.yaml
name: model-training
instance_type: 4xa100
command: python train.py --epochs 100
env:
  BATCH_SIZE: "256"
  LEARNING_RATE: "0.001"
volumes:
  - size_gb: 500
    mount_path: /data
    name: training-data
max_price_per_hour: 20.0
max_run_time_hours: 72.0
ssh_keys:
  - my-ssh-key
```
## Guide for SLURM Users
Flow SDK provides a modern cloud-native alternative to SLURM while maintaining compatibility with existing workflows. This guide helps SLURM users transition to Flow.
### Command Equivalents
| SLURM Command | Flow Command | Description |
|---------------|--------------|-------------|
| `sbatch job.sh` | `flow run job.yaml` | Submit batch job |
| `sbatch script.slurm` | `flow run script.slurm` | Direct SLURM script support |
| `squeue` | `flow status` | View job queue |
| `scancel <job_id>` | `flow cancel <task_id>` | Cancel job |
| `scontrol show job <id>` | `flow info <task_id>` | Show job details |
| `sacct` | *Not applicable* | Flow tracks costs differently |
| `sinfo` | *Not applicable* | Cloud resources are dynamic |
| `srun` | `flow ssh <task_id>` | Interactive access |
### Log Access
```bash
# SLURM: View output files
cat slurm-12345.out
# Flow: Stream logs directly
uv run flow logs task-abc123
uv run flow logs task-abc123 --follow # Like tail -f
uv run flow logs task-abc123 --stderr # Error output
```
### SLURM Script Compatibility
Flow can directly run existing SLURM scripts:
```bash
# Your existing SLURM script
uv run flow run job.slurm
# Behind the scenes, Flow parses #SBATCH directives:
#SBATCH --job-name=training
#SBATCH --nodes=2
#SBATCH --gpus=a100:4
#SBATCH --time=24:00:00
#SBATCH --mem=64G
```
### Migration Examples
#### Basic GPU Job
**SLURM:**
```bash
#!/bin/bash
#SBATCH --job-name=train-model
#SBATCH --partition=gpu
#SBATCH --gpus=1
#SBATCH --time=12:00:00
#SBATCH --mem=32G
module load cuda/11.8
python train.py
```
**Flow (YAML):**
```yaml
name: train-model
instance_type: a100
command: python train.py
max_run_time_hours: 12.0
```
**Flow (Python):**
```python
flow.run("python train.py", instance_type="a100", max_run_time_hours=12)
```
#### Multi-GPU Training
**SLURM:**
```bash
#!/bin/bash
#SBATCH --job-name=distributed
#SBATCH --nodes=4
#SBATCH --gpus-per-node=8
#SBATCH --ntasks-per-node=8
srun python -m torch.distributed.launch train.py
```
**Flow:**
```yaml
name: distributed
instance_type: 8xa100
num_instances: 4
command: |
  torchrun --nproc_per_node=8 --nnodes=4 \
    --node_rank=$FLOW_NODE_RANK \
    --master_addr=$FLOW_MAIN_IP \
    train.py
```
#### Array Jobs
**SLURM:**
```bash
#!/bin/bash
#SBATCH --array=1-10
#SBATCH --job-name=sweep
python experiment.py --task-id $SLURM_ARRAY_TASK_ID
```
**Flow (using loop):**
```python
for i in range(1, 11):
    flow.run(f"python experiment.py --task-id {i}",
             name=f"sweep-{i}", instance_type="a100")
```
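To block until the whole sweep finishes, a hedged sketch collecting the task handles returned by `flow.run` and using the documented `task.wait()`:

```python
import flow

tasks = [
    flow.run(f"python experiment.py --task-id {i}",
             name=f"sweep-{i}", instance_type="a100")
    for i in range(1, 11)
]
for task in tasks:
    task.wait()  # Roughly the moral equivalent of waiting on a SLURM array
```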
### Key Differences
1. **Resource Allocation**: Flow uses instance types (e.g., `a100`, `4xa100`) instead of partition/node specifications
2. **Cost Control**: Built-in `max_price_per_hour` instead of account-based billing
3. **Storage**: Cloud volumes (block storage) instead of shared filesystems
   - FCP platform supports both block storage and file shares
   - Flow SDK currently only creates block storage volumes (requires mounting/formatting)
   - File share support is planned for easier multi-node access
4. **Environment**: Container-based instead of module system
5. **Scheduling**: Cloud-native provisioning instead of queue-based scheduling
### Environment Variables
When using the SLURM adapter (`flow run script.slurm`), Flow sets SLURM-compatible environment variables:
| SLURM Variable | Set By SLURM Adapter | Flow Native Variable |
|----------------|---------------------|---------------------|
| `SLURM_JOB_ID` | ✓ (maps to `$FLOW_TASK_ID`) | `FLOW_TASK_ID` |
| `SLURM_JOB_NAME` | ✓ | `FLOW_TASK_NAME` |
| `SLURM_ARRAY_TASK_ID` | ✓ (planned) | - |
| `SLURM_NTASKS` | ✓ | - |
| `SLURM_CPUS_PER_TASK` | ✓ | - |
| `SLURM_NNODES` | ✓ | `FLOW_NODE_COUNT` |
| `SLURM_JOB_PARTITION` | ✓ (if set) | - |
For all Flow tasks (regardless of adapter), these variables are available:
- `FLOW_TASK_ID` - Unique task identifier
- `FLOW_TASK_NAME` - Task name from config
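As a quick sanity check, a task script can read these at runtime (a minimal sketch; per the table above, only the `FLOW_*` variables are guaranteed outside the SLURM adapter):

```python
import os

# Available in every Flow task
task_id = os.environ.get("FLOW_TASK_ID", "unknown")
task_name = os.environ.get("FLOW_TASK_NAME", "unnamed")
print(f"Running {task_name} ({task_id})")

# Present only when launched via the SLURM adapter
if "SLURM_JOB_ID" in os.environ:
    print(f"SLURM-compatible job id: {os.environ['SLURM_JOB_ID']}")
```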
### Advanced Features
**Module System → Container Images:**
```yaml
# SLURM: module load pytorch/2.0
# Flow equivalent:
image: pytorch/pytorch:2.0.0-cuda11.8-cudnn8
```
**Dependency Management:**
```bash
# SLURM: --dependency=afterok:12345
# Flow: Use task.wait() in Python or chain commands
```
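For example, a minimal sketch of the `task.wait()` approach (scripts and names are illustrative):

```python
import flow

# Stage 1: preprocessing must finish before training starts
prep = flow.run("python preprocess.py", instance_type="a100", name="prep")
prep.wait()  # Comparable in spirit to --dependency=afterok

# Stage 2: launched only after preprocessing completed
train = flow.run("python train.py", instance_type="8xa100", name="train")
```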
**Output Formatting:**
```bash
# Get SLURM-style output (coming soon)
uv run flow status --format=slurm
```
### Future Compatibility
We're considering adding direct SLURM command aliases for easier migration:
- `flow sbatch` → `flow run`
- `flow squeue` → `flow status`
- `flow scancel` → `flow cancel`
If you need specific SLURM features, please [open an issue](https://github.com/mlfoundry/flow-sdk/issues).
## Instance Types
| Type | GPUs | Total Memory |
|------|------|--------------|
| `a100` | 1x A100 | 80GB |
| `4xa100` | 4x A100 | 320GB |
| `8xa100` | 8x A100 | 640GB |
| `8xh100` | 8x H100 | 640GB |
```python
# Examples
flow.run("python train.py", instance_type="a100") # Single GPU
flow.run("python train.py", instance_type="4xa100") # Multi-GPU
flow.run("python train.py", instance_type="8xh100") # Maximum performance
```
## Task Management
### Task Object
```python
# Get task handle
task = flow.run(config)
# Or retrieve existing
task = flow.get_task("task-abc123")
# Properties
task.task_id # Unique identifier
task.status # Current state
task.ssh_command # SSH connection string
task.cost_per_hour # Current pricing
task.created_at # Submission time
# Methods
task.wait(timeout=3600) # Block until complete
task.refresh() # Update status
task.cancel() # Terminate execution
```
### Logging
```python
# Get recent logs
logs = task.logs(tail=100)
# Stream in real-time
for line in task.logs(follow=True):
    if "loss:" in line:
        print(line)
```
### SSH Access
```python
# Interactive shell
task.ssh()
# Run command
task.ssh("nvidia-smi")
task.ssh("tail -f /workspace/train.log")
# Multi-node access
task.ssh(node=1) # Connect to specific node
```
### Extended Information
```python
# Get task creator
user = task.get_user()
print(f"Created by: {user.username} ({user.email})")
# Get instance details
instances = task.get_instances()
for inst in instances:
    print(f"Node {inst.instance_id}:")
    print(f"  Public IP: {inst.public_ip}")
    print(f"  Private IP: {inst.private_ip}")
    print(f"  Status: {inst.status}")
```
## Persistent Storage
### Volume Management
```python
from flow import Flow, TaskConfig

# Create volume (currently creates block storage)
with Flow() as client:
    vol = client.create_volume(size_gb=1000, name="datasets")

# Use in task
config = TaskConfig(
    name="training",
    instance_type="a100",
    command="python train.py",
    volumes=[{
        "volume_id": vol.volume_id,
        "mount_path": "/data"
    }]
)

# Or reference by name
config.volumes = [{
    "name": "datasets",
    "mount_path": "/data"
}]
```
**Note**: Flow SDK currently creates block storage volumes, which need to be formatted on first use. The underlying FCP platform also supports file shares (pre-formatted, multi-node accessible), but this is not yet exposed in the SDK.
### Docker Cache Optimization
```python
# Mount at Docker directory for layer caching
volumes=[{
    "name": "docker-cache",
    "size_gb": 50,
    "mount_path": "/var/lib/docker"
}]
```
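For context, a hedged sketch embedding that mount in a complete task (it reuses the volume spec above; the build-and-run command itself is illustrative):

```python
import flow
from flow import TaskConfig

# Image layers written to /var/lib/docker persist on the volume across tasks
config = TaskConfig(
    name="cached-build",
    instance_type="a100",
    command="docker build -t myimage . && docker run --gpus all myimage",
    volumes=[{
        "name": "docker-cache",
        "size_gb": 50,
        "mount_path": "/var/lib/docker"
    }]
)
task = flow.run(config)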
## Zero-Import Remote Execution
Flow SDK's `invoke()` function lets you run Python functions on GPUs without modifying your code:
### The Invoker Pattern
```python
# train.py - Your existing code, no Flow imports needed
def train_model(data_path: str, epochs: int = 100):
    import torch
    model = torch.nn.Linear(10, 1)
    # ... training logic ...
    return {"accuracy": 0.95, "loss": 0.01}
```
```python
# runner.py - Execute remotely on GPU
from flow import invoke
result = invoke(
    "train.py",              # Python file
    "train_model",           # Function name
    args=["s3://data"],      # Arguments
    kwargs={"epochs": 200},  # Keyword arguments
    gpu="a100"               # GPU type
)
print(result) # {"accuracy": 0.95, "loss": 0.01}
```
### Why Use invoke()?
- **Zero contamination**: Keep ML code pure Python
- **Easy testing**: Run functions locally without changes
- **Flexible**: Any function, any module
- **Type safe**: JSON serialization ensures compatibility
See the [Invoker Pattern Guide](docs/INVOKER_PATTERN.md) for detailed documentation.
## Decorator Pattern
Flow SDK provides a decorator-based API similar to popular serverless frameworks:
### Basic Usage
```python
from flow import FlowApp
app = FlowApp()
@app.function(gpu="a100")
def train_model(data_path: str, epochs: int = 100):
    import torch
    model = torch.nn.Linear(10, 1)
    # ... training logic ...
    return {"accuracy": 0.95, "loss": 0.01}
# Execute remotely on GPU
result = train_model.remote("s3://data.csv", epochs=50)
# Execute locally for testing
local_result = train_model("./local_data.csv")
```
### Advanced Configuration
```python
@app.function(
    gpu="h100:8",  # 8x H100 GPUs
    image="pytorch/pytorch:2.0.0",
    volumes={"/data": "training-data"},
    env={"WANDB_API_KEY": "..."}
)
def distributed_training(config_path: str):
    # Multi-GPU training code
    return {"status": "completed"}
# Async execution
task_id = distributed_training.spawn("config.yaml")
```
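`spawn()` returns a task ID rather than blocking. A hedged way to check on it later is the documented `flow.get_task` handle (pairing `spawn()`'s return value with `get_task` is our assumption):

```python
import flow

# Hypothetical pairing: task_id comes from distributed_training.spawn(...) above
task = flow.get_task(task_id)
task.wait(timeout=3600)    # Block until the remote run finishes
print(task.logs(tail=20))  # Inspect the last lines of output
```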
### Module-Level Usage
```python
from flow import function
# Use without creating an app instance
@function(gpu="a100")
def inference(text: str) -> dict:
    # Run inference
    return {"sentiment": "positive"}
```
The decorator pattern provides:
- **Clean syntax**: Familiar to Flask/FastAPI users
- **Local testing**: Call functions directly without infrastructure
- **Type safety**: Full IDE support and type hints
- **Flexibility**: Mix local and remote execution seamlessly
## Data Mounting
Flow SDK provides seamless data access from S3 and volumes through the Flow client API:
### Quick Start
```python
# Mount S3 dataset
from flow import Flow
with Flow() as client:
    task = client.submit(
        "python train.py --data /data",
        gpu="a100",
        mounts="s3://my-bucket/datasets/imagenet"
    )

# Mount multiple sources
with Flow() as client:
    task = client.submit(
        "python train.py",
        gpu="a100:4",
        mounts={
            "/datasets": "s3://ml-bucket/imagenet",
            "/models": "volume://pretrained-models",  # Auto-creates if missing
            "/outputs": "volume://training-outputs"
        }
    )
```
### Supported Sources
- **S3**: Read-only access via s3fs (`s3://bucket/path`)
  - Requires AWS credentials in environment
  - Cached locally for performance
- **Volumes**: Persistent read-write storage (`volume://name`)
  - Auto-creates with 100GB if not found
  - High-performance NVMe storage
### Example: Training Pipeline
```python
# Set AWS credentials (from secure source)
import os
from flow import Flow

os.environ["AWS_ACCESS_KEY_ID"] = get_secret("aws_key")
os.environ["AWS_SECRET_ACCESS_KEY"] = get_secret("aws_secret")

# Submit training with data mounting
with Flow() as client:
    task = client.submit(
        """
        python train.py \\
            --data /datasets/train \\
            --validation /datasets/val \\
            --output /outputs
        """,
        gpu="a100:8",
        mounts={
            "/datasets": "s3://ml-datasets/imagenet",
            "/outputs": "volume://experiment-results"
        }
    )
```
See the [Data Mounting Guide](docs/DATA_MOUNTING_GUIDE.md) for detailed documentation.
## Distributed Training
### Single-Node Multi-GPU (Recommended)
```python
config = TaskConfig(
    name="distributed-training",
    instance_type="8xa100",  # 8x A100 GPUs on single node
    command="torchrun --nproc_per_node=8 --standalone train.py"
)
```
### Multi-Node Training
For multi-node training, explicitly set coordination environment variables:
```python
config = TaskConfig(
    name="multi-node-training",
    instance_type="8xa100",
    num_instances=4,  # 32 GPUs total
    env={
        "FLOW_NODE_RANK": "0",      # Set per node: 0, 1, 2, 3
        "FLOW_NUM_NODES": "4",
        "FLOW_MAIN_IP": "10.0.0.1"  # IP of rank 0 node
    },
    command=[
        "torchrun",
        "--nproc_per_node=8",
        "--nnodes=4",
        "--node_rank=$FLOW_NODE_RANK",
        "--master_addr=$FLOW_MAIN_IP",
        "--master_port=29500",
        "train.py"
    ]
)
```
## Advanced Features
### Cost Optimization
```python
# Use spot instances with price cap
config = TaskConfig(
    name="experiment",
    instance_type="a100",
    max_price_per_hour=5.0,  # Use spot if available
    max_run_time_hours=12.0  # Prevent runaway costs
)
```
### Environment Setup
```python
# Custom container
config.image = "pytorch/pytorch:2.0.0-cuda11.8-cudnn8"
# Environment variables
config.env = {
    "WANDB_API_KEY": "...",
    "HF_TOKEN": "...",
    "CUDA_VISIBLE_DEVICES": "0,1,2,3"
}
# Working directory
config.working_dir = "/workspace"
```
### Data Access
```python
# S3 integration
config = TaskConfig(
    name="s3-processing",
    instance_type="a100",
    command="python process.py",
    env={
        "AWS_ACCESS_KEY_ID": "...",
        "AWS_SECRET_ACCESS_KEY": "..."
    }
)

# Or use mounts parameter (simplified API)
with Flow() as client:
    task = client.submit(
        "python analyze.py",
        gpu="a100",
        mounts={
            "/input": "s3://my-bucket/data/",
            "/output": "volume://results"
        }
    )
```
## Error Handling
Flow provides structured errors with recovery guidance:
```python
from flow.errors import (
    FlowError,
    AuthenticationError,
    ResourceNotFoundError,
    ValidationError,
    QuotaExceededError
)

try:
    task = flow.run(config)
except ValidationError as e:
    print(f"Configuration error: {e.message}")
    for suggestion in e.suggestions:
        print(f"  - {suggestion}")
except QuotaExceededError as e:
    print(f"Quota exceeded: {e.message}")
    print("Suggestions:", e.suggestions)
except FlowError as e:
    print(f"Error: {e}")
```
## Common Patterns
### Interactive Development
#### Google Colab Integration
Connect Google Colab notebooks to Flow GPU instances:
```bash
# Launch GPU instance configured for Colab
uv run flow colab connect --instance-type a100 --hours 4
# You'll receive:
# 1. SSH tunnel command to run locally
# 2. Connection URL for Colab
```
Then in Google Colab:
1. Go to Runtime → Connect to local runtime
2. Paste the connection URL
3. Click Connect
Your Colab notebook now runs on Flow GPU infrastructure!
#### Direct Jupyter Notebooks
Run Jupyter directly on Flow instances:
```python
# Launch Jupyter server
config = TaskConfig(
    name="notebook",
    instance_type="a100",
    command="jupyter lab --ip=0.0.0.0 --no-browser",
    ports=[8888],
    max_run_time_hours=8.0
)
task = flow.run(config)
print(f"Access at: {task.endpoints['jupyter']}")
```
### Checkpointing
```python
# Resume training from checkpoint
config = TaskConfig(
    name="resume-training",
    instance_type="a100",
    command="python train.py --resume",
    volumes=[{
        "name": "checkpoints",
        "mount_path": "/checkpoints"
    }]
)
```
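On the training side, a minimal sketch of what `--resume` might do against the mounted volume (the checkpoint path and flag handling are illustrative, not part of the Flow API):

```python
# train.py (sketch)
import argparse
import os
import torch

parser = argparse.ArgumentParser()
parser.add_argument("--resume", action="store_true")
args = parser.parse_args()

model = torch.nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
start_epoch = 0

ckpt = "/checkpoints/latest.pt"  # Lives on the persistent volume
if args.resume and os.path.exists(ckpt):
    state = torch.load(ckpt)
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    start_epoch = state["epoch"] + 1

for epoch in range(start_epoch, 100):
    # ... training step ...
    torch.save({"model": model.state_dict(),
                "optimizer": optimizer.state_dict(),
                "epoch": epoch}, ckpt)
```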
### Experiment Sweep
```python
# Run multiple experiments
for lr in [0.001, 0.01, 0.1]:
    config = TaskConfig(
        name=f"exp-lr-{lr}",
        instance_type="a100",
        command=f"python train.py --lr {lr}",
        env={"WANDB_RUN_NAME": f"lr_{lr}"}
    )
    flow.run(config)
```
## Architecture
Flow SDK follows Domain-Driven Design with clear boundaries:
### High-Level Overview
```
┌─────────────────────────────────────────────┐
│            User Interface Layer             │
│           (Python API, CLI, YAML)           │
├─────────────────────────────────────────────┤
│              Core Domain Layer              │
│      (TaskConfig, Task, Volume models)      │
├─────────────────────────────────────────────┤
│         Provider Abstraction Layer          │
│            (IProvider Protocol)             │
├─────────────────────────────────────────────┤
│          Provider Implementations           │
│       (FCP, AWS, GCP, Azure - future)       │
└─────────────────────────────────────────────┘
```
### Key Components
- **Flow SDK** (`src/flow/`): High-level Python SDK for ML/AI workloads
- **Mithril CLI** (`mithril/`): Low-level IaaS control following Unix philosophy
- **Provider Abstraction**: Cloud-agnostic interface for multi-cloud support
### Current Provider Support
**FCP (ML Foundry)** - Production Ready
- Ubuntu 22.04 environment with bash
- 10KB startup script limit
- Spot instances with preemption handling
- Block storage volumes (file shares available in some regions)
- See [FCP provider documentation](src/flow/providers/fcp/README.md) for implementation details
**AWS, GCP, Azure** - Planned
- Provider abstraction designed for multi-cloud
- Contributions welcome
### Additional Documentation
- [Architecture Overview](docs/ARCHITECTURE.md) - System design and concepts
- [FCP Provider Details](src/flow/providers/fcp/README.md) - Provider-specific implementation
- [Colab Troubleshooting](docs/COLAB_TROUBLESHOOTING.md) - Colab setup guide
- [Configuration Guide](docs/CONFIGURATION.md) - Configuration options
- [Data Handling](docs/DATA_HANDLING.md) - Data management patterns
### Example Code
- [Verify Instance Setup](examples/01_verify_instance.py) - Basic GPU verification
- [Jupyter Server](examples/02_jupyter_server.py) - Launch Jupyter on GPU
- [Multi-Node Training](examples/03_multi_node_training.py) - Distributed training setup
- [S3 Data Access](examples/04_s3_data_access.py) - Cloud storage integration
- [More Examples](examples/) - Additional usage patterns
## Performance
- **Cold start**: 10-15 minutes (instance provisioning on FCP)
- **Warm start**: 30-60 seconds (pre-allocated pool; this is a pending feature, so let FCP know if you're interested)
## Troubleshooting
### Common Errors
**Authentication Failed**
```
Error: Invalid API key
```
Solution: Run `flow init` and ensure your API key is correct. Get a new key at [app.mlfoundry.com](https://app.mlfoundry.com/account/apikeys).
**No Available Instances**
```
Error: No instances available for type 'a100'
```
Solution: Try a different region or instance type. Check availability with `flow status`.
**Quota Exceeded**
```
Error: GPU quota exceeded in region us-east-1
```
Solution: Try a different region or contact support for quota increase.
**Invalid Instance Type**
```
ValidationError: Invalid instance type 'a100x8'
```
Solution: Use correct format: `8xa100` (not `a100x8`). See [Instance Types](#instance-types).
**Task Timeout**
```
Error: Task exceeded max_run_time_hours limit
```
Solution: Increase `max_run_time_hours` in your config or optimize your code.
**File Not Found**
```
python: can't open file 'train.py': No such file or directory
```
Solution: Ensure `upload_code=True` (default) or that your file exists in the Docker image.
**Module Not Found**
```
ModuleNotFoundError: No module named 'torch'
```
Solution: Install dependencies first: `flow.run("pip install torch && python train.py")`. See [Handling Dependencies](#handling-dependencies).
**Upload Size Limit**
```
Error: Project size (15.2MB) exceeds limit (10MB)
```
Note: Files are automatically compressed (gzip), but the 10MB limit applies after compression.
Solutions (in order of preference):
1. **Use .flowignore** to exclude unnecessary files (models, datasets, caches)
2. **Clone from Git**:
   ```python
   flow.run("git clone https://github.com/myorg/myrepo.git . && python train.py",
            instance_type="a100", upload_code=False)
   ```
3. **Pre-built Docker image** with your code:
   ```python
   flow.run("python /app/train.py", instance_type="a100",
            image="myorg/myapp:latest", upload_code=False)
   ```
4. **Download from S3/GCS**:
   ```python
   flow.run("aws s3 cp s3://mybucket/code.tar.gz . && tar -xzf code.tar.gz && python train.py",
            instance_type="a100", upload_code=False)
   ```
5. **Mount code via volume** (for development):
   ```python
   # First upload to a volume manually, then:
   flow.run("python /code/train.py", instance_type="a100",
            volumes=[{"name": "my-code", "mount_path": "/code"}],
            upload_code=False)
   ```
   Note: Volumes are empty by default. You must manually populate them first (e.g., via git clone or rsync).
## Support
- **Issues**: [GitHub Issues](https://github.com/foundrytechnologies/flow-v2/issues)
- **Email**: support@mlfoundry.com
## License
Apache License 2.0 - see [LICENSE.txt](LICENSE.txt)