| Field | Value |
|-------|-------|
| Name | ml-dash |
| Version | 0.4.0 |
| home_page | None |
| Summary | Add your description here |
| upload_time | 2025-10-25 19:56:03 |
| maintainer | None |
| docs_url | None |
| author | Ge Yang |
| requires_python | >=3.12 |
| license | None |
| keywords | None |
| requirements | None recorded |
# ML-Logger API Documentation
**Version:** 0.1.0
ML-Logger is a minimal experiment-tracking library for machine learning. It provides a simple API for logging parameters, metrics, files, and structured logs during your ML experiments.
## Table of Contents
1. [Installation](#installation)
2. [Quick Start](#quick-start)
3. [Core Concepts](#core-concepts)
4. [API Reference](#api-reference)
- [Experiment](#experiment)
- [Parameters](#parameters)
- [Metrics](#metrics)
- [Logs](#logs)
- [Files](#files)
5. [Usage Patterns](#usage-patterns)
6. [Examples](#examples)
7. [Remote Backend](#remote-backend)
8. [Best Practices](#best-practices)
---
## Installation
```bash
# Using pip
pip install -i https://test.pypi.org/simple/ ml-logger-beta
```
---
## Quick Start
### Basic Example
```python
from ml_dash import Experiment

# Create an experiment
exp = Experiment(
    namespace="alice",       # Your username or team
    workspace="my-project",  # Project name
    prefix="experiment-1"    # Experiment name
)

# Start tracking
with exp.run():
    # Log parameters
    exp.params.set(
        learning_rate=0.001,
        batch_size=32,
        epochs=100
    )

    # Log metrics
    for epoch in range(100):
        exp.metrics.log(
            step=epoch,
            loss=0.5 - epoch * 0.01,
            accuracy=0.5 + epoch * 0.005
        )

    # Log messages
    exp.info("Training completed!")

    # Save files
    exp.files.save({"final": "results"}, "results.json")
```
This creates a local directory structure:
```
.ml-logger/
└── alice/
    └── my-project/
        └── experiment-1/
            ├── .ml-logger.meta.json
            ├── parameters.jsonl
            ├── metrics.jsonl
            ├── logs.jsonl
            └── files/
                └── results.json
```
---
## Core Concepts
### 1. **Namespace**
Your username or organization (e.g., `"alice"`, `"research-team"`)
### 2. **Workspace**
Project or research area (e.g., `"image-classification"`, `"nlp-experiments"`)
### 3. **Prefix**
Unique experiment name (e.g., `"resnet50-run-001"`)
### 4. **Directory** (Optional)
Hierarchical organization within workspace (e.g., `"models/resnet/cifar10"`)
### 5. **Local or Remote**
Everything is saved locally by default. Remote sync is optional.
### 6. **Append-Only**
All data is written in append-only JSONL format for crash safety.
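As a rough sketch of what append-only writing looks like, here is illustrative standard-library code (not ml-dash's actual implementation; the filename just mirrors the layout above). Because each record is a self-contained line, a crash can corrupt at most the final partial line:

```python
import json
import time

def append_jsonl(path, record):
    # One JSON object per line; flushing after each write means a crash
    # loses at most the last partial line, never earlier records.
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")
        f.flush()

append_jsonl("metrics.jsonl", {"timestamp": time.time(), "step": 0, "metrics": {"loss": 0.5}})
```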
---
## API Reference
### Experiment
The main class for experiment tracking.
#### Constructor
```python
Experiment(
    namespace: str,                  # Required: User/team namespace
    workspace: str,                  # Required: Project workspace
    prefix: str,                     # Required: Experiment name
    remote: str = None,              # Optional: Remote server URL
    local_root: str = ".ml-logger",  # Local storage directory
    directory: str = None,           # Optional: Subdirectory path
    readme: str = None,              # Optional: Description
    experiment_id: str = None,       # Optional: Server experiment ID
)
```
**Example:**
```python
exp = Experiment(
    namespace="alice",
    workspace="vision",
    prefix="resnet-experiment",
    directory="image-classification/cifar10",
    readme="ResNet50 transfer learning on CIFAR-10",
    remote="https://qwqdug4btp.us-east-1.awsapprunner.com"  # Optional remote server
)
```
#### Methods
##### `run(func=None)`
Mark experiment as running. Supports 3 patterns:
**Pattern 1: Direct Call**
```python
exp.run()
# ... your training code ...
exp.complete()
```
**Pattern 2: Context Manager (Recommended)**
```python
with exp.run():
    # ... your training code ...
    # Automatically calls complete() on success
    # Automatically calls fail() on exception
    ...
```
**Pattern 3: Decorator**
```python
@exp.run
def train():
    ...  # your training code

train()
```
##### `complete()`
Mark experiment as completed.
```python
exp.complete()
```
##### `fail(error: str)`
Mark experiment as failed with error message.
```python
try:
    ...  # training code
except Exception as e:
    exp.fail(str(e))
    raise
```
##### Logging Convenience Methods
```python
exp.info(message: str, **context)     # Log info message
exp.warning(message: str, **context)  # Log warning message
exp.error(message: str, **context)    # Log error message
exp.debug(message: str, **context)    # Log debug message
```
**Example:**
```python
exp.info("Epoch completed", epoch=5, loss=0.3, accuracy=0.85)
exp.warning("Memory usage high", usage_gb=15.2)
exp.error("Training failed", error="CUDA out of memory")
```
#### Properties
```python
exp.namespace # str: Namespace
exp.workspace # str: Workspace
exp.prefix # str: Experiment prefix
exp.directory # str | None: Directory path
exp.remote # str | None: Remote server URL
exp.experiment_id # str | None: Server experiment ID
exp.run_id # str | None: Server run ID
```
#### Components
```python
exp.params # ParameterManager
exp.metrics # MetricsLogger
exp.files # FileManager
exp.logs # LogManager
```
---
### Parameters
Manages experiment parameters (hyperparameters, config, etc.)
#### Methods
##### `set(**kwargs)`
Set parameters (replaces existing).
```python
exp.params.set(
    learning_rate=0.001,
    batch_size=32,
    optimizer="adam",
    model={
        "layers": 50,
        "dropout": 0.2
    }
)
```
##### `extend(**kwargs)`
Extend parameters (deep merge with existing).
```python
# First call
exp.params.set(model={"layers": 50})
# Extend (merges with existing)
exp.params.extend(model={"dropout": 0.2})
# Result: {"model": {"layers": 50, "dropout": 0.2}}
```
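The deep-merge behavior can be sketched in plain Python (illustrative only, not the library's implementation): nested dicts are merged key by key, while other values are replaced.

```python
def deep_merge(base, update):
    # Recursively merge `update` into a copy of `base`; nested dicts
    # are merged key by key, any other value is overwritten.
    merged = dict(base)
    for key, value in update.items():
        if isinstance(merged.get(key), dict) and isinstance(value, dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged

params = deep_merge({"model": {"layers": 50}}, {"model": {"dropout": 0.2}})
print(params)  # {'model': {'layers': 50, 'dropout': 0.2}}
```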
##### `update(key: str, value: Any)`
Update a single parameter (supports dot notation).
```python
exp.params.update("model.layers", 100)
exp.params.update("learning_rate", 0.0001)
```
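A dot-notation update amounts to walking (and creating, if needed) nested dicts along the key path. A hypothetical helper shows the idea; `set_by_path` is not part of the ml-dash API:

```python
def set_by_path(params, dotted_key, value):
    # Walk/create nested dicts along the dotted path, then set the leaf.
    node = params
    *parents, leaf = dotted_key.split(".")
    for key in parents:
        node = node.setdefault(key, {})
    node[leaf] = value

cfg = {"model": {"layers": 50}}
set_by_path(cfg, "model.layers", 100)
print(cfg)  # {'model': {'layers': 100}}
```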
##### `read() -> dict`
Read current parameters.
```python
params = exp.params.read()
print(params["learning_rate"]) # 0.001
```
##### `log(**kwargs)`
Alias for `set()` (for API consistency).
```python
exp.params.log(batch_size=64)
```
---
### Metrics
Logs time-series metrics with optional namespacing.
#### Methods
##### `log(step=None, **metrics)`
Log metrics immediately.
```python
# Simple logging
exp.metrics.log(step=1, loss=0.5, accuracy=0.8)

# Multiple metrics at once
exp.metrics.log(
    step=10,
    train_loss=0.3,
    val_loss=0.4,
    train_acc=0.85,
    val_acc=0.82
)

# Without step (uses timestamp only)
exp.metrics.log(gpu_memory=8.5, cpu_usage=45.2)
```
##### `collect(step=None, **metrics)`
Collect metrics for later aggregation (useful for batch-level logging).
```python
for batch in train_loader:
    loss = train_batch(batch)

    # Collect batch metrics (not logged yet)
    exp.metrics.collect(loss=loss.item(), accuracy=acc.item())

# Aggregate and log after epoch
exp.metrics.flush(_aggregation="mean", step=epoch)
```
##### `flush(_aggregation="mean", step=None, **additional_metrics)`
Flush collected metrics with aggregation.
**Aggregation methods:**
- `"mean"` - Average of collected values (default)
- `"sum"` - Sum of collected values
- `"min"` - Minimum value
- `"max"` - Maximum value
- `"last"` - Last value
```python
# Collect during training
for batch in batches:
    metrics.collect(loss=loss, accuracy=acc)

# Flush with mean aggregation
metrics.flush(_aggregation="mean", step=epoch, learning_rate=lr)

# Flush with max aggregation
metrics.flush(_aggregation="max", step=epoch)
```
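Each aggregation mode reduces the list of collected values to a single number per metric. A plain-Python sketch of the assumed semantics (not the library's actual code):

```python
def aggregate(values, how="mean"):
    # Reduce a list of collected values to one number, mirroring
    # the aggregation modes listed above.
    if how == "mean":
        return sum(values) / len(values)
    if how == "sum":
        return sum(values)
    if how == "min":
        return min(values)
    if how == "max":
        return max(values)
    if how == "last":
        return values[-1]
    raise ValueError(f"unknown aggregation: {how}")

collected = [2.0, 4.0, 6.0]
print(aggregate(collected, "mean"))  # 4.0
print(aggregate(collected, "max"))   # 6.0
```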
##### Namespacing: `__call__(namespace: str)`
Create a namespaced metrics logger.
```python
# Create namespaced loggers
train_metrics = exp.metrics("train")
val_metrics = exp.metrics("val")
# Log to different namespaces
train_metrics.log(step=1, loss=0.5, accuracy=0.8)
val_metrics.log(step=1, loss=0.6, accuracy=0.75)
# Results in metrics named: "train.loss", "train.accuracy", "val.loss", "val.accuracy"
```
##### `read() -> list`
Read all logged metrics.
```python
metrics_data = exp.metrics.read()
for entry in metrics_data:
    print(entry["step"], entry["metrics"])
```
---
### Logs
Structured text logging with levels and context.
#### Methods
##### `log(message: str, level: str = "INFO", **context)`
Log a message with level and context.
```python
exp.logs.log("Training started", level="INFO", epoch=0, lr=0.001)
```
##### Level-Specific Methods
```python
exp.logs.info(message: str, **context)     # INFO level
exp.logs.warning(message: str, **context)  # WARNING level
exp.logs.error(message: str, **context)    # ERROR level
exp.logs.debug(message: str, **context)    # DEBUG level
```
**Examples:**
```python
# Info log
exp.logs.info("Epoch started", epoch=5, batches=100)
# Warning log
exp.logs.warning("High memory usage", memory_gb=14.5, threshold_gb=16.0)
# Error log
exp.logs.error("Training failed", error="CUDA OOM", batch_size=128)
# Debug log
exp.logs.debug("Gradient norm", grad_norm=2.3, step=1000)
```
##### `read() -> list`
Read all logs.
```python
logs = exp.logs.read()
for log_entry in logs:
    print(f"[{log_entry['level']}] {log_entry['message']}")
    if 'context' in log_entry:
        print(f"  Context: {log_entry['context']}")
```
---
### Files
Manages file storage with auto-format detection.
#### Methods
##### `save(data: Any, filename: str)`
Save data with automatic format detection.
**Supported formats:**
- `.json` - JSON files
- `.pkl`, `.pickle` - Pickle files
- `.pt`, `.pth` - PyTorch tensors/models
- `.npy`, `.npz` - NumPy arrays
- Other extensions - Raw bytes or fallback to pickle
```python
# JSON
exp.files.save({"results": [1, 2, 3]}, "results.json")
# PyTorch model
exp.files.save(model.state_dict(), "model.pt")
# NumPy array
exp.files.save(numpy_array, "embeddings.npy")
# Pickle
exp.files.save(custom_object, "object.pkl")
# Raw bytes
exp.files.save(b"binary data", "data.bin")
# Text
exp.files.save("text content", "notes.txt")
```
##### `save_pkl(data: Any, filename: str)`
Save as pickle (automatically adds .pkl extension).
```python
exp.files.save_pkl(complex_object, "checkpoint")
# Saves as "checkpoint.pkl"
```
##### `load(filename: str) -> Any`
Load data with automatic format detection.
```python
# JSON
results = exp.files.load("results.json")
# PyTorch
state_dict = exp.files.load("model.pt")
# NumPy
array = exp.files.load("embeddings.npy")
# Pickle
obj = exp.files.load("object.pkl")
```
##### `load_torch(filename: str) -> Any`
Load PyTorch checkpoint (adds .pt extension if missing).
```python
checkpoint = exp.files.load_torch("best_model")
# Loads "best_model.pt"
```
##### Namespacing: `__call__(namespace: str)`
Create a namespaced file manager.
```python
# Create namespaced file managers
checkpoints = exp.files("checkpoints")
configs = exp.files("configs")
# Save to different directories
checkpoints.save(model.state_dict(), "epoch_10.pt")
# Saves to: files/checkpoints/epoch_10.pt
configs.save(config, "training.json")
# Saves to: files/configs/training.json
```
##### `exists(filename: str) -> bool`
Check if file exists.
```python
if exp.files.exists("checkpoint.pt"):
    model.load_state_dict(exp.files.load("checkpoint.pt"))
```
##### `list() -> list`
List files in current namespace.
```python
files = exp.files.list()
print(f"Files: {files}")
# With namespace
checkpoint_files = exp.files("checkpoints").list()
```
---
## Usage Patterns
### Pattern 1: Simple Training Loop
```python
from ml_dash import Experiment

exp = Experiment(
    namespace="alice",
    workspace="mnist",
    prefix="simple-cnn"
)

with exp.run():
    # Log hyperparameters
    exp.params.set(lr=0.001, epochs=10, batch_size=32)

    # Training loop
    for epoch in range(10):
        train_loss = train_one_epoch()
        val_loss = validate()

        exp.metrics.log(
            step=epoch,
            train_loss=train_loss,
            val_loss=val_loss
        )

    # Save model
    exp.files.save(model.state_dict(), "final_model.pt")
```
### Pattern 2: Batch-Level Metrics with Aggregation
```python
with exp.run():
    exp.params.set(lr=0.001, batch_size=128)

    for epoch in range(100):
        # Collect batch-level metrics
        for batch in train_loader:
            loss, acc = train_step(batch)
            exp.metrics.collect(loss=loss, accuracy=acc)

        # Aggregate and log epoch metrics
        exp.metrics.flush(_aggregation="mean", step=epoch)
```
### Pattern 3: Separate Train/Val Metrics
```python
with exp.run():
    # Create namespaced loggers
    train_metrics = exp.metrics("train")
    val_metrics = exp.metrics("val")

    for epoch in range(100):
        # Training phase
        for batch in train_loader:
            loss, acc = train_step(batch)
            train_metrics.collect(loss=loss, accuracy=acc)
        train_metrics.flush(_aggregation="mean", step=epoch)

        # Validation phase
        val_loss, val_acc = validate()
        val_metrics.log(step=epoch, loss=val_loss, accuracy=val_acc)
```
### Pattern 4: Checkpoint Management
```python
with exp.run():
    checkpoints = exp.files("checkpoints")
    best_val_loss = float('inf')

    for epoch in range(100):
        train_loss = train()
        val_loss = validate()

        # Save regular checkpoint
        if epoch % 10 == 0:
            checkpoints.save(model.state_dict(), f"epoch_{epoch}.pt")

        # Save best model
        if val_loss < best_val_loss:
            best_val_loss = val_loss
            checkpoints.save({
                "epoch": epoch,
                "model": model.state_dict(),
                "optimizer": optimizer.state_dict(),
                "val_loss": val_loss
            }, "best_model.pt")

    # Save final model
    checkpoints.save(model.state_dict(), "final_model.pt")
```
### Pattern 5: Hierarchical Organization
```python
# Organize experiments in a directory hierarchy
exp = Experiment(
    namespace="alice",
    workspace="vision",
    prefix="run-001",
    directory="image-classification/resnet50/cifar10"
)
# Creates: .ml-logger/alice/vision/image-classification/resnet50/cifar10/run-001/
```
---
## Examples
### Example 1: Basic MNIST Training
```python
from ml_dash import Experiment
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# Create experiment
exp = Experiment(
    namespace="alice",
    workspace="mnist",
    prefix="basic-cnn-001"
)

# Define model
model = nn.Sequential(
    nn.Conv2d(1, 32, 3),
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(32, 64, 3),
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(1600, 128),
    nn.ReLU(),
    nn.Linear(128, 10)
)

# Training
with exp.run():
    # Log configuration
    exp.params.set(
        learning_rate=0.001,
        batch_size=64,
        epochs=10,
        optimizer="adam"
    )

    # Setup
    train_loader = DataLoader(
        datasets.MNIST('../data', train=True, download=True,
                       transform=transforms.ToTensor()),
        batch_size=64, shuffle=True
    )
    optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
    criterion = nn.CrossEntropyLoss()

    # Training loop
    for epoch in range(10):
        total_loss = 0
        correct = 0
        total = 0

        for data, target in train_loader:
            optimizer.zero_grad()
            output = model(data)
            loss = criterion(output, target)
            loss.backward()
            optimizer.step()

            total_loss += loss.item()
            pred = output.argmax(dim=1)
            correct += (pred == target).sum().item()
            total += target.size(0)

        # Log epoch metrics
        avg_loss = total_loss / len(train_loader)
        accuracy = correct / total

        exp.metrics.log(
            step=epoch,
            loss=avg_loss,
            accuracy=accuracy
        )

        exp.info(f"Epoch {epoch}", loss=avg_loss, accuracy=accuracy)

    # Save final model
    exp.files.save(model.state_dict(), "model.pt")
    exp.info("Training completed!")
```
### Example 2: Transfer Learning with Checkpointing
```python
from ml_dash import Experiment
from torchvision import models
import torch.nn as nn

exp = Experiment(
    namespace="alice",
    workspace="vision",
    prefix="resnet-transfer-001",
    directory="transfer-learning/cifar10",
    readme="ResNet50 transfer learning on CIFAR-10"
)

with exp.run():
    # Configuration
    config = {
        "model": "resnet50",
        "pretrained": True,
        "num_classes": 10,
        "learning_rate": 0.001,
        "epochs": 50,
        "batch_size": 128,
        "early_stopping_patience": 10
    }
    exp.params.set(**config)

    # Load pretrained model
    model = models.resnet50(pretrained=True)
    model.fc = nn.Linear(model.fc.in_features, 10)

    # Create namespaced loggers
    train_metrics = exp.metrics("train")
    val_metrics = exp.metrics("val")
    checkpoints = exp.files("checkpoints")

    best_val_acc = 0.0
    patience_counter = 0

    for epoch in range(config["epochs"]):
        # Training phase
        model.train()
        for batch in train_loader:
            loss, acc = train_step(model, batch)
            train_metrics.collect(loss=loss, accuracy=acc)
        train_metrics.flush(_aggregation="mean", step=epoch)

        # Validation phase
        model.eval()
        val_loss, val_acc = validate(model, val_loader)
        val_metrics.log(step=epoch, loss=val_loss, accuracy=val_acc)

        # Checkpoint best model
        if val_acc > best_val_acc:
            best_val_acc = val_acc
            patience_counter = 0
            checkpoints.save({
                "epoch": epoch,
                "model_state_dict": model.state_dict(),
                "val_accuracy": val_acc,
                "config": config
            }, "best_model.pt")
            exp.info("New best model!", epoch=epoch, val_acc=val_acc)
        else:
            patience_counter += 1

        # Early stopping
        if patience_counter >= config["early_stopping_patience"]:
            exp.info("Early stopping", epoch=epoch)
            break

        # Regular checkpoint
        if epoch % 10 == 0:
            checkpoints.save(model.state_dict(), f"checkpoint_epoch_{epoch}.pt")

    # Save final summary
    exp.files.save({
        "best_val_accuracy": best_val_acc,
        "total_epochs": epoch + 1,
        "config": config
    }, "summary.json")

    exp.info("Training completed!", best_val_acc=best_val_acc)
```
### Example 3: Hyperparameter Sweep
```python
from ml_dash import Experiment

# Define hyperparameter grid
learning_rates = [0.001, 0.0001, 0.00001]
batch_sizes = [32, 64, 128]

for lr in learning_rates:
    for bs in batch_sizes:
        # Create unique experiment for each combination
        exp = Experiment(
            namespace="alice",
            workspace="hp-sweep",
            prefix=f"lr{lr}_bs{bs}",
            directory="mnist/grid-search"
        )

        with exp.run():
            # Log this combination
            exp.params.set(
                learning_rate=lr,
                batch_size=bs,
                model="simple-cnn"
            )

            # Train with these hyperparameters
            final_acc = train_model(lr, bs)

            # Log final result
            exp.metrics.log(step=0, final_accuracy=final_acc)
            exp.info("Sweep run completed", lr=lr, bs=bs, acc=final_acc)

print("Hyperparameter sweep completed!")
```
### Example 4: Multi-Stage Training
```python
from ml_dash import Experiment

exp = Experiment(
    namespace="alice",
    workspace="nlp",
    prefix="bert-finetuning-001",
    directory="transformers/bert/squad"
)

with exp.run():
    # Stage 1: Warmup
    exp.params.set(stage="warmup", lr=0.00001, epochs=5)
    exp.info("Starting warmup phase")

    warmup_metrics = exp.metrics("warmup")
    for epoch in range(5):
        loss = train_epoch(lr=0.00001)
        warmup_metrics.log(step=epoch, loss=loss)

    # Stage 2: Main training
    exp.params.extend(stage="main", lr=0.0001, epochs=20)
    exp.info("Starting main training phase")

    train_metrics = exp.metrics("train")
    val_metrics = exp.metrics("val")
    for epoch in range(20):
        train_loss = train_epoch(lr=0.0001)
        val_loss = validate()
        train_metrics.log(step=epoch, loss=train_loss)
        val_metrics.log(step=epoch, loss=val_loss)

    # Stage 3: Fine-tuning
    exp.params.extend(stage="finetune", lr=0.00001, epochs=10)
    exp.info("Starting fine-tuning phase")

    finetune_metrics = exp.metrics("finetune")
    for epoch in range(10):
        loss = train_epoch(lr=0.00001)
        finetune_metrics.log(step=epoch, loss=loss)

    exp.info("Multi-stage training completed!")
```
---
## Remote Backend
ML-Logger supports syncing to a remote server for team collaboration.
### Setup Remote Backend
```python
exp = Experiment(
    namespace="alice",
    workspace="shared-project",
    prefix="experiment-001",
    remote="http://qwqdug4btp.us-east-1.awsapprunner.com",  # Remote server URL
    readme="Shared experiment for team"
)
```
### How It Works
1. **Local or Remote**: Data can be saved locally or remotely
2. **Automatic Sync**: When `remote` is specified, data syncs to server
3. **Experiment Creation**: Server creates an experiment record
4. **Run Tracking**: Server tracks run status (RUNNING, COMPLETED, FAILED)
5. **GraphQL API**: Query experiments via GraphQL at `http://qwqdug4btp.us-east-1.awsapprunner.com/graphql`
### Environment Variables
Configure remote backend via environment:
```bash
export ML_LOGGER_REMOTE="http://qwqdug4btp.us-east-1.awsapprunner.com"
export ML_LOGGER_NAMESPACE="alice"
export ML_LOGGER_WORKSPACE="production"
```
```python
# Uses environment variables if not specified
exp = Experiment(prefix="my-experiment")
```
### Server Requirements
To use the remote backend, you need the dash-server running:
```bash
cd ml-dash/ml-dash-server
pnpm install
pnpm dev
```
The server will be available at `http://qwqdug4btp.us-east-1.awsapprunner.com`.
---
## Best Practices
### 1. **Use Context Manager**
Always use `with exp.run():` for automatic cleanup:
```python
# Good
with exp.run():
    train()

# Avoid
exp.run()
train()
exp.complete()  # Easy to forget!
```
### 2. **Namespace Metrics and Files**
Organize metrics and files with namespaces:
```python
train_metrics = exp.metrics("train")
val_metrics = exp.metrics("val")
test_metrics = exp.metrics("test")
checkpoints = exp.files("checkpoints")
configs = exp.files("configs")
visualizations = exp.files("plots")
```
### 3. **Use collect() + flush() for Batch Metrics**
For fine-grained batch logging with epoch aggregation:
```python
for epoch in range(epochs):
    for batch in batches:
        loss = train_batch(batch)
        exp.metrics.collect(loss=loss)

    # Log aggregated metrics once per epoch
    exp.metrics.flush(_aggregation="mean", step=epoch)
```
### 4. **Log Configuration Early**
Log all hyperparameters at the start:
```python
with exp.run():
    exp.params.set(
        model="resnet50",
        learning_rate=0.001,
        batch_size=128,
        epochs=100,
        optimizer="adam",
        dataset="cifar10"
    )
    # ... training ...
```
### 5. **Use Hierarchical Organization**
Organize experiments with directories:
```python
exp = Experiment(
    namespace="alice",
    workspace="vision",
    prefix="run-001",
    directory="models/resnet/cifar10"  # Hierarchical organization
)
```
### 6. **Add Context to Logs**
Make logs searchable with context:
```python
exp.info("Epoch completed",
         epoch=5,
         train_loss=0.3,
         val_loss=0.35,
         learning_rate=0.001)

exp.warning("High memory usage",
            memory_gb=14.5,
            available_gb=16.0,
            batch_size=128)
```
### 7. **Save Comprehensive Checkpoints**
Include all state needed for resumption:
```python
checkpoints.save({
    "epoch": epoch,
    "model_state_dict": model.state_dict(),
    "optimizer_state_dict": optimizer.state_dict(),
    "scheduler_state_dict": scheduler.state_dict(),
    "best_val_loss": best_val_loss,
    "config": config
}, "checkpoint.pt")
```
### 8. **Version Control Integration**
Log git information:
```python
import subprocess

git_hash = subprocess.check_output(
    ["git", "rev-parse", "HEAD"]
).decode().strip()

exp.params.set(
    git_commit=git_hash,
    git_branch="main"
)
```
### 9. **Error Handling**
The context manager handles errors automatically, but you can add custom handling:
```python
with exp.run():
    try:
        train()
    except RuntimeError as e:
        exp.error("Training error", error=str(e), device="cuda:0")
        raise
```
---
## File Format Details
### metrics.jsonl
```json
{"timestamp": 1234567890.123, "step": 0, "metrics": {"loss": 0.5, "accuracy": 0.8}}
{"timestamp": 1234567891.456, "step": 1, "metrics": {"loss": 0.4, "accuracy": 0.85}}
```
### parameters.jsonl
```json
{"timestamp": 1234567890.123, "operation": "set", "data": {"lr": 0.001, "batch_size": 32}}
{"timestamp": 1234567892.456, "operation": "update", "key": "lr", "value": 0.0001}
```
### logs.jsonl
```json
{"timestamp": 1234567890.123, "level": "INFO", "message": "Training started", "context": {"epoch": 0}}
{"timestamp": 1234567891.456, "level": "WARNING", "message": "High memory", "context": {"memory_gb": 14.5}}
```
### .ml-logger.meta.json
```json
{
  "namespace": "alice",
  "workspace": "vision",
  "prefix": "experiment-1",
  "status": "completed",
  "started_at": 1234567890.123,
  "completed_at": 1234567900.456,
  "readme": "ResNet experiment",
  "experiment_id": "exp_123",
  "hostname": "gpu-server-01"
}
```
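Because every file is plain JSONL, records can be inspected without ml-dash at all. A minimal standard-library sketch (`read_jsonl` is a hypothetical helper, not part of the API):

```python
import json

def read_jsonl(path):
    # One JSON object per line; skip blank lines.
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]

# e.g. pull a loss curve straight out of metrics.jsonl:
# entries = read_jsonl(".ml-logger/alice/my-project/experiment-1/metrics.jsonl")
# losses = [(e["step"], e["metrics"]["loss"]) for e in entries]
```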
---
## Troubleshooting
### Issue: Remote server connection failed
```bash
# Warning: Failed to initialize experiment on remote server
# Solution: check that the dash-server is running
cd ml-dash/dash-server && pnpm dev
```
### Issue: File not found when loading
```python
# Check if file exists first
if exp.files.exists("model.pt"):
    model.load_state_dict(exp.files.load("model.pt"))
else:
    print("Checkpoint not found")
```
### Issue: Metrics not aggregating correctly
```python
# Make sure to call flush() after collect()
for batch in batches:
    metrics.collect(loss=loss)

metrics.flush(_aggregation="mean", step=epoch)  # Don't forget this!
```
---
## API Summary
| Component | Key Methods | Purpose |
|----------------|---------------------------------------------|-----------------------------|
| **Experiment** | `run()`, `complete()`, `fail()`, `info()` | Manage experiment lifecycle |
| **Parameters** | `set()`, `extend()`, `update()`, `read()` | Store configuration |
| **Metrics** | `log()`, `collect()`, `flush()`, `read()` | Track time-series metrics |
| **Logs** | `info()`, `warning()`, `error()`, `debug()` | Structured logging |
| **Files** | `save()`, `load()`, `exists()`, `list()` | File management |
---
## Additional Resources
- **GitHub**: https://github.com/vuer-ai/vuer-dashboard
- **Examples**: See `ml-logger/examples/` directory
- **Tests**: See `ml-logger/tests/` for usage examples
- **Dashboard**: http://qwqdug4btp.us-east-1.awsapprunner.com (when dash-server is running)
---
## Contributing & Development
### Development Setup
1. **Clone the repository**:
```bash
git clone https://github.com/vuer-ai/vuer-dashboard.git
cd vuer-dashboard/ml-logger
```
2. **Install with development dependencies**:
```bash
# Install all dev dependencies (includes testing, docs, and torch)
uv sync --extra dev
```
### Running Tests
```bash
# Run all tests
uv run pytest
# Run with coverage
uv run pytest --cov=ml_dash --cov-report=html
# Run specific test file
uv run pytest tests/test_backends.py
```
### Building Documentation
```bash
# Build HTML documentation
cd docs && make html
# Serve docs with live reload (auto-refreshes on file changes)
cd docs && make serve
# Clean build artifacts
cd docs && make clean
```
The built documentation will be in `docs/_build/html/`. The `make serve` command starts a local server at `http://localhost:8000` with automatic rebuilding on file changes.
### Linting and Code Checks
```bash
# Format code
uv run ruff format .
# Lint code
uv run ruff check .
# Fix auto-fixable issues
uv run ruff check --fix .
```
### Project Structure
```
ml-logger/
├── src/ml_logger_beta/          # Main package source
│   ├── __init__.py              # Package exports
│   ├── run.py                   # Experiment class
│   ├── ml_logger.py             # ML_Logger class
│   ├── job_logger.py            # JobLogger class
│   ├── backends/                # Storage backends
│   │   ├── base.py              # Base backend interface
│   │   ├── local_backend.py     # Local filesystem backend
│   │   └── dash_backend.py      # Remote server backend
│   └── components/              # Component managers
│       ├── parameters.py        # Parameter management
│       ├── metrics.py           # Metrics logging
│       ├── logs.py              # Structured logging
│       └── files.py             # File management
├── tests/                       # Test suite
├── docs/                        # Sphinx documentation
├── pyproject.toml               # Package configuration
└── README.md                    # This file
```
### Dependency Structure
The project uses a simplified dependency structure:
- **`dependencies`**: Core runtime dependencies (always installed)
- `msgpack`, `numpy`, `requests`
- **`dev`**: All development dependencies
- Linting and formatting: `ruff`
- Testing: `pytest`, `pytest-cov`, `pytest-asyncio`
- Documentation: `sphinx`, `furo`, `myst-parser`, `sphinx-copybutton`, `sphinx-autobuild`
- Optional features: `torch` (for saving/loading .pt/.pth files)
### Making Changes
1. Create a new branch for your changes
2. Make your modifications
3. Run tests to ensure everything works: `uv run pytest`
4. Run linting: `uv run ruff check .`
5. Format code: `uv run ruff format .`
6. Update documentation if needed
7. Submit a pull request
### Building and Publishing
```bash
# Build the package
uv build
# Publish to PyPI (requires credentials)
uv publish
```
### Tips for Contributors
- Follow the existing code style (enforced by ruff)
- Add tests for new features
- Update documentation for API changes
- Use type hints where appropriate
- Keep functions focused and modular
- Write descriptive commit messages
---
**Happy Experimenting!** 🚀
Raw data
{
"_id": null,
"home_page": null,
"name": "ml-dash",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.12",
"maintainer_email": null,
"keywords": null,
"author": "Ge Yang",
"author_email": "Ge Yang <ge.ike.yang@gmail.com>",
"download_url": "https://files.pythonhosted.org/packages/10/31/9d5ccf1f697f722399956de5ac16737b8854c819276757211ef24caaa258/ml_dash-0.4.0.tar.gz",
"platform": null,
"description": "# ML-Logger API Documentation\n\n**Version:** 0.1.0\n\nML-Logger is a minimal, experiment tracking library for machine learning. It provides a simple API to log\nparameters, metrics, files, and logs during your ML experiments.\n\n## Table of Contents\n\n1. [Installation](#installation)\n2. [Quick Start](#quick-start)\n3. [Core Concepts](#core-concepts)\n4. [API Reference](#api-reference)\n - [Experiment](#experiment)\n - [Parameters](#parameters)\n - [Metrics](#metrics)\n - [Logs](#logs)\n - [Files](#files)\n5. [Usage Patterns](#usage-patterns)\n6. [Examples](#examples)\n7. [Remote Backend](#remote-backend)\n8. [Best Practices](#best-practices)\n\n---\n\n## Installation\n\n```bash\n# Using pip\npip install -i https://test.pypi.org/simple/ ml-logger-beta\n```\n\n---\n\n## Quick Start\n\n### Basic Example\n\n```python\nfrom ml_dash import Experiment\n\n# Create an experiment\nexp = Experiment(\n namespace=\"alice\", # Your username or team\n workspace=\"my-project\", # Project name\n prefix=\"experiment-1\" # Experiment name\n)\n\n# Start tracking\nwith exp.run():\n # Log parameters\n exp.params.set(\n learning_rate=0.001,\n batch_size=32,\n epochs=100\n )\n\n # Log metrics\n for epoch in range(100):\n exp.metrics.log(\n step=epoch,\n loss=0.5 - epoch * 0.01,\n accuracy=0.5 + epoch * 0.005\n )\n\n # Log messages\n exp.info(\"Training completed!\")\n\n # Save files\n exp.files.save({\"final\": \"results\"}, \"results.json\")\n```\n\nThis creates a local directory structure:\n\n```\n.ml-logger/\n\u2514\u2500\u2500 alice/\n \u2514\u2500\u2500 my-project/\n \u2514\u2500\u2500 experiment-1/\n \u251c\u2500\u2500 .ml-logger.meta.json\n \u251c\u2500\u2500 parameters.jsonl\n \u251c\u2500\u2500 metrics.jsonl\n \u251c\u2500\u2500 logs.jsonl\n \u2514\u2500\u2500 files/\n \u2514\u2500\u2500 results.json\n```\n\n---\n\n## Core Concepts\n\n### 1. **Namespace**\n\nYour username or organization (e.g., `\"alice\"`, `\"research-team\"`)\n\n### 2. 
**Workspace**\n\nProject or research area (e.g., `\"image-classification\"`, `\"nlp-experiments\"`)\n\n### 3. **Prefix**\n\nUnique experiment name (e.g., `\"resnet50-run-001\"`)\n\n### 4. **Directory** (Optional)\n\nHierarchical organization within workspace (e.g., `\"models/resnet/cifar10\"`)\n\n### 5. **Local or Remote**\n\nEverything is saved locally by default. Remote sync is optional.\n\n### 6. **Append-Only**\n\nAll data is written in append-only JSONL format for crash safety.\n\n---\n\n## API Reference\n\n### Experiment\n\nThe main class for experiment tracking.\n\n#### Constructor\n\n```python\nExperiment(\n namespace: str, # Required: User/team namespace\nworkspace: str, # Required: Project workspace\nprefix: str, # Required: Experiment name\nremote: str = None, # Optional: Remote server URL\nlocal_root: str = \".ml-logger\", # Local storage directory\ndirectory: str = None, # Optional: Subdirectory path\nreadme: str = None, # Optional: Description\nexperiment_id: str = None, # Optional: Server experiment ID\n)\n```\n\n**Example:**\n\n```python\nexp = Experiment(\n namespace=\"alice\",\n workspace=\"vision\",\n prefix=\"resnet-experiment\",\n directory=\"image-classification/cifar10\",\n readme=\"ResNet50 transfer learning on CIFAR-10\",\n remote=\"https://qwqdug4btp.us-east-1.awsapprunner.com\" # Optional remote server\n)\n```\n\n#### Methods\n\n##### `run(func=None)`\n\nMark experiment as running. Supports 3 patterns:\n\n**Pattern 1: Direct Call**\n\n```python\nexp.run()\n# ... your training code ...\nexp.complete()\n```\n\n**Pattern 2: Context Manager (Recommended)**\n\n```python\nwith exp.run():\n# ... your training code ...\n# Automatically calls complete() on success\n# Automatically calls fail() on exception\n```\n\n**Pattern 3: Decorator**\n\n```python\n@exp.run\ndef train():\n\n\n# ... 
your training code ...\n\ntrain()\n```\n\n##### `complete()`\n\nMark experiment as completed.\n\n```python\nexp.complete()\n```\n\n##### `fail(error: str)`\n\nMark experiment as failed with error message.\n\n```python\ntry:\n# training code\nexcept Exception as e:\n exp.fail(str(e))\n raise\n```\n\n##### Logging Convenience Methods\n\n```python\nexp.info(message: str, ** context) # Log info message\nexp.warning(message: str, ** context) # Log warning message\nexp.error(message: str, ** context) # Log error message\nexp.debug(message: str, ** context) # Log debug message\n```\n\n**Example:**\n\n```python\nexp.info(\"Epoch completed\", epoch=5, loss=0.3, accuracy=0.85)\nexp.warning(\"Memory usage high\", usage_gb=15.2)\nexp.error(\"Training failed\", error=\"CUDA out of memory\")\n```\n\n#### Properties\n\n```python\nexp.namespace # str: Namespace\nexp.workspace # str: Workspace\nexp.prefix # str: Experiment prefix\nexp.directory # str | None: Directory path\nexp.remote # str | None: Remote server URL\nexp.experiment_id # str | None: Server experiment ID\nexp.run_id # str | None: Server run ID\n```\n\n#### Components\n\n```python\nexp.params # ParameterManager\nexp.metrics # MetricsLogger\nexp.files # FileManager\nexp.logs # LogManager\n```\n\n---\n\n### Parameters\n\nManages experiment parameters (hyperparameters, config, etc.)\n\n#### Methods\n\n##### `set(**kwargs)`\n\nSet parameters (replaces existing).\n\n```python\nexp.params.set(\n learning_rate=0.001,\n batch_size=32,\n optimizer=\"adam\",\n model={\n \"layers\": 50,\n \"dropout\": 0.2\n }\n)\n```\n\n##### `extend(**kwargs)`\n\nExtend parameters (deep merge with existing).\n\n```python\n# First call\nexp.params.set(model={\"layers\": 50})\n\n# Extend (merges with existing)\nexp.params.extend(model={\"dropout\": 0.2})\n\n# Result: {\"model\": {\"layers\": 50, \"dropout\": 0.2}}\n```\n\n##### `update(key: str, value: Any)`\n\nUpdate a single parameter (supports dot 
notation).\n\n```python\nexp.params.update(\"model.layers\", 100)\nexp.params.update(\"learning_rate\", 0.0001)\n```\n\n##### `read() -> dict`\n\nRead current parameters.\n\n```python\nparams = exp.params.read()\nprint(params[\"learning_rate\"]) # 0.001\n```\n\n##### `log(**kwargs)`\n\nAlias for `set()` (for API consistency).\n\n```python\nexp.params.log(batch_size=64)\n```\n\n---\n\n### Metrics\n\nLogs time-series metrics with optional namespacing.\n\n#### Methods\n\n##### `log(step=None, **metrics)`\n\nLog metrics immediately.\n\n```python\n# Simple logging\nexp.metrics.log(step=1, loss=0.5, accuracy=0.8)\n\n# Multiple metrics at once\nexp.metrics.log(\n step=10,\n train_loss=0.3,\n val_loss=0.4,\n train_acc=0.85,\n val_acc=0.82\n)\n\n# Without step (uses timestamp only)\nexp.metrics.log(gpu_memory=8.5, cpu_usage=45.2)\n```\n\n##### `collect(step=None, **metrics)`\n\nCollect metrics for later aggregation (useful for batch-level logging).\n\n```python\nfor batch in train_loader:\n loss = train_batch(batch)\n\n # Collect batch metrics (not logged yet)\n exp.metrics.collect(loss=loss.item(), accuracy=acc.item())\n\n# Aggregate and log after epoch\nexp.metrics.flush(_aggregation=\"mean\", step=epoch)\n```\n\n##### `flush(_aggregation=\"mean\", step=None, **additional_metrics)`\n\nFlush collected metrics with aggregation.\n\n**Aggregation methods:**\n\n- `\"mean\"` - Average of collected values (default)\n- `\"sum\"` - Sum of collected values\n- `\"min\"` - Minimum value\n- `\"max\"` - Maximum value\n- `\"last\"` - Last value\n\n```python\n# Collect during training\nfor batch in batches:\n metrics.collect(loss=loss, accuracy=acc)\n\n# Flush with mean aggregation\nmetrics.flush(_aggregation=\"mean\", step=epoch, learning_rate=lr)\n\n# Flush with max aggregation\nmetrics.flush(_aggregation=\"max\", step=epoch)\n```\n\n##### Namespacing: `__call__(namespace: str)`\n\nCreate a namespaced metrics logger.\n\n```python\n# Create namespaced loggers\ntrain_metrics = 
exp.metrics(\"train\")\nval_metrics = exp.metrics(\"val\")\n\n# Log to different namespaces\ntrain_metrics.log(step=1, loss=0.5, accuracy=0.8)\nval_metrics.log(step=1, loss=0.6, accuracy=0.75)\n\n# Results in metrics named: \"train.loss\", \"train.accuracy\", \"val.loss\", \"val.accuracy\"\n```\n\n##### `read() -> list`\n\nRead all logged metrics.\n\n```python\nmetrics_data = exp.metrics.read()\nfor entry in metrics_data:\n    print(entry[\"step\"], entry[\"metrics\"])\n```\n\n---\n\n### Logs\n\nStructured text logging with levels and context.\n\n#### Methods\n\n##### `log(message: str, level: str = \"INFO\", **context)`\n\nLog a message with level and context.\n\n```python\nexp.logs.log(\"Training started\", level=\"INFO\", epoch=0, lr=0.001)\n```\n\n##### Level-Specific Methods\n\n```python\nexp.logs.info(message: str, **context)     # INFO level\nexp.logs.warning(message: str, **context)  # WARNING level\nexp.logs.error(message: str, **context)    # ERROR level\nexp.logs.debug(message: str, **context)    # DEBUG level\n```\n\n**Examples:**\n\n```python\n# Info log\nexp.logs.info(\"Epoch started\", epoch=5, batches=100)\n\n# Warning log\nexp.logs.warning(\"High memory usage\", memory_gb=14.5, threshold_gb=16.0)\n\n# Error log\nexp.logs.error(\"Training failed\", error=\"CUDA OOM\", batch_size=128)\n\n# Debug log\nexp.logs.debug(\"Gradient norm\", grad_norm=2.3, step=1000)\n```\n\n##### `read() -> list`\n\nRead all logs.\n\n```python\nlogs = exp.logs.read()\nfor log_entry in logs:\n    print(f\"[{log_entry['level']}] {log_entry['message']}\")\n    if 'context' in log_entry:\n        print(f\"  Context: {log_entry['context']}\")\n```\n\n---\n\n### Files\n\nManages file storage with auto-format detection.\n\n#### Methods\n\n##### `save(data: Any, filename: str)`\n\nSave data with automatic format detection.\n\n**Supported formats:**\n\n- `.json` - JSON files\n- `.pkl`, `.pickle` - Pickle files\n- `.pt`, `.pth` - PyTorch tensors/models\n- `.npy`, `.npz` - NumPy arrays\n- Other extensions - 
Raw bytes or fallback to pickle\n\n```python\n# JSON\nexp.files.save({\"results\": [1, 2, 3]}, \"results.json\")\n\n# PyTorch model\nexp.files.save(model.state_dict(), \"model.pt\")\n\n# NumPy array\nexp.files.save(numpy_array, \"embeddings.npy\")\n\n# Pickle\nexp.files.save(custom_object, \"object.pkl\")\n\n# Raw bytes\nexp.files.save(b\"binary data\", \"data.bin\")\n\n# Text\nexp.files.save(\"text content\", \"notes.txt\")\n```\n\n##### `save_pkl(data: Any, filename: str)`\n\nSave as pickle (automatically adds .pkl extension).\n\n```python\nexp.files.save_pkl(complex_object, \"checkpoint\")\n# Saves as \"checkpoint.pkl\"\n```\n\n##### `load(filename: str) -> Any`\n\nLoad data with automatic format detection.\n\n```python\n# JSON\nresults = exp.files.load(\"results.json\")\n\n# PyTorch\nstate_dict = exp.files.load(\"model.pt\")\n\n# NumPy\narray = exp.files.load(\"embeddings.npy\")\n\n# Pickle\nobj = exp.files.load(\"object.pkl\")\n```\n\n##### `load_torch(filename: str) -> Any`\n\nLoad PyTorch checkpoint (adds .pt extension if missing).\n\n```python\ncheckpoint = exp.files.load_torch(\"best_model\")\n# Loads \"best_model.pt\"\n```\n\n##### Namespacing: `__call__(namespace: str)`\n\nCreate a namespaced file manager.\n\n```python\n# Create namespaced file managers\ncheckpoints = exp.files(\"checkpoints\")\nconfigs = exp.files(\"configs\")\n\n# Save to different directories\ncheckpoints.save(model.state_dict(), \"epoch_10.pt\")\n# Saves to: files/checkpoints/epoch_10.pt\n\nconfigs.save(config, \"training.json\")\n# Saves to: files/configs/training.json\n```\n\n##### `exists(filename: str) -> bool`\n\nCheck if file exists.\n\n```python\nif exp.files.exists(\"checkpoint.pt\"):\n model.load_state_dict(exp.files.load(\"checkpoint.pt\"))\n```\n\n##### `list() -> list`\n\nList files in current namespace.\n\n```python\nfiles = exp.files.list()\nprint(f\"Files: {files}\")\n\n# With namespace\ncheckpoint_files = exp.files(\"checkpoints\").list()\n```\n\n---\n\n## Usage 
Patterns\n\n### Pattern 1: Simple Training Loop\n\n```python\nfrom ml_dash import Experiment\n\nexp = Experiment(\n namespace=\"alice\",\n workspace=\"mnist\",\n prefix=\"simple-cnn\"\n)\n\nwith exp.run():\n # Log hyperparameters\n exp.params.set(lr=0.001, epochs=10, batch_size=32)\n\n # Training loop\n for epoch in range(10):\n train_loss = train_one_epoch()\n val_loss = validate()\n\n exp.metrics.log(\n step=epoch,\n train_loss=train_loss,\n val_loss=val_loss\n )\n\n # Save model\n exp.files.save(model.state_dict(), \"final_model.pt\")\n```\n\n### Pattern 2: Batch-Level Metrics with Aggregation\n\n```python\nwith exp.run():\n exp.params.set(lr=0.001, batch_size=128)\n\n for epoch in range(100):\n # Collect batch-level metrics\n for batch in train_loader:\n loss, acc = train_step(batch)\n exp.metrics.collect(loss=loss, accuracy=acc)\n\n # Aggregate and log epoch metrics\n exp.metrics.flush(_aggregation=\"mean\", step=epoch)\n```\n\n### Pattern 3: Separate Train/Val Metrics\n\n```python\nwith exp.run():\n # Create namespaced loggers\n train_metrics = exp.metrics(\"train\")\n val_metrics = exp.metrics(\"val\")\n\n for epoch in range(100):\n # Training phase\n for batch in train_loader:\n loss, acc = train_step(batch)\n train_metrics.collect(loss=loss, accuracy=acc)\n train_metrics.flush(_aggregation=\"mean\", step=epoch)\n\n # Validation phase\n val_loss, val_acc = validate()\n val_metrics.log(step=epoch, loss=val_loss, accuracy=val_acc)\n```\n\n### Pattern 4: Checkpoint Management\n\n```python\nwith exp.run():\n checkpoints = exp.files(\"checkpoints\")\n\n best_val_loss = float('inf')\n\n for epoch in range(100):\n train_loss = train()\n val_loss = validate()\n\n # Save regular checkpoint\n if epoch % 10 == 0:\n checkpoints.save(model.state_dict(), f\"epoch_{epoch}.pt\")\n\n # Save best model\n if val_loss < best_val_loss:\n best_val_loss = val_loss\n checkpoints.save({\n \"epoch\": epoch,\n \"model\": model.state_dict(),\n \"optimizer\": optimizer.state_dict(),\n 
\"val_loss\": val_loss\n }, \"best_model.pt\")\n\n # Save final model\n checkpoints.save(model.state_dict(), \"final_model.pt\")\n```\n\n### Pattern 5: Hierarchical Organization\n\n```python\n# Organize experiments in a directory hierarchy\nexp = Experiment(\n namespace=\"alice\",\n workspace=\"vision\",\n prefix=\"run-001\",\n directory=\"image-classification/resnet50/cifar10\"\n)\n# Creates: .ml-logger/alice/vision/image-classification/resnet50/cifar10/run-001/\n```\n\n---\n\n## Examples\n\n### Example 1: Basic MNIST Training\n\n```python\nfrom ml_dash import Experiment\nimport torch\nimport torch.nn as nn\nfrom torch.utils.data import DataLoader\nfrom torchvision import datasets, transforms\n\n# Create experiment\nexp = Experiment(\n namespace=\"alice\",\n workspace=\"mnist\",\n prefix=\"basic-cnn-001\"\n)\n\n# Define model\nmodel = nn.Sequential(\n nn.Conv2d(1, 32, 3),\n nn.ReLU(),\n nn.MaxPool2d(2),\n nn.Conv2d(32, 64, 3),\n nn.ReLU(),\n nn.MaxPool2d(2),\n nn.Flatten(),\n nn.Linear(1600, 128),\n nn.ReLU(),\n nn.Linear(128, 10)\n)\n\n# Training\nwith exp.run():\n # Log configuration\n exp.params.set(\n learning_rate=0.001,\n batch_size=64,\n epochs=10,\n optimizer=\"adam\"\n )\n\n # Setup\n train_loader = DataLoader(\n datasets.MNIST('../data', train=True, download=True,\n transform=transforms.ToTensor()),\n batch_size=64, shuffle=True\n )\n\n optimizer = torch.optim.Adam(model.parameters(), lr=0.001)\n criterion = nn.CrossEntropyLoss()\n\n # Training loop\n for epoch in range(10):\n total_loss = 0\n correct = 0\n total = 0\n\n for data, target in train_loader:\n optimizer.zero_grad()\n output = model(data)\n loss = criterion(output, target)\n loss.backward()\n optimizer.step()\n\n total_loss += loss.item()\n pred = output.argmax(dim=1)\n correct += (pred == target).sum().item()\n total += target.size(0)\n\n # Log epoch metrics\n avg_loss = total_loss / len(train_loader)\n accuracy = correct / total\n\n exp.metrics.log(\n step=epoch,\n loss=avg_loss,\n 
accuracy=accuracy\n )\n\n exp.info(f\"Epoch {epoch}\", loss=avg_loss, accuracy=accuracy)\n\n # Save final model\n exp.files.save(model.state_dict(), \"model.pt\")\n exp.info(\"Training completed!\")\n```\n\n### Example 2: Transfer Learning with Checkpointing\n\n```python\nfrom ml_dash import Experiment\nfrom torchvision import models\nimport torch.nn as nn\n\nexp = Experiment(\n namespace=\"alice\",\n workspace=\"vision\",\n prefix=\"resnet-transfer-001\",\n directory=\"transfer-learning/cifar10\",\n readme=\"ResNet50 transfer learning on CIFAR-10\"\n)\n\nwith exp.run():\n # Configuration\n config = {\n \"model\": \"resnet50\",\n \"pretrained\": True,\n \"num_classes\": 10,\n \"learning_rate\": 0.001,\n \"epochs\": 50,\n \"batch_size\": 128,\n \"early_stopping_patience\": 10\n }\n\n exp.params.set(**config)\n\n # Load pretrained model\n model = models.resnet50(pretrained=True)\n model.fc = nn.Linear(model.fc.in_features, 10)\n\n # Create namespaced loggers\n train_metrics = exp.metrics(\"train\")\n val_metrics = exp.metrics(\"val\")\n checkpoints = exp.files(\"checkpoints\")\n\n best_val_acc = 0.0\n patience_counter = 0\n\n for epoch in range(config[\"epochs\"]):\n # Training phase\n model.train()\n for batch in train_loader:\n loss, acc = train_step(model, batch)\n train_metrics.collect(loss=loss, accuracy=acc)\n\n train_metrics.flush(_aggregation=\"mean\", step=epoch)\n\n # Validation phase\n model.eval()\n val_loss, val_acc = validate(model, val_loader)\n val_metrics.log(step=epoch, loss=val_loss, accuracy=val_acc)\n\n # Checkpoint best model\n if val_acc > best_val_acc:\n best_val_acc = val_acc\n patience_counter = 0\n\n checkpoints.save({\n \"epoch\": epoch,\n \"model_state_dict\": model.state_dict(),\n \"val_accuracy\": val_acc,\n \"config\": config\n }, \"best_model.pt\")\n\n exp.info(\"New best model!\", epoch=epoch, val_acc=val_acc)\n else:\n patience_counter += 1\n\n # Early stopping\n if patience_counter >= config[\"early_stopping_patience\"]:\n 
exp.info(\"Early stopping\", epoch=epoch)\n break\n\n # Regular checkpoint\n if epoch % 10 == 0:\n checkpoints.save(model.state_dict(), f\"checkpoint_epoch_{epoch}.pt\")\n\n # Save final summary\n exp.files.save({\n \"best_val_accuracy\": best_val_acc,\n \"total_epochs\": epoch + 1,\n \"config\": config\n }, \"summary.json\")\n\n exp.info(\"Training completed!\", best_val_acc=best_val_acc)\n```\n\n### Example 3: Hyperparameter Sweep\n\n```python\nfrom ml_dash import Experiment\n\n# Define hyperparameter grid\nlearning_rates = [0.001, 0.0001, 0.00001]\nbatch_sizes = [32, 64, 128]\n\nfor lr in learning_rates:\n for bs in batch_sizes:\n # Create unique experiment for each combination\n exp = Experiment(\n namespace=\"alice\",\n workspace=\"hp-sweep\",\n prefix=f\"lr{lr}_bs{bs}\",\n directory=\"mnist/grid-search\"\n )\n\n with exp.run():\n # Log this combination\n exp.params.set(\n learning_rate=lr,\n batch_size=bs,\n model=\"simple-cnn\"\n )\n\n # Train with these hyperparameters\n final_acc = train_model(lr, bs)\n\n # Log final result\n exp.metrics.log(step=0, final_accuracy=final_acc)\n exp.info(\"Sweep run completed\", lr=lr, bs=bs, acc=final_acc)\n\nprint(\"Hyperparameter sweep completed!\")\n```\n\n### Example 4: Multi-Stage Training\n\n```python\nfrom ml_dash import Experiment\n\nexp = Experiment(\n namespace=\"alice\",\n workspace=\"nlp\",\n prefix=\"bert-finetuning-001\",\n directory=\"transformers/bert/squad\"\n)\n\nwith exp.run():\n # Stage 1: Warmup\n exp.params.set(stage=\"warmup\", lr=0.00001, epochs=5)\n exp.info(\"Starting warmup phase\")\n\n warmup_metrics = exp.metrics(\"warmup\")\n for epoch in range(5):\n loss = train_epoch(lr=0.00001)\n warmup_metrics.log(step=epoch, loss=loss)\n\n # Stage 2: Main training\n exp.params.extend(stage=\"main\", lr=0.0001, epochs=20)\n exp.info(\"Starting main training phase\")\n\n train_metrics = exp.metrics(\"train\")\n val_metrics = exp.metrics(\"val\")\n\n for epoch in range(20):\n train_loss = 
train_epoch(lr=0.0001)\n val_loss = validate()\n\n train_metrics.log(step=epoch, loss=train_loss)\n val_metrics.log(step=epoch, loss=val_loss)\n\n # Stage 3: Fine-tuning\n exp.params.extend(stage=\"finetune\", lr=0.00001, epochs=10)\n exp.info(\"Starting fine-tuning phase\")\n\n finetune_metrics = exp.metrics(\"finetune\")\n for epoch in range(10):\n loss = train_epoch(lr=0.00001)\n finetune_metrics.log(step=epoch, loss=loss)\n\n exp.info(\"Multi-stage training completed!\")\n```\n\n---\n\n## Remote Backend\n\nML-Logger supports syncing to a remote server for team collaboration.\n\n### Setup Remote Backend\n\n```python\nexp = Experiment(\n    namespace=\"alice\",\n    workspace=\"shared-project\",\n    prefix=\"experiment-001\",\n    remote=\"http://qwqdug4btp.us-east-1.awsapprunner.com\",  # Remote server URL\n    readme=\"Shared experiment for team\"\n)\n```\n\n### How It Works\n\n1. **Local or Remote**: Data can be saved locally or remotely\n2. **Automatic Sync**: When `remote` is specified, data syncs to the server\n3. **Experiment Creation**: The server creates an experiment record\n4. **Run Tracking**: The server tracks run status (RUNNING, COMPLETED, FAILED)\n5. **GraphQL API**: Query experiments via GraphQL at `http://qwqdug4btp.us-east-1.awsapprunner.com/graphql`\n\n### Environment Variables\n\nConfigure the remote backend via environment variables:\n\n```bash\nexport ML_LOGGER_REMOTE=\"http://qwqdug4btp.us-east-1.awsapprunner.com\"\nexport ML_LOGGER_NAMESPACE=\"alice\"\nexport ML_LOGGER_WORKSPACE=\"production\"\n```\n\n```python\n# Uses environment variables if not specified\nexp = Experiment(prefix=\"my-experiment\")\n```\n\n### Server Requirements\n\nTo use the remote backend, you need the dash-server running:\n\n```bash\ncd ml-dash/ml-dash-server\npnpm install\npnpm dev\n```\n\nThe server will be available at `http://qwqdug4btp.us-east-1.awsapprunner.com`.\n\n---\n\n## Best Practices\n\n### 1. 
**Use Context Manager**\n\nAlways use `with exp.run():` for automatic cleanup:\n\n```python\n# Good\nwith exp.run():\n train()\n\n# Avoid\nexp.run()\ntrain()\nexp.complete() # Easy to forget!\n```\n\n### 2. **Namespace Metrics and Files**\n\nOrganize metrics and files with namespaces:\n\n```python\ntrain_metrics = exp.metrics(\"train\")\nval_metrics = exp.metrics(\"val\")\ntest_metrics = exp.metrics(\"test\")\n\ncheckpoints = exp.files(\"checkpoints\")\nconfigs = exp.files(\"configs\")\nvisualizations = exp.files(\"plots\")\n```\n\n### 3. **Use collect() + flush() for Batch Metrics**\n\nFor fine-grained batch logging with epoch aggregation:\n\n```python\nfor epoch in range(epochs):\n for batch in batches:\n loss = train_batch(batch)\n exp.metrics.collect(loss=loss)\n\n # Log aggregated metrics once per epoch\n exp.metrics.flush(_aggregation=\"mean\", step=epoch)\n```\n\n### 4. **Log Configuration Early**\n\nLog all hyperparameters at the start:\n\n```python\nwith exp.run():\n exp.params.set(\n model=\"resnet50\",\n learning_rate=0.001,\n batch_size=128,\n epochs=100,\n optimizer=\"adam\",\n dataset=\"cifar10\"\n )\n # ... training ...\n```\n\n### 5. **Use Hierarchical Organization**\n\nOrganize experiments with directories:\n\n```python\nexp = Experiment(\n namespace=\"alice\",\n workspace=\"vision\",\n prefix=\"run-001\",\n directory=\"models/resnet/cifar10\" # Hierarchical organization\n)\n```\n\n### 6. **Add Context to Logs**\n\nMake logs searchable with context:\n\n```python\nexp.info(\"Epoch completed\",\n epoch=5,\n train_loss=0.3,\n val_loss=0.35,\n learning_rate=0.001)\n\nexp.warning(\"High memory usage\",\n memory_gb=14.5,\n available_gb=16.0,\n batch_size=128)\n```\n\n### 7. 
**Save Comprehensive Checkpoints**\n\nInclude all state needed for resumption:\n\n```python\ncheckpoints.save({\n \"epoch\": epoch,\n \"model_state_dict\": model.state_dict(),\n \"optimizer_state_dict\": optimizer.state_dict(),\n \"scheduler_state_dict\": scheduler.state_dict(),\n \"best_val_loss\": best_val_loss,\n \"config\": config\n}, \"checkpoint.pt\")\n```\n\n### 8. **Version Control Integration**\n\nLog git information:\n\n```python\nimport subprocess\n\ngit_hash = subprocess.check_output(\n [\"git\", \"rev-parse\", \"HEAD\"]\n).decode().strip()\n\nexp.params.set(\n git_commit=git_hash,\n git_branch=\"main\"\n)\n```\n\n### 9. **Error Handling**\n\nThe context manager handles errors automatically, but you can add custom handling:\n\n```python\nwith exp.run():\n try:\n train()\n except RuntimeError as e:\n exp.error(\"Training error\", error=str(e), device=\"cuda:0\")\n raise\n```\n\n---\n\n## File Format Details\n\n### metrics.jsonl\n\n```json\n{\n \"timestamp\": 1234567890.123,\n \"step\": 0,\n \"metrics\": {\n \"loss\": 0.5,\n \"accuracy\": 0.8\n }\n}\n{\n \"timestamp\": 1234567891.456,\n \"step\": 1,\n \"metrics\": {\n \"loss\": 0.4,\n \"accuracy\": 0.85\n }\n}\n```\n\n### parameters.jsonl\n\n```json\n{\n \"timestamp\": 1234567890.123,\n \"operation\": \"set\",\n \"data\": {\n \"lr\": 0.001,\n \"batch_size\": 32\n }\n}\n{\n \"timestamp\": 1234567892.456,\n \"operation\": \"update\",\n \"key\": \"lr\",\n \"value\": 0.0001\n}\n```\n\n### logs.jsonl\n\n```json\n{\n \"timestamp\": 1234567890.123,\n \"level\": \"INFO\",\n \"message\": \"Training started\",\n \"context\": {\n \"epoch\": 0\n }\n}\n{\n \"timestamp\": 1234567891.456,\n \"level\": \"WARNING\",\n \"message\": \"High memory\",\n \"context\": {\n \"memory_gb\": 14.5\n }\n}\n```\n\n### .ml-logger.meta.json\n\n```json\n{\n \"namespace\": \"alice\",\n \"workspace\": \"vision\",\n \"prefix\": \"experiment-1\",\n \"status\": \"completed\",\n \"started_at\": 1234567890.123,\n \"completed_at\": 
1234567900.456,\n \"readme\": \"ResNet experiment\",\n \"experiment_id\": \"exp_123\",\n \"hostname\": \"gpu-server-01\"\n}\n```\n\n---\n\n## Troubleshooting\n\n### Issue: Remote server connection failed\n\n```bash\n# Warning: Failed to initialize experiment on remote server\n# Solution: check that the dash-server is running\ncd ml-dash/dash-server && pnpm dev\n```\n\n### Issue: File not found when loading\n\n```python\n# Check if the file exists first\nif exp.files.exists(\"model.pt\"):\n    model.load_state_dict(exp.files.load(\"model.pt\"))\nelse:\n    print(\"Checkpoint not found\")\n```\n\n### Issue: Metrics not aggregating correctly\n\n```python\n# Make sure to call flush() after collect()\nfor batch in batches:\n    metrics.collect(loss=loss)\n\nmetrics.flush(_aggregation=\"mean\", step=epoch)  # Don't forget this!\n```\n\n---\n\n## API Summary\n\n| Component | Key Methods | Purpose |\n|----------------|---------------------------------------------|-----------------------------|\n| **Experiment** | `run()`, `complete()`, `fail()`, `info()` | Manage experiment lifecycle |\n| **Parameters** | `set()`, `extend()`, `update()`, `read()` | Store configuration |\n| **Metrics** | `log()`, `collect()`, `flush()`, `read()` | Track time-series metrics |\n| **Logs** | `info()`, `warning()`, `error()`, `debug()` | Structured logging |\n| **Files** | `save()`, `load()`, `exists()`, `list()` | File management |\n\n---\n\n## Additional Resources\n\n- **GitHub**: https://github.com/vuer-ai/vuer-dashboard\n- **Examples**: See the `ml-logger/examples/` directory\n- **Tests**: See `ml-logger/tests/` for usage examples\n- **Dashboard**: http://qwqdug4btp.us-east-1.awsapprunner.com (when the dash-server is running)\n\n---\n\n## Contributing & Development\n\n### Development Setup\n\n1. **Clone the repository**:\n ```bash\n git clone https://github.com/vuer-ai/vuer-dashboard.git\n cd vuer-dashboard/ml-logger\n ```\n\n2. 
**Install with development dependencies**:\n ```bash\n # Install all dev dependencies (includes testing, docs, and torch)\n uv sync --extra dev\n ```\n\n### Running Tests\n\n```bash\n# Run all tests\nuv run pytest\n\n# Run with coverage\nuv run pytest --cov=ml_dash --cov-report=html\n\n# Run specific test file\nuv run pytest tests/test_backends.py\n```\n\n### Building Documentation\n\n```bash\n# Build HTML documentation\ncd docs && make html\n\n# Serve docs with live reload (auto-refreshes on file changes)\ncd docs && make serve\n\n# Clean build artifacts\ncd docs && make clean\n```\n\nThe built documentation will be in `docs/_build/html/`. The `make serve` command starts a local server at `http://localhost:8000` with automatic rebuilding on file changes.\n\n### Linting and Code Checks\n\n```bash\n# Format code\nuv run ruff format .\n\n# Lint code\nuv run ruff check .\n\n# Fix auto-fixable issues\nuv run ruff check --fix .\n```\n\n### Project Structure\n\n```\nml-logger/\n\u251c\u2500\u2500 src/ml_logger_beta/ # Main package source\n\u2502 \u251c\u2500\u2500 __init__.py # Package exports\n\u2502 \u251c\u2500\u2500 run.py # Experiment class\n\u2502 \u251c\u2500\u2500 ml_logger.py # ML_Logger class\n\u2502 \u251c\u2500\u2500 job_logger.py # JobLogger class\n\u2502 \u251c\u2500\u2500 backends/ # Storage backends\n\u2502 \u2502 \u251c\u2500\u2500 base.py # Base backend interface\n\u2502 \u2502 \u251c\u2500\u2500 local_backend.py # Local filesystem backend\n\u2502 \u2502 \u2514\u2500\u2500 dash_backend.py # Remote server backend\n\u2502 \u2514\u2500\u2500 components/ # Component managers\n\u2502 \u251c\u2500\u2500 parameters.py # Parameter management\n\u2502 \u251c\u2500\u2500 metrics.py # Metrics logging\n\u2502 \u251c\u2500\u2500 logs.py # Structured logging\n\u2502 \u2514\u2500\u2500 files.py # File management\n\u251c\u2500\u2500 tests/ # Test suite\n\u251c\u2500\u2500 docs/ # Sphinx documentation\n\u251c\u2500\u2500 pyproject.toml # Package 
configuration\n\u2514\u2500\u2500 README.md # This file\n```\n\n### Dependency Structure\n\nThe project uses a simplified dependency structure:\n\n- **`dependencies`**: Core runtime dependencies (always installed)\n - `msgpack`, `numpy`, `requests`\n- **`dev`**: All development dependencies\n - Linting and formatting: `ruff`\n - Testing: `pytest`, `pytest-cov`, `pytest-asyncio`\n - Documentation: `sphinx`, `furo`, `myst-parser`, `sphinx-copybutton`, `sphinx-autobuild`\n - Optional features: `torch` (for saving/loading .pt/.pth files)\n\n### Making Changes\n\n1. Create a new branch for your changes\n2. Make your modifications\n3. Run tests to ensure everything works: `uv run pytest`\n4. Run linting: `uv run ruff check .`\n5. Format code: `uv run ruff format .`\n6. Update documentation if needed\n7. Submit a pull request\n\n### Building and Publishing\n\n```bash\n# Build the package\nuv build\n\n# Publish to PyPI (requires credentials)\nuv publish\n```\n\n### Tips for Contributors\n\n- Follow the existing code style (enforced by ruff)\n- Add tests for new features\n- Update documentation for API changes\n- Use type hints where appropriate\n- Keep functions focused and modular\n- Write descriptive commit messages\n\n---\n\n**Happy Experimenting!** \ud83d\ude80\n",
"bugtrack_url": null,
"license": null,
"summary": "Add your description here",
"version": "0.4.0",
"project_urls": null,
"split_keywords": [],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "09b1764f83cfa4292a64090cc570bdbb656462c745571db014bf9f6aef18c7f7",
"md5": "21ed34a054bb96f9043acce31b60ab1f",
"sha256": "6993c48cb6cc443162f60093edf85eed4a501d02e8b07217e7cb5d553ad243bb"
},
"downloads": -1,
"filename": "ml_dash-0.4.0-py3-none-any.whl",
"has_sig": false,
"md5_digest": "21ed34a054bb96f9043acce31b60ab1f",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.12",
"size": 34440,
"upload_time": "2025-10-25T19:56:02",
"upload_time_iso_8601": "2025-10-25T19:56:02.198694Z",
"url": "https://files.pythonhosted.org/packages/09/b1/764f83cfa4292a64090cc570bdbb656462c745571db014bf9f6aef18c7f7/ml_dash-0.4.0-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "10319d5ccf1f697f722399956de5ac16737b8854c819276757211ef24caaa258",
"md5": "c6766ac8104f62fc48449ad03a6a9d35",
"sha256": "729a07f7c1d7bd2bfe38bcd853edd47763e2a49392dde430a47fda2026a3c5d2"
},
"downloads": -1,
"filename": "ml_dash-0.4.0.tar.gz",
"has_sig": false,
"md5_digest": "c6766ac8104f62fc48449ad03a6a9d35",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.12",
"size": 27213,
"upload_time": "2025-10-25T19:56:03",
"upload_time_iso_8601": "2025-10-25T19:56:03.449999Z",
"url": "https://files.pythonhosted.org/packages/10/31/9d5ccf1f697f722399956de5ac16737b8854c819276757211ef24caaa258/ml_dash-0.4.0.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2025-10-25 19:56:03",
"github": false,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"lcname": "ml-dash"
}