llm-instrumentation

- Name: llm-instrumentation
- Version: 1.2.11
- Summary: A high-performance instrumentation framework for capturing, streaming, and analyzing internal activations of large language models
- Author email: Ruben Fernandez Boullon <rubenfernandezboullon@gmail.com>
- Upload time: 2025-11-03 19:14:19
- Requires Python: >=3.8
- Keywords: llm, instrumentation, interpretability, pytorch, transformers, profiling
- Homepage: https://github.com/rubenfb23/STRAP-LLM
- Repository: https://github.com/rubenfb23/STRAP-LLM
- Documentation: https://github.com/rubenfb23/STRAP-LLM/tree/main/llm-instrumentation/docs
- Bug Tracker: https://github.com/rubenfb23/STRAP-LLM/issues

# LLM Instrumentation Framework

A high-performance instrumentation framework for LLM interpretability and observability.

## Objectives

- **Throughput**: Maintain ≥ 90% of un-instrumented inference speed.
- **Data rate**: Sustain ≥ 2 GB/s activation streaming to disk.
- **Compression**: Achieve ≥ 3× size reduction; when lossy compression is enabled, keep the reconstruction error below 1e-3.
- **Memory**: Keep host RAM usage ≤ 24 GB with backpressure and buffering.
- **Resilience**: Automatic checkpoints safeguard long generations from data loss.

## Stack

- **Runtime**: PyTorch, asyncio, threading.
- **GPU**: Optional CUDA streams and pinned buffers (see `memory/cuda_manager.py`).
- **Compression**: LZ4, Zstd, optional no-op.
- **Analysis**: Hooks for downstream causal graphs and SAE-based features.

## Install

```bash
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
pip install -e .
```

## Quick Usage

```python
import torch
from llm_instrumentation import capture_activations
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

with capture_activations(model, preset="balanced", output_path="output.stream") as framework:
    _ = model(torch.randint(0, 100, (1, 16)))

analysis = framework.analyze_activations("output.stream")
```

## Per-token Tracking & Checkpointing (opt-in)

Enable lightweight token boundary tracking without affecting the compression/streaming pipeline. Token metadata is stored in memory and saved to `{output_path}_tokens.json` on context exit. Optional checkpoints persist snapshots every _N_ tokens to make long streaming sessions resumable.

```python
from llm_instrumentation import analyze_activations_with_tokens

# Assumes `model` and `framework` from the Quick Usage example above, plus a
# matching `tokenizer` (e.g. loaded via transformers.AutoTokenizer.from_pretrained).
with framework.capture_activations("gen.stream", track_per_token=True) as tracker:
    ids = torch.randint(0, 100, (1, 8))
    for _ in range(32):
        with torch.no_grad():
            out = model(ids)
            next_tok = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
        tracker.record_token(next_tok[0].item(), tokenizer.decode(next_tok[0]))
        ids = torch.cat([ids, next_tok], dim=-1)
        if next_tok[0].item() == tokenizer.eos_token_id:
            break

analysis = analyze_activations_with_tokens("gen.stream", framework)
print("bytes_per_token:", analysis.get("bytes_per_token"))

# Persist checkpoints every 64 tokens to allow resuming long captures.
with framework.capture_activations(
    "gen.stream",
    track_per_token=True,
    checkpoint_interval_tokens=64,
) as tracker:
    ...

# Resume the same capture later (appends to the existing stream file).
with framework.capture_activations(
    "gen.stream",
    track_per_token=True,
    checkpoint_interval_tokens=64,
    resume_from_checkpoint=True,
) as tracker:
    ...
```

Checkpoint files default to `{output_path}.ckpt.json`; override with `checkpoint_path` to control placement. Successful captures clean up checkpoints after flushing token metadata.
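
For example, a capture that keeps its checkpoint somewhere other than the default location might look like this (the path below is purely illustrative):

```python
with framework.capture_activations(
    "gen.stream",
    track_per_token=True,
    checkpoint_interval_tokens=64,
    checkpoint_path="checkpoints/gen.ckpt.json",  # illustrative path, not the default
) as tracker:
    ...
```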

## Configuration

- `InstrumentationConfig.fast_capture()` - minimal overhead capture without compression.
- `InstrumentationConfig.balanced()` - default preset balancing throughput and compression.
- `InstrumentationConfig.max_compression()` - prioritize disk footprint with Zstd.
- `InstrumentationConfig.attention_analysis()` / `.mlp_analysis()` - capture subsets for focused studies.
- Builder-style overrides: e.g. `InstrumentationConfig.balanced().with_compression("zstd").with_memory_limit(16)`.
- Direct parameters:
  - `granularity` (`HookGranularity`): `FULL_TENSOR`, `SAMPLED_SLICES`, `ATTENTION_ONLY`, `MLP_ONLY`.
  - `compression_algorithm` (`str`): `"lz4"`, `"zstd"`, or `"none"`.
  - `target_throughput_gbps` (`float`): Desired streaming rate for tuning.
  - `max_memory_gb` (`float|None`): Budget for host buffering policies.

Refer to `docs/API.md` for full API details.
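
As a concrete example, a preset combined with builder overrides can be passed to `capture_activations` via `config=` (a minimal sketch; it assumes `InstrumentationConfig` is importable from the package root — adjust the import to match your installed layout):

```python
from llm_instrumentation import capture_activations, InstrumentationConfig

# Start from the balanced preset, then override the codec and host-memory budget.
config = (
    InstrumentationConfig.balanced()
    .with_compression("zstd")
    .with_memory_limit(16)  # GB of host RAM available for buffering
)

with capture_activations(model, config=config, output_path="run.stream"):
    ...  # run inference as usual; activations stream to run.stream
```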

## Automatic Configuration

The `llm_instrumentation.core.auto_detect` module can derive sensible defaults from a model instance:

```python
from llm_instrumentation import capture_activations
from llm_instrumentation.core.auto_detect import create_optimized_config, detect_model_architecture

arch = detect_model_architecture(model)
config = create_optimized_config(arch, purpose="performance_analysis")
with capture_activations(model, config=config, output_path=f"{arch}.stream"):
    ...
```

This path keeps manual overrides available via the builder helpers while accelerating setup for common analysis workflows.

## Stream Format

Each packet:

- Header: network byte order, struct format `!HI` → `(name_len: uint16, data_len: uint32)`
- Name: UTF-8 layer/module name (`name_len` bytes)
- Data: compressed tensor bytes (`data_len` bytes)

See `docs/STREAM_FORMAT.md` for a parsing example.
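
Until you consult that document, a minimal packet reader based on the layout above might look like this (a sketch only; the payload stays compressed, and decoding it depends on which `compression_algorithm` was used during capture):

```python
import struct

HEADER = struct.Struct("!HI")  # (name_len: uint16, data_len: uint32), network byte order

def iter_packets(path):
    """Yield (layer_name, compressed_bytes) pairs from a .stream file."""
    with open(path, "rb") as f:
        while True:
            header = f.read(HEADER.size)
            if len(header) < HEADER.size:
                break  # end of stream
            name_len, data_len = HEADER.unpack(header)
            name = f.read(name_len).decode("utf-8")
            data = f.read(data_len)  # still compressed (lz4 / zstd / none)
            yield name, data

for name, payload in iter_packets("output.stream"):
    print(name, len(payload))
```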

## Architecture

E2E path: PyTorch forward hooks → async enqueue → compression workers → ring buffer → async file writer. See `docs/ARCHITECTURE.md`.
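
A stripped-down stand-in for that pipeline (illustrative only, not the library's implementation — the real path is asynchronous and uses a ring buffer plus multiple compression workers) could look like:

```python
import queue
import struct
import threading

import lz4.frame  # one of the supported codecs; zstd or a no-op would also work

work_q = queue.Queue(maxsize=256)  # bounded queue gives simple backpressure

def make_hook(name):
    """Forward hook that copies the output tensor to host memory and enqueues it."""
    def _hook(module, inputs, output):
        work_q.put((name, output.detach().cpu().numpy().tobytes()))
    return _hook

def writer(path):
    """Single worker that compresses queued tensors and appends stream packets."""
    with open(path, "ab") as f:
        while (item := work_q.get()) is not None:
            name, raw = item
            data = lz4.frame.compress(raw)
            name_b = name.encode("utf-8")
            f.write(struct.pack("!HI", len(name_b), len(data)) + name_b + data)

t = threading.Thread(target=writer, args=("toy.stream",), daemon=True)
t.start()
# handle = some_module.register_forward_hook(make_hook("model.layers.0.mlp"))
# ... run inference ...
work_q.put(None)  # sentinel: stop the writer after draining the queue
t.join()
```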

## Benchmarks & Performance

Run `scripts/run_benchmarks.sh` and see `docs/PERFORMANCE.md` for targets, methodology, and how to generate reports.

## Block I/O Instrumentation

### Overview

STRAP-LLM includes eBPF-based block I/O monitoring to correlate disk performance with activation streaming:

- **`scripts/tracepoints.py`**: Captures latency histograms and queue depth using stable kernel tracepoints (`block:block_rq_issue`/`block:block_rq_complete`)
- **`scripts/analyze_tracepoints.py`**: Generates summaries and PNG visualizations from persisted JSONL snapshots

### Quick Start

**Collect I/O metrics:**

```bash
sudo python3 scripts/tracepoints.py --interval 5 --output tracepoints.jsonl
```

**Analyze results:**

```bash
python3 scripts/analyze_tracepoints.py \
  --input tracepoints.jsonl \
  --output-dir ../benchmarks/systems/I-O
```

### Features

- **Low overhead**: < 1% CPU usage, ~100ns per I/O request
- **Stable ABI**: Uses kernel tracepoints (no kprobes)
- **Async persistence**: Memory-mapped JSONL writer with batch flushes
- **Log₂ histograms**: Constant memory usage at any IOPS level
- **Queue depth tracking**: In-flight request monitoring per device

### CLI Options

| Flag | Description | Default |
|------|-------------|---------|
| `--interval` | Sampling interval (seconds) | 5.0 |
| `--output` | JSONL output file | `tracepoints.jsonl` |
| `--no-output` | Disable file output | False |
| `--flush-every` | Snapshots per flush | 12 |
| `--fsync` | Force fsync after flush | False |

### Output Format

Each JSONL line contains:

- Timestamp (Unix epoch + ISO 8601)
- Per-device latency histogram (log₂ buckets in μs)
- Per-device in-flight request count

**Example:**

```json
{
  "timestamp": 1696262400.123,
  "iso_timestamp": "2025-10-02T14:20:00.123000+00:00",
  "interval_s": 5.0,
  "latency_histogram": [
    {
      "device_name": "nvme0n1",
      "total": 45123,
      "buckets": [
        {"slot": 4, "count": 12000, "bucket_low": 16, "bucket_high": 31}
      ]
    }
  ],
  "inflight": [
    {"device_name": "nvme0n1", "count": 24}
  ]
}
```
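
To skim such snapshots without the analysis script, the log₂ buckets can be collapsed into a rough per-device latency figure (a sketch that assumes only the field names shown above):

```python
import json

def summarize(path):
    """Print per-device request totals and a coarse mean latency per snapshot."""
    with open(path) as f:
        for line in f:
            snap = json.loads(line)
            for dev in snap.get("latency_histogram", []):
                total = dev["total"] or 1
                # Use each bucket's midpoint (μs) as a coarse latency estimate.
                est = sum(
                    b["count"] * (b["bucket_low"] + b["bucket_high"]) / 2
                    for b in dev["buckets"]
                ) / total
                print(f'{snap["iso_timestamp"]} {dev["device_name"]}: '
                      f'{dev["total"]} IOs, ~{est:.1f} μs mean latency')

summarize("tracepoints.jsonl")
```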

### Documentation

See **`docs/BLOCK_IO_TRACEPOINTS.md`** for:

- Prerequisites and installation
- Detailed usage examples
- Integration with LLM workflows
- Troubleshooting guide
- Performance characteristics
- Advanced customization

## CPU & Memory Metrics

### Overview

- **`scripts/system_metrics.py`**: Hooks the `exceptions:page_fault_user` and `sched:sched_switch` tracepoints to capture per-PID page faults, off-CPU time, and CPU/I/O/memory PSI pressure.
- Runs as root and persists JSONL snapshots with the `off_cpu_ns`, `page_faults`, and `pressure` fields every `N` seconds.
- The output complements the latency/queue-depth histograms produced by `tracepoints.py`, allowing service latency to be correlated with CPU contention, swapping, and system-wide pressure.

### Quick Start

```bash
sudo python3 scripts/system_metrics.py --interval 5 --output system_metrics.jsonl
```

Each JSON line includes `timestamp`, `iso_timestamp`, `interval_s`, an `off_cpu_ns` map (PID → nanoseconds spent off-CPU), `page_faults` (PID → user page faults), and a `pressure` structure with PSI metrics for CPU, I/O, and memory.

To print samples to the terminal only, add `--no-output`. Use `--flush-every` and `--fsync` to control asynchronous flushing to disk.

### CLI Options

| Flag | Description | Default |
|------|-------------|---------|
| `--interval` | Interval between snapshots (seconds) | 5.0 |
| `--output` | JSONL output file | `system_metrics.jsonl` |
| `--no-output` | Disable writing to disk | False |
| `--flush-every` | Snapshots per flush | 12 |
| `--fsync` | Force fsync after each flush | False |

### Correlation

Combine `system_metrics.jsonl` and `tracepoints.jsonl` with `scripts/analyze_tracepoints.py` or custom pandas workloads to attribute latency to CPU contention, page faults, PSI pressure, or disk I/O.
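
A minimal pandas starting point for that correlation (a sketch; it assumes both files were captured with the same `--interval` and carry the top-level `timestamp` field shown earlier):

```python
import pandas as pd

# Load both JSONL streams and align them on their snapshot timestamps.
io = pd.read_json("tracepoints.jsonl", lines=True)
sysm = pd.read_json("system_metrics.jsonl", lines=True)

io["ts"] = pd.to_datetime(io["timestamp"], unit="s")
sysm["ts"] = pd.to_datetime(sysm["timestamp"], unit="s")

# Nearest-timestamp join within half the default 5 s interval keeps snapshots paired.
merged = pd.merge_asof(
    io.sort_values("ts"),
    sysm.sort_values("ts"),
    on="ts",
    tolerance=pd.Timedelta(seconds=2.5),
    direction="nearest",
)
print(merged[["ts", "latency_histogram", "pressure"]].head())
```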

## Development

- Tests: `pytest -q` in repo root or the package directory.
- Examples: `examples/basic_usage.py`.
- Contributions: PRs welcome. Keep changes focused and covered by tests.