# SAE Bench
## Table of Contents
- [Overview](#overview)
- [Installation](#installation)
- [Getting Started](#getting-started)
- [Running Evaluations with SAE Lens](#running-evaluations-with-sae-lens)
- [Custom SAE Usage](#custom-sae-usage)
- [Training Your Own SAEs](#training-your-own-saes)
- [Graphing Results](#graphing-results)
- [Computational Requirements](#computational-requirements)
- [Development](#development)
CURRENT REPO STATUS: SAE Bench is a beta release. This repo is still under development as we clean up rough edges left over from the research process, but it is already usable for both SAE Lens SAEs and custom SAEs.
## Overview
SAE Bench is a comprehensive suite of 8 evaluations for Sparse Autoencoder (SAE) models:
- **[Feature Absorption](https://arxiv.org/abs/2409.14507)**
- **[AutoInterp](https://blog.eleuther.ai/autointerp/)**
- **L0 / Loss Recovered**
- **[RAVEL](https://arxiv.org/abs/2402.17700) (under development)**
- **[Spurious Correlation Removal (SCR)](https://arxiv.org/abs/2411.18895)**
- **[Targeted Probe Perturbation (TPP)](https://arxiv.org/abs/2411.18895)**
- **Sparse Probing**
- **[Unlearning](https://arxiv.org/abs/2410.19278)**
For more information, refer to our [blog post](https://www.neuronpedia.org/sae-bench/info).
### Supported Models and SAEs
- **SAE Lens Pretrained SAEs**: Supports evaluations on any [SAE Lens](https://github.com/jbloomAus/SAELens) SAE.
- **dictionary_learning SAEs**: We support evaluations on any SAE trained with the [dictionary_learning repo](https://github.com/saprmarks/dictionary_learning) (see [Custom SAE Usage](#custom-sae-usage)).
- **Custom SAEs**: Supports any general SAE object with `encode()` and `decode()` methods (see [Custom SAE Usage](#custom-sae-usage)).
### Installation
Set up a virtual environment with Python >= 3.10.
```
git clone https://github.com/adamkarvonen/SAEBench.git
cd SAEBench
pip install -e .
```
Alternatively, you can install from PyPI:
```
pip install sae-bench
```
If you encounter dependency issues, you can use our tested working versions by uncommenting the fixed versions in pyproject.toml. All evals can be run with the current batch sizes on Gemma-2-2B on a GPU with 24 GB of VRAM (e.g., an RTX 3090). By default, some evals cache LLM activations, which can require up to 100 GB of disk space; this caching can be disabled.
AutoInterp requires an `openai_api_key.txt` file containing your OpenAI API key. Unlearning requires requesting access to the WMDP bio dataset (refer to `unlearning/README.md`).
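For reference, here is a minimal sketch of creating the key file from Python. The assumption that the file lives in the directory you launch the evals from is ours; adjust the path if your setup expects it elsewhere.

```python
# Sketch: write your OpenAI API key to openai_api_key.txt.
# Assumption: the evals read this file from the working directory you launch
# them from; adjust the path if your setup differs.
from pathlib import Path

Path("openai_api_key.txt").write_text("sk-...your-key-here...\n")
```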
## Getting Started
We recommend getting started by working through the `sae_bench_demo.ipynb` notebook. In this notebook, we load both a custom SAE and an SAE Lens SAE, run both on multiple evaluations, and plot graphs of the results.
## Running Evaluations with SAE Lens
Each evaluation has an example command located in its respective `main.py` file. To run all evaluations on a selection of SAE Lens SAEs, refer to `shell_scripts/README.md`. Here's an example of how to run a sparse probing evaluation on a single SAE Bench Pythia-70M SAE:
```
python -m sae_bench.evals.sparse_probing.main \
--sae_regex_pattern "sae_bench_pythia70m_sweep_standard_ctx128_0712" \
--sae_block_pattern "blocks.4.hook_resid_post__trainer_10" \
--model_name pythia-70m-deduped
```
The results will be saved to the `eval_results/sparse_probing` directory.
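As a rough sketch of how you might inspect those results programmatically (the file naming and JSON layout below are assumptions; check the contents of your own `eval_results/` directory):

```python
# Sketch: load every sparse probing result JSON written by the command above.
# The JSON structure may vary between evals, so we only print top-level keys
# rather than assuming specific fields.
import json
from pathlib import Path

results_dir = Path("eval_results/sparse_probing")
for result_file in sorted(results_dir.glob("*.json")):
    with open(result_file) as f:
        result = json.load(f)
    print(result_file.name, "->", list(result.keys()))
```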
We use regex patterns to select SAE Lens SAEs. For more examples of regex patterns, refer to `sae_regex_selection.ipynb`.
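For intuition, the selection boils down to matching regular expressions against SAE Lens release names and SAE ids. A minimal sketch using the strings from the example command above (we use `re.fullmatch` purely for illustration):

```python
# Sketch: --sae_regex_pattern and --sae_block_pattern are treated as regular
# expressions matched against SAE Lens release names and SAE ids respectively.
import re

release = "sae_bench_pythia70m_sweep_standard_ctx128_0712"
sae_id = "blocks.4.hook_resid_post__trainer_10"

# Select every trainer at layer 4 of this sweep:
release_pattern = r"sae_bench_pythia70m_sweep_standard_ctx128_0712"
sae_block_pattern = r"blocks\.4\.hook_resid_post__trainer_.*"

print(bool(re.fullmatch(release_pattern, release)))   # True
print(bool(re.fullmatch(sae_block_pattern, sae_id)))  # True
```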
Every eval folder contains an `eval_config.py`, which contains all relevant hyperparameters for that evaluation. The values are currently set to the default recommended values.
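If you prefer to override a hyperparameter in code rather than editing `eval_config.py`, something along the following lines should work; the config class and field names here are assumptions for illustration, so check the `eval_config.py` of the eval you are running for the real ones.

```python
# Hypothetical sketch: each eval defines a config class in its eval_config.py.
# SparseProbingEvalConfig and random_seed are assumed names; substitute
# whatever your eval's eval_config.py actually defines.
from sae_bench.evals.sparse_probing.eval_config import SparseProbingEvalConfig

config = SparseProbingEvalConfig()  # starts from the recommended defaults
config.random_seed = 0              # override a single hyperparameter
```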
## Custom SAE Usage
Our goal is to have first-class support for custom SAEs, as the field is rapidly evolving. Our evaluations can run on any SAE object with `encode()`, `decode()`, and a few config values. We recommend referring to `sae_bench_demo.ipynb`, in which we load a custom SAE and an SAE Bench baseline SAE, run them on two evals, and graph the results. There is additional information about custom SAE usage in `sae_bench/custom_saes/README.md`.
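As an illustrative sketch of the kind of object the evals expect (this is not the authoritative interface; the demo notebook and `sae_bench/custom_saes/README.md` define the required config values, which are omitted here):

```python
# Sketch of a minimal custom SAE: a torch module exposing encode() and decode().
# The few config values SAE Bench expects (e.g. hook point, dtype) are omitted;
# see sae_bench/custom_saes/README.md for the authoritative list.
import torch
import torch.nn as nn


class MinimalReluSAE(nn.Module):
    def __init__(self, d_in: int, d_sae: int):
        super().__init__()
        self.W_enc = nn.Parameter(torch.zeros(d_in, d_sae))
        self.b_enc = nn.Parameter(torch.zeros(d_sae))
        self.W_dec = nn.Parameter(torch.zeros(d_sae, d_in))
        self.b_dec = nn.Parameter(torch.zeros(d_in))

    def encode(self, x: torch.Tensor) -> torch.Tensor:
        # Map model activations to sparse feature activations.
        return torch.relu((x - self.b_dec) @ self.W_enc + self.b_enc)

    def decode(self, feature_acts: torch.Tensor) -> torch.Tensor:
        # Reconstruct model activations from feature activations.
        return feature_acts @ self.W_dec + self.b_dec
```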
If your SAEs are trained with the [dictionary_learning repo](https://github.com/saprmarks/dictionary_learning), you can evaluate your SAEs by passing in the name of the HuggingFace repo containing your SAEs. Refer to `sae_bench/custom_saes/run_all_evals_dictionary_learning_saes.py`.
For other SAE types, refer to `sae_bench/custom_saes/run_all_evals_custom_saes.py`.
We currently have a suite of SAE Bench SAEs on layer 8 of Pythia-160M and layer 12 of Gemma-2-2B, each trained on 500M tokens, with some including checkpoints saved at various points during training. These SAEs can serve as baselines for any new custom SAEs. We also have baseline eval results, saved [here](https://huggingface.co/datasets/adamkarvonen/sae_bench_results_0125). For more information, refer to `sae_bench/custom_saes/README.md`.
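To pull the baseline results down for comparison, a short sketch using `huggingface_hub` (the dataset repo id is taken from the link above):

```python
# Sketch: download the published SAE Bench baseline eval results so you can
# compare a new custom SAE against them locally.
from huggingface_hub import snapshot_download

local_path = snapshot_download(
    repo_id="adamkarvonen/sae_bench_results_0125",
    repo_type="dataset",
)
print("Baseline results downloaded to:", local_path)
```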
## Training Your Own SAEs
You can deterministically replicate the training of our SAEs using the scripts provided [here](https://github.com/adamkarvonen/dictionary_learning_demo), implement your own SAE, or modify one of our SAE implementations. Once you have trained your new version, you can benchmark it against our existing SAEs for a true apples-to-apples comparison.
## Graphing Results
If evaluating your own SAEs, we recommend using the graphing cells in `sae_bench_demo.ipynb`. To replicate all SAE Bench plots, refer to `graphing.ipynb`. In this notebook, we download all SAE Bench data and create a variety of plots.
## Computational Requirements
The computational requirements for running SAE Bench evaluations were measured on an NVIDIA RTX 3090 GPU using 16K-width SAEs trained on the Gemma-2-2B model. The table below breaks down the timing for each evaluation type into two components: an initial setup phase and the per-SAE evaluation time.
- **Setup Phase**: One-time preprocessing steps, such as precomputing model activations or training probes, that can be reused across multiple SAE evaluations.
- **Per-SAE Evaluation Time**: The time required to evaluate a single SAE once the setup is complete.
The total evaluation time for a single SAE across all benchmarks is approximately **65 minutes**, with an additional **107 minutes** of setup time. Note that actual runtimes may vary significantly based on factors such as SAE dictionary size, base model, and GPU selection.
| Evaluation Type | Avg Time per SAE (min) | Setup Time (min) |
| --------------- | ---------------------- | ---------------- |
| Absorption | 26 | 33 |
| Core | 9 | 0 |
| SCR | 6 | 22 |
| TPP | 2 | 5 |
| Sparse Probing | 3 | 15 |
| Auto-Interp | 9 | 0 |
| Unlearning | 10 | 33 |
| **Total** | **65** | **107** |
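Since the setup phase is shared across SAEs, the total wall-clock time for a sweep grows linearly after a fixed offset. A quick back-of-the-envelope helper using the numbers in the table above:

```python
# Rough estimate of total runtime (in minutes) for evaluating n SAEs across all
# benchmarks, using the RTX 3090 / Gemma-2-2B / 16K-width numbers above.
SETUP_MINUTES = 107    # one-time setup, shared across SAEs
PER_SAE_MINUTES = 65   # per-SAE evaluation time


def estimated_total_minutes(n_saes: int) -> int:
    return SETUP_MINUTES + n_saes * PER_SAE_MINUTES


print(estimated_total_minutes(10))  # 107 + 10 * 65 = 757 minutes (~12.6 hours)
```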
## Development
This project uses [Poetry](https://python-poetry.org/) for dependency management and packaging.
To install the development dependencies, run:
```
poetry install
```
### Linting and Formatting
This project uses [Ruff](https://github.com/astral-sh/ruff) for linting and formatting. To run linting, run:
```
make lint
```
To run formatting, run:
```
make format
```
To run type checking, run:
```
make check-type
```
### Testing
Unit tests can be run with:
```
poetry run pytest tests/unit
```
These tests run automatically on every PR in CI.
There are also acceptance tests that can be run with:
```
poetry run pytest tests/acceptance
```
These tests are expensive and will not be run automatically in CI, but are worth running manually before large changes.
### Running all CI checks locally
Before submitting a PR, run:
```
make check-ci
```
This will run linting, formatting, type checking, and unit tests. If these all pass, your PR should be good to go!
### Configuring VSCode for auto-formatting
If you use VSCode, install the Ruff plugin, and add the following to your `.vscode/settings.json` file:
```json
{
  "[python]": {
    "editor.formatOnSave": true,
    "editor.codeActionsOnSave": {
      "source.fixAll": "explicit",
      "source.organizeImports": "explicit"
    },
    "editor.defaultFormatter": "charliermarsh.ruff"
  }
}
```
### Pre-commit hook
There's a pre-commit hook that runs Ruff and Pyright on each commit. To install it, run:
```bash
poetry run pre-commit install
```
### Updating Eval Output Schemas
Eval output structures / data types are defined in the `eval_output.py` file in each eval directory. If any `eval_output.py` file is updated, run `python sae_bench/evals/generate_json_schemas.py` to regenerate the matching JSON schemas.