| Name | latentsae |
| Version | 0.1.0 |
| download | |
| home_page | https://github.com/enjalot/latent-sae |
| Summary | LatentSAE: Training and inference for SAEs on embeddings |
| upload_time | 2024-09-06 15:51:21 |
| maintainer | None |
| docs_url | None |
| author | None |
| requires_python | None |
| license | None |
| keywords | |
| VCS | |
| bugtrack_url | |
| requirements | No requirements were recorded. |
| Travis-CI | No Travis. |
| coveralls test coverage | No coveralls. |
# latent-sae
This is essentially a fork of [EleutherAI/sae](https://github.com/EleutherAI/sae) focused on training Sparse Autoencoders on Sentence Transformer embeddings. The main differences are:
1) Focus on training only one model on one set of inputs
2) Load training data (embeddings) quickly from disk
## Inference
```python
# !pip install latentsae
from latentsae import Sae
sae_model = Sae.load_from_hub("enjalot/sae-nomic-text-v1.5-FineWeb-edu-100BT", "64_32")
# or from disk
sae_model = Sae.load_from_disk("models/sae_64_32.3mq7ckj7")
# Get some embeddings
texts = ["Hello world", "Will I ever halt?", "Goodbye world"]
from sentence_transformers import SentenceTransformer
emb_model = SentenceTransformer("nomic-ai/nomic-embed-text-v1.5", trust_remote_code=True)
embeddings = emb_model.encode(texts, convert_to_tensor=True, normalize_embeddings=True)
features = sae_model.encode(embeddings)
print(features.top_indices)
print(features.top_acts)
```
See [notebooks/eval.ipynb](notebooks/eval.ipynb) for an example of how to use the model for extracting features from an embedding dataset.
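Beyond a handful of texts, feature extraction is typically done over a whole embedding dataset. Below is a minimal sketch of one way to batch that, reusing `emb_model` and `sae_model` from the snippet above; the `extract_features` helper and the batch size are hypothetical, not part of the latentsae API.
```python
# Hypothetical helper for batched feature extraction; not part of the latentsae API.
# Reuses emb_model and sae_model loaded as in the snippet above.
import torch

def extract_features(texts, emb_model, sae_model, batch_size=256):
    all_indices, all_acts = [], []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        embeddings = emb_model.encode(batch, convert_to_tensor=True, normalize_embeddings=True)
        features = sae_model.encode(embeddings)
        all_indices.append(features.top_indices.cpu())
        all_acts.append(features.top_acts.cpu())
    # One row per text: the k active feature ids and their activation strengths.
    return torch.cat(all_indices), torch.cat(all_acts)

top_indices, top_acts = extract_features(texts, emb_model, sae_model)
```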
## Training
The main way to train (that I've gotten working) is using Modal Labs infrastructure:
```bash
modal run train_modal.py --batch-size 512 --grad-acc-steps 4 --k 64 --expansion-factor 128
```
I also have some initial code for training locally:
```bash
python train_local.py --batch-size 512 --grad-acc-steps 4 --k 64 --expansion-factor 128
```
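For intuition, here is what `--batch-size` and `--grad-acc-steps` control, sketched as a generic PyTorch gradient-accumulation loop. This is not the training loop from this repo; `model`, `optimizer`, `loader`, and `compute_loss` are assumed to exist.
```python
# Illustration of gradient accumulation, not the repo's actual training loop.
# `model`, `optimizer`, `loader` (yielding batches of 512 embeddings), and
# `compute_loss` are assumed to exist.
grad_acc_steps = 4  # effective batch size = 512 * 4 = 2048

optimizer.zero_grad()
for step, batch in enumerate(loader):
    loss = compute_loss(model, batch) / grad_acc_steps  # scale so the accumulated gradient averages over micro-batches
    loss.backward()
    if (step + 1) % grad_acc_steps == 0:  # update weights only every grad_acc_steps batches
        optimizer.step()
        optimizer.zero_grad()
```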
## Data Preparation
I wrote a detailed article on the methodology behind the data, training and analysis of the SAEs trained with this repo:
[Latent Taxonomy Methodology](https://enjalot.github.io/latent-taxonomy/articles/about)
I used [Modal Labs](https://modal.com) to rent VMs and GPUs for the data preprocessing and training. See [enjalot/fineweb-modal](https://github.com/enjalot/fineweb-modal) for the scripts used to preprocess the FineWeb-EDU 10BT and 100BT samples and embed them with [nomic-embed-text-v1.5](https://huggingface.co/nomic-ai/nomic-embed-text-v1.5).
I first trained on the 10BT sample, chunked into 500-token chunks, which is available [on HuggingFace](https://huggingface.co/datasets/enjalot/fineweb-edu-sample-10BT-chunked-500-nomic-text-v1.5). This gave 25 million embeddings to train on.
From the wandb charts it looked like the model could improve further with more data, so I then prepared 10x the embeddings from the 100BT sample. I'm still working on uploading that to HF.
For testing the code locally I downloaded a single parquet file from the dataset.
For the full training run, I downloaded the whole dataset to disk in a modal volume, then processed it into sharded torch .pt files using this script: [torched.py](https://github.com/enjalot/fineweb-modal/blob/main/torched.py)
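The conversion itself happens in torched.py; the sketch below only illustrates the idea of packing parquet embedding columns into fixed-size torch shards. The `embedding` column name, shard size, and file paths are assumptions, not necessarily what the script uses.
```python
# Rough sketch of packing parquet embeddings into sharded torch .pt files.
# The real logic lives in torched.py (enjalot/fineweb-modal); column name,
# shard size, and paths here are assumptions for illustration.
import glob
import numpy as np
import pandas as pd
import torch

shard_size = 1_000_000  # embeddings per shard (assumed)
buffer, shard_id = [], 0

def flush(buffer, shard_id):
    torch.save(torch.cat(buffer), f"shards/shard_{shard_id:04d}.pt")

for path in sorted(glob.glob("data/*.parquet")):
    df = pd.read_parquet(path, columns=["embedding"])
    emb = torch.from_numpy(np.stack(df["embedding"].to_numpy())).float()
    buffer.append(emb)
    if sum(t.shape[0] for t in buffer) >= shard_size:
        flush(buffer, shard_id)
        buffer, shard_id = [], shard_id + 1

if buffer:  # write the final partial shard
    flush(buffer, shard_id)
```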
## Parameters
The main parameters I tried to change were:
- batch-size: how many embeddings per batch (bigger is generally better); settled on 512 as a performance tradeoff
- grad-acc-steps: how many batches to accumulate before updating gradients, which simulates a bigger batch size. I'm not sure what the penalty is for making this really big; settled on 4 with a batch size of 512
- k: sparsity, i.e. how many top features to keep per embedding. Fewer is sparser and more interpretable, but gives worse reconstruction error. Tried 64 and 128, but I'm unsure how to measure the quality differences yet
- expansion-factor: a multiplier on the input embedding dimension (768 in the case of nomic). Chose 32 and 128 to give ~25k and ~100k features respectively (see the sketch after this list)
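For concreteness, here is the arithmetic behind those feature counts (768 is the nomic-embed-text-v1.5 dimension); this is just back-of-the-envelope math, not code from this repo.
```python
# Back-of-the-envelope: how expansion factor and k relate to the SAE's width.
d_in = 768  # nomic-embed-text-v1.5 embedding dimension

for expansion_factor in (32, 128):
    num_features = d_in * expansion_factor
    print(expansion_factor, num_features)  # 32 -> 24576 (~25k), 128 -> 98304 (~100k)

# k caps how many of those features fire per embedding: with k=64,
# each embedding is summarized by its 64 strongest features.
```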
### Open questions
Another thought I might try is processing the data into even smaller chunks. At 500 tokens the samples are quite large, and I believe we are essentially aggregating a lot of features across those tokens.
If we chunked at something like 100 tokens, each sample would be much more granular and we would also have 5x more training data (see the sketch below).
Again, I'm not sure how I'd evaluate the quality tradeoff of this yet.
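Here is a rough sketch of what 100-token chunking could look like, using the tokenizer that ships with the nomic model; the actual chunking logic in fineweb-modal may differ, and the 100-token size is just the value floated above.
```python
# Rough sketch of re-chunking documents into ~100-token pieces before embedding.
# The actual chunking in enjalot/fineweb-modal may differ; this is illustrative only.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("nomic-ai/nomic-embed-text-v1.5")

def chunk_text(text, chunk_tokens=100):
    ids = tokenizer.encode(text, add_special_tokens=False)
    return [
        tokenizer.decode(ids[i:i + chunk_tokens])
        for i in range(0, len(ids), chunk_tokens)
    ]

chunks = chunk_text("some long FineWeb-EDU document ...")
```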
Part of the motivation for this repo and the [fineweb-modal](https://github.com/enjalot/fineweb-modal) repo is to make it easier
to train SAEs on other datasets. FineWeb-EDU has certain desirable properties for some downstream tasks, but I can imagine training on a large dataset of code or a more general corpus like RedPajama v2.