llama-layer-collector

Name: llama-layer-collector
Version: 1.0.6
Summary: A tool for loading and computing on parts of Llama models.
Author email: Erin Clemmer <erin.c.clemmer@gmail.com>
Homepage: https://github.com/erinclemmer/llama-layer-collector
Requires Python: >=3.10
Upload time: 2025-02-22 23:21:59
Keywords: llama, safetensors, torch, transformers

Llama Layer Collector
=====================

![PyPI - Version](https://img.shields.io/pypi/v/llama-layer-collector)

**Llama Layer Collector** is a lightweight Python package for selectively loading and computing on individual layers of Llama-based language models. It is especially helpful when working with large, sharded checkpoints that you’d like to load only partially, or when you need granular access to model internals (embeddings, norms, decoder layers, etc.).

* * *

Key Features
------------

*   **Layer-by-Layer Loading:** Specify which layers to load (e.g., layers `0` through `10`) rather than loading the entire model.
*   **Caching for Speed:** Create and reuse cached metadata about shard files to avoid repeated scanning of checkpoints.
*   **Flexible Device & Precision Support:** Easily move layers to CPU or GPU and configure their precision (e.g., `torch.float16`).
*   **Helper Compute Functions:** Built-in utilities (e.g., `compute_embedding`, `compute_layer`, and `compute_head`) to perform partial or full forward passes without building an entire model class.

* * *

Installation
------------

You can install **Llama Layer Collector** directly from PyPI:

`pip install llama-layer-collector`

* * *

Class Overview: LlamaLayerCollector
-----------------------------------

The **LlamaLayerCollector** is initialized with several parameters that give you fine-grained control over how model layers are discovered and loaded:

*   **model\_dir (str)**  
    A required path to the directory containing model shards and a `config.json` file.
    
*   **cache\_file (str, optional)**  
    Path to a JSON file used for caching shard metadata. If no cache file is specified, the collector still builds metadata in memory but does not persist it for future runs.
    
*   **shard\_pattern (str, optional)**  
    A regular expression (default: `'model-(\\d+)-of-(\\d+).safetensors'`) indicating how shard files are named.
    
*   **layer\_prefix (str, optional)**  
    The string prefix identifying decoder layer keys in your model checkpoints (default: `'model.layers.'`).
    
*   **input\_embedding\_layer\_name (str, optional)**  
    Name of the input embedding weight parameter (default: `'model.embed_tokens.weight'`).
    
*   **norm\_layer\_name (str, optional)**  
    Name of the RMS norm layer weight parameter (default: `'model.norm.weight'`).
    
*   **lm\_head\_name (str, optional)**  
    Name of the LM head weight parameter (default: `'lm_head.weight'`).
    
*   **dtype (torch.dtype, optional)**  
    Data type (default: `torch.float16`) used when loading all model weights.
    
*   **device (str, optional)**  
    Device on which the loaded tensors will be placed (default: `'cpu'`, though `'cuda'` is common for GPU usage).
    

During initialization, the collector checks for a `config.json` file in `model_dir`. If the file is missing, a `FileNotFoundError` is raised.
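
For reference, a constructor call that spells out the documented defaults might look like the sketch below. The two paths are placeholders, and every other keyword simply repeats the default listed above:

```python
import torch
from llama_layer_collector import LlamaLayerCollector

# Illustrative only: model_dir and cache_file are placeholder paths; the
# remaining keywords mirror the documented defaults.
collector = LlamaLayerCollector(
    model_dir="/path/to/llama/model",    # must contain config.json and the shard files
    cache_file="model_cache.json",       # optional; omit to keep shard metadata in memory only
    shard_pattern=r"model-(\d+)-of-(\d+).safetensors",
    layer_prefix="model.layers.",
    input_embedding_layer_name="model.embed_tokens.weight",
    norm_layer_name="model.norm.weight",
    lm_head_name="lm_head.weight",
    dtype=torch.float16,
    device="cpu",
)
```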

### Commonly Used Methods

*   **`load_input_embedding()`**  
    Loads and returns a PyTorch `Embedding` layer for token embeddings.
    
*   **`load_norm()`**  
    Returns the RMSNorm layer (`LlamaRMSNorm` in Llama-based models) with loaded weights.
    
*   **`load_head()`**  
    Provides a linear layer for the LM head. If the head weights are not found, it defaults to using the input embedding weights.
    
*   **`load_layer_set(start_layer: int, end_layer: int)`**  
    Loads a specified range of decoder layers (e.g., from layer `0` to layer `5`), returning them as a list; see the sketch below.
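
Building on these loaders, here is a minimal partial-loading sketch. The paths are placeholders, and it assumes `load_layer_set(0, 5)` yields decoder layers 0 through 5 as described above:

```python
from llama_layer_collector import LlamaLayerCollector
from llama_layer_collector.compute import compute_embedding, compute_layer
from transformers import AutoTokenizer

# Load only the first few decoder layers and push an input through them,
# without materializing the rest of the model.
collector = LlamaLayerCollector(model_dir="/path/to/llama/model")

tokenizer = AutoTokenizer.from_pretrained("/path/to/llama/model")
input_ids = tokenizer("Hello", return_tensors="pt")["input_ids"]

embedding = collector.load_input_embedding()
early_layers = collector.load_layer_set(0, 5)

state = compute_embedding(embedding, input_ids, collector.config)
for layer in early_layers:
    state.state = compute_layer(layer, state)

# state.state now holds the hidden states after the loaded layers; the remaining
# layers could be loaded and applied in a later pass to keep peak memory low.
```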
    

* * *

Example Usage
-------------

Below is a minimal example demonstrating how to load a Llama model’s layers individually, tokenize an input, and run a partial forward pass. This setup is particularly useful for memory-constrained environments or for debugging/tracing through specific model layers.

```python
import torch
from llama_layer_collector import LlamaLayerCollector
from llama_layer_collector.compute import compute_embedding, compute_layer, compute_head
from transformers import AutoTokenizer

# Specify the directory containing your model checkpoints and configuration.
model_directory = "/path/to/llama/model"
cache_file = "model_cache.json"

# Create a collector instance with the desired settings.
collector = LlamaLayerCollector(
    model_dir=model_directory,
    cache_file=cache_file,
    device="cuda",  # or "cpu"
    dtype=torch.float16,
)

# Load the tokenizer from Transformers.
tokenizer = AutoTokenizer.from_pretrained(model_directory)
input_ids = tokenizer("The quick brown fox ", return_tensors="pt")["input_ids"]

# Load the input embedding layer.
embedding = collector.load_input_embedding()

# Load the normalization layer.
norm = collector.load_norm()

# Load the LM head (falls back to the embedding weights if unavailable).
head = collector.load_head()

# Load a set of decoder layers (in this example, all layers).
layers = collector.load_layer_set(0, collector.num_layers)

# Perform a forward pass using the helper computation functions.
state = compute_embedding(embedding, input_ids, collector.config)
for lyr in layers:
    state.state = compute_layer(lyr, state)

# Compute the final output logits and retrieve the top predicted token ID.
result = compute_head(head, norm(state.state), topk=1)
print(f"Top predicted token ID: {result}")
```
1.  **Initialize the Collector**:  
    The `LlamaLayerCollector` scans your model directory, identifies shard files, and (optionally) caches metadata for fast reuse. 
2.  **Load Model Pieces**:  
    Grab individual components (embeddings, normalization, head, and a range of layers) as needed. 
3.  **Partial or Full Computation**:  
    Use the provided functions in `llama_layer_collector.compute` to sequentially pass data through each layer. This is especially handy for stepping through intermediate activations or customizing layer outputs.
4.  **Retrieve Predictions**:  
    Pass the final hidden state through the LM head, apply a softmax, and retrieve top-k token IDs; a short decoding follow-up appears after this list.
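
As a small follow-up, the predicted ID can be mapped back to text. The sketch below assumes `result` holds the top-1 token ID (or a one-element list of IDs) and reuses `tokenizer` from the example:

```python
# Hypothetical follow-up to the example above: decode the predicted token ID.
token_id = result[0] if isinstance(result, (list, tuple)) else result
print(tokenizer.decode([token_id]))
```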
    

* * *

When to Use This Package
------------------------

*   **Memory Constraints**: If your environment cannot hold an entire Llama model in memory, load only the layers you need.
*   **Debugging**: Trace the forward pass one layer at a time to analyze intermediate states (see the sketch after this list).
*   **Research & Development**: Experiment with custom modifications to specific layers or partial fine-tuning without instantiating the full model.
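
For the debugging case, a minimal sketch that records the hidden state after each decoder layer. It reuses `collector`, `embedding`, `layers`, and `input_ids` from the example above and assumes `state.state` is a `torch.Tensor`:

```python
# Capture a CPU copy of the hidden state after every decoder layer so that
# intermediate activations can be inspected or compared offline.
activations = []
state = compute_embedding(embedding, input_ids, collector.config)
for i, lyr in enumerate(layers):
    state.state = compute_layer(lyr, state)
    snapshot = state.state.detach().to("cpu")
    activations.append(snapshot)
    print(f"layer {i:02d}: mean activation = {snapshot.float().mean().item():.4f}")
```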

* * *

Additional Notes
----------------

*   **Shard Pattern**: By default, the collector looks for files named `model-<NUM>-of-<NUM>.safetensors`. You can override this pattern in the constructor if your files follow a different naming convention (a short sketch follows these notes).
*   **Caching**: When a `cache_file` path is supplied, the collector creates and updates a JSON cache file (e.g., `model_cache.json`) so shard file information can be retrieved quickly on later runs.
*   **Helper Compute Functions**:
    *   `compute_embedding`: Prepares the input embedding state and sets up the causal mask.
    *   `compute_layer`: Passes the current hidden state through a `LlamaDecoderLayer`.
    *   `compute_head`: Applies the final linear head to generate logits, then returns the top token(s).
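
For example, if your shards follow a different naming convention, the pattern can be overridden at construction time. A sketch with a hypothetical filename scheme:

```python
# Hypothetical: shards named like "pytorch_model-00001-of-00004.safetensors".
collector = LlamaLayerCollector(
    model_dir="/path/to/llama/model",
    shard_pattern=r"pytorch_model-(\d+)-of-(\d+).safetensors",
)
```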

* * *

Contributing
------------

Feedback, bug reports, and pull requests are welcome! Please open an issue or submit a PR on GitHub if you have any ideas for improvements or new features.

* * *

License
-------

This project is released under the MIT License.
            
