# 🦙 Llama Layer Collector

A practical Python package for working with Llama-based models at the layer level. Designed to help developers and researchers load specific model components when working with large, sharded checkpoints.
## ✨ What It Does
- 🎯 Selective layer loading for more efficient resource usage
- 🚀 Streamlined handling of model checkpoints
- 💡 Useful for research, development, and memory-constrained environments
## 🛠️ Core Capabilities
Our package excels in three key areas:
### Precision Layer Control
Select which model components to load, from embedding layers to specific decoder blocks. This helps manage memory usage and processing requirements for your use case.
### Modular Architecture
Design your model processing pipeline by working with individual components. This approach enables focused testing, targeted optimization, and easier debugging of model behavior.
### Streamlined Computation
Use helper functions for embedding computation, layer-wise processing, and head operations to simplify working with model components.
## 🚀 Getting Started
### Installation
```bash
pip install llama-layer-collector
```
### Essential Components
The `LlamaLayerCollector` class serves as your central interface to the package's functionality. Here's what you need to know about its key parameters:
#### Required Parameters:
- `model_dir`: Path to your model directory containing shards and configuration
- `device`: Target device for tensor operations ("cpu" or "cuda")
#### Optional Parameters:
- `dtype`: Desired numerical precision (default: `torch.float16`)
- `cache_file`: Location for storing shard metadata
- `shard_pattern`: Custom regex for matching shard files
- `layer_prefix`: Prefix for identifying decoder layers
- Various layer name parameters for custom architectures
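
For instance, a collector pointed at a checkpoint with a non-standard file layout might be configured like this. This is a minimal sketch: the `shard_pattern` and `layer_prefix` values shown are illustrative, not the package's documented defaults.

```python
import torch
from llama_layer_collector import LlamaLayerCollector

# Illustrative configuration; the shard_pattern regex and layer_prefix
# below are hypothetical examples, not the package defaults.
collector = LlamaLayerCollector(
    model_dir="/path/to/model",
    device="cpu",
    dtype=torch.float32,                                 # override the float16 default
    cache_file="shard_cache.json",                       # reuse shard metadata across runs
    shard_pattern=r"model-(\d+)-of-(\d+)\.safetensors",  # custom regex for shard files
    layer_prefix="model.layers.",                        # prefix identifying decoder layers
)
```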
## 💻 Example Usage
Here's how you might use Llama Layer Collector in practice:
```python
from llama_layer_collector import LlamaLayerCollector
from llama_layer_collector.compute import compute_embedding, compute_layer, compute_head
from transformers import AutoTokenizer
import torch

# Initialize core components
collector = LlamaLayerCollector(
    model_dir="/path/to/model",
    cache_file="cache.json",
    device="cuda",
    dtype=torch.float16
)

# Set up tokenization
tokenizer = AutoTokenizer.from_pretrained("/path/to/model")
input_text = "The quick brown fox"
input_ids = tokenizer(input_text, return_tensors='pt')['input_ids']

# Load model components
embedding = collector.load_input_embedding()
norm = collector.load_norm()
head = collector.load_head()
layers = collector.load_layer_set(0, collector.num_layers - 1)

# Execute forward pass
state = compute_embedding(embedding, input_ids, collector.config)
for layer in layers:
    state.state = compute_layer(layer, state)

# Generate predictions
predictions = compute_head(head, norm(state.state), topk=1)
```
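
Note the data flow: `compute_embedding` returns a state object whose `state` attribute holds the current hidden representation, and `compute_layer` returns the updated tensor, so the loop threads the hidden state through each decoder block in turn before the normalized result is passed to the head for top-1 token prediction.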
## 📌 Optimal Use Cases
### Resource-Constrained Environments
Perfect for scenarios where loading entire models is impractical or impossible. Load only the layers you need, when you need them.
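
For example, since `load_layer_set` takes a start and end index (inclusive, judging by the full-range call in the example above), you can pull in just a handful of decoder blocks:

```python
import torch
from llama_layer_collector import LlamaLayerCollector

collector = LlamaLayerCollector(
    model_dir="/path/to/model",
    device="cpu",
    dtype=torch.float16,
)

# Load only the first four decoder blocks; the remaining layers are
# never materialized, keeping memory proportional to what you load.
early_layers = collector.load_layer_set(0, 3)
```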
### Model Development
Ideal for researchers and developers who need to:
- Analyze intermediate layer outputs (see the sketch after this list)
- Experiment with architectural modifications
- Implement custom layer combinations
- Debug model behavior at a granular level
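
As a sketch of the first item, intermediate outputs can be inspected by recording the hidden state after each decoder block. This assumes the `collector`, `embedding`, `layers`, and `input_ids` objects set up in the Example Usage section above.

```python
from llama_layer_collector.compute import compute_embedding, compute_layer

# Assumes collector, embedding, layers, and input_ids are prepared
# as in the Example Usage section.
state = compute_embedding(embedding, input_ids, collector.config)
layer_norms = []
for i, layer in enumerate(layers):
    state.state = compute_layer(layer, state)
    # Record a simple summary statistic of each block's output.
    layer_norms.append((i, state.state.norm().item()))
```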
### Production Optimization
Streamline production deployments by loading only essential model components, reducing memory footprint and improving resource utilization.
## ⚙️ Technical Details
### Shard Management
- Default pattern: `model-<NUM>-of-<NUM>.safetensors`
- Customizable through constructor parameters
- Efficient metadata caching via JSON
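
As a rough illustration, the documented default pattern corresponds to a regex along these lines; the exact expression used internally may differ in detail.

```python
import re

# Illustrative regex for the documented default shard naming scheme.
shard_re = re.compile(r"model-(\d+)-of-(\d+)\.safetensors")

print(bool(shard_re.fullmatch("model-00001-of-00004.safetensors")))  # True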
### Computation Pipeline
Our helper functions provide a streamlined approach to model operations:
- `compute_embedding`: Handles input embedding and causal mask setup
- `compute_layer`: Manages state transitions through decoder layers
- `compute_head`: Processes final linear projections and token prediction
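
Because each helper operates on an explicit state object, nothing requires all layers to be resident at once. Here is a hedged sketch of a two-stage forward pass, assuming the `collector`, `embedding`, `norm`, `head`, and `input_ids` objects from the Example Usage section:

```python
# Run the forward pass in two stages by loading disjoint layer ranges.
state = compute_embedding(embedding, input_ids, collector.config)

mid = collector.num_layers // 2
for layer in collector.load_layer_set(0, mid - 1):                      # first half
    state.state = compute_layer(layer, state)
for layer in collector.load_layer_set(mid, collector.num_layers - 1):  # second half
    state.state = compute_layer(layer, state)

predictions = compute_head(head, norm(state.state), topk=1)
```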