# TinyEdgeLLM
[DOI: 10.5281/zenodo.17300476](https://doi.org/10.5281/zenodo.17300476) · [PyPI](https://pypi.org/project/tinyedgellm/) · [Python 3.8+](https://www.python.org/downloads/) · [License: MIT](https://opensource.org/licenses/MIT) · [Documentation](https://krish567366.github.io/tinyedgellm/) · [Code style: black](https://github.com/psf/black) · [Issues](https://github.com/krish567366/tinyedgellm/issues) · [Stars](https://github.com/krish567366/tinyedgellm/stargazers) · [CI](https://github.com/krish567366/tinyedgellm/actions) · [Commits](https://github.com/krish567366/tinyedgellm/commits/main)
A modular framework for compressing and deploying Large Language Models (LLMs) to edge devices.
## Problem
Cloud-based LLMs are a poor fit for IoT and edge applications: they impose high latency, heavy bandwidth requirements, and significant energy costs. TinyEdgeLLM addresses this by enabling efficient on-device inference through model compression.
## Solution
TinyEdgeLLM provides a hybrid Python/C++ library that implements:
- **Advanced Quantization**: GPTQ, AWQ, and BitsAndBytes 4-bit quantization
- **Structured Pruning**: Attention head, neuron, and layer pruning algorithms
- **Knowledge Distillation**: Teacher-student training for compressed models
- **Mixed-Precision Quantization**: 2-bit, 4-bit, and 8-bit precision levels
- **Cross-Platform Deployment**: export to ONNX, TensorFlow Lite, and TorchScript
- **Edge Optimization**: tuning for TinyML-class hardware
## Features
- **Advanced Quantization**: State-of-the-art techniques (GPTQ, AWQ, BitsAndBytes)
- **Structured Pruning**: Data-driven pruning of attention heads, neurons, and layers
- **Knowledge Distillation**: Train compressed student models to mimic larger teachers
- **Quantization**: Post-training quantization (PTQ) and quantization-aware training (QAT)
- **Pruning**: Legacy magnitude-based pruning with sensitivity analysis
- **Deployment**: Backend-agnostic export with graph optimization
- **Benchmarking**: Performance metrics for latency, memory, and energy efficiency (see the measurement sketch after this list)
- **Modular API**: Easy integration with HuggingFace models
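As a point of reference for what the benchmarking layer measures, here is a minimal, library-independent sketch of per-token latency measurement for any HuggingFace causal LM. The `measure_latency` helper is illustrative and not part of the TinyEdgeLLM API.

```python
import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def measure_latency(model, tokenizer, prompt, new_tokens=32, runs=5):
    """Mean wall-clock seconds per generated token, averaged over several runs."""
    inputs = tokenizer(prompt, return_tensors="pt")
    model.eval()
    per_token = []
    with torch.no_grad():
        for _ in range(runs):
            start = time.perf_counter()
            model.generate(**inputs, max_new_tokens=new_tokens,
                           do_sample=False, pad_token_id=tokenizer.eos_token_id)
            per_token.append((time.perf_counter() - start) / new_tokens)
    return sum(per_token) / len(per_token)

model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
print(f"{1000 * measure_latency(model, tokenizer, 'Edge inference is'):.1f} ms/token")
```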
## Performance Results
TinyEdgeLLM achieves significant compression while maintaining model quality. In the table below, the compression ratio is the original model size divided by the compressed size, and the perplexity ratio is the compressed model's perplexity divided by the baseline's, so values near 1.00 mean quality is preserved:
| Compression Method | Model Size | Compression Ratio | Perplexity Ratio | Status |
|-------------------|------------|------------------|------------------|---------|
| Original GPT-2 | 487MB | 1.0x | 1.00 | Baseline |
| Basic 8-bit Quantization | 249MB | 1.95x | 1.00 | ✅ Working |
| Basic 4-bit Quantization | 249MB | 1.95x | 1.00 | ✅ Working |
| 4-bit + Structured Pruning | ~174MB | ~2.8x | ~1.05 | ✅ Working |
| 4-bit + Pruning + Distillation | ~152MB | ~3.2x | ~1.02 | ✅ Working |
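For concreteness about the quality metric, here is a minimal sketch of computing a causal LM's perplexity with plain HuggingFace APIs (not TinyEdgeLLM's benchmarking code); the perplexity ratio above is this value for the compressed model divided by the baseline's.

```python
import math

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def perplexity(model, tokenizer, text: str) -> float:
    """Exponentiated mean token negative log-likelihood."""
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # HF causal LMs shift the labels internally and return the mean NLL.
        loss = model(**enc, labels=enc["input_ids"]).loss
    return math.exp(loss.item())

model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
tokenizer = AutoTokenizer.from_pretrained("gpt2")
print(perplexity(model, tokenizer, "Edge devices run compressed language models."))
```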
**Key Achievements:**
- **Up to 3.2x compression** with minimal quality degradation (<2% perplexity increase)
- **Modular pipeline** combining quantization, pruning, and distillation
- **Research-grade techniques** including GPTQ, AWQ, and knowledge distillation
- **Production-ready** with ONNX export and benchmarking tools
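To make the deployment path concrete, the snippet below is a generic sketch using plain `torch.onnx.export` on a HuggingFace model. TinyEdgeLLM's `target_platform='onnx'` option wraps a step of this kind; the exact graph optimizations it applies may differ.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
model.config.use_cache = False      # drop past-key-value outputs for a clean graph
model.config.return_dict = False    # trace tuple outputs instead of ModelOutput
tokenizer = AutoTokenizer.from_pretrained("gpt2")
example = tokenizer("hello world", return_tensors="pt")

# Export input_ids -> logits with dynamic batch and sequence dimensions.
torch.onnx.export(
    model,
    (example["input_ids"],),
    "gpt2.onnx",
    input_names=["input_ids"],
    output_names=["logits"],
    dynamic_axes={"input_ids": {0: "batch", 1: "sequence"},
                  "logits": {0: "batch", 1: "sequence"}},
    opset_version=14,
)
```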
## Advanced Compression Techniques
### Quantization Methods
- **GPTQ**: Accurate one-shot post-training 4-bit quantization driven by approximate second-order (Hessian) information
- **AWQ (Activation-aware Weight Quantization)**: Protects salient weights based on activation patterns
- **BitsAndBytes**: Efficient 4-bit quantization with hardware acceleration support
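All three methods refine the same primitive: mapping floating-point weights onto a small integer grid. The sketch below shows the plain round-to-nearest, per-channel symmetric baseline that GPTQ and AWQ improve upon; it is illustrative, not TinyEdgeLLM's internal implementation.

```python
import torch

def quantize_rtn(weight: torch.Tensor, bits: int = 4):
    """Round-to-nearest symmetric quantization, one scale per output channel."""
    qmax = 2 ** (bits - 1) - 1                      # e.g. 7 for 4-bit
    scale = weight.abs().amax(dim=1, keepdim=True) / qmax
    q = torch.clamp(torch.round(weight / scale), -qmax - 1, qmax)
    return q.to(torch.int8), scale                  # integer codes + per-channel scales

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.float() * scale

w = torch.randn(768, 768)
q, s = quantize_rtn(w, bits=4)
print("max abs error:", (w - dequantize(q, s)).abs().max().item())
```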
### Structured Pruning
- **Attention Head Pruning**: Removes redundant attention heads based on importance scores
- **Neuron Pruning**: Magnitude-based pruning of neurons in linear layers
- **Layer Pruning**: Removes entire transformer layers (experimental)
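To make the neuron-pruning idea concrete, here is a minimal, dimension-preserving sketch for a single MLP pair: neurons are scored by the L2 norm of their outgoing weights and the least important ones are masked to zero. This is illustrative only; `apply_structured_pruning` applies the idea across a whole transformer.

```python
import torch
import torch.nn as nn

def mask_mlp_neurons(fc_in: nn.Linear, fc_out: nn.Linear, ratio: float = 0.1):
    """Zero out the lowest-magnitude hidden neurons of an MLP pair in place.

    Dimension-preserving: weights are masked rather than reshaped, so the
    surrounding architecture is untouched.
    """
    importance = fc_in.weight.norm(p=2, dim=1)              # one score per hidden neuron
    n_drop = int(fc_in.out_features * ratio)
    drop = importance.topk(n_drop, largest=False).indices   # least important neurons
    with torch.no_grad():
        fc_in.weight[drop] = 0.0                            # kill outgoing rows
        fc_in.bias[drop] = 0.0
        fc_out.weight[:, drop] = 0.0                        # and the matching input columns

fc1, fc2 = nn.Linear(768, 3072), nn.Linear(3072, 768)
mask_mlp_neurons(fc1, fc2, ratio=0.1)
print((fc1.weight.abs().sum(dim=1) == 0).sum().item(), "neurons zeroed")  # 307
```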
### Knowledge Distillation
- **Teacher-Student Training**: Compresses large models by training smaller models to mimic them
- **KL Divergence Loss**: Temperature-scaled KL divergence against the teacher's soft targets, blended with cross-entropy on the hard labels
- **Custom Student Architectures**: Support for different model sizes and configurations
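A minimal version of that combined objective is sketched below. The temperature `T=2.0` and mixing weight `alpha=0.5` are illustrative defaults, not TinyEdgeLLM's settings, and the logits are simplified to a single position per example for clarity.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      T: float = 2.0, alpha: float = 0.5):
    """Blend soft-target KL divergence with hard-label cross-entropy."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)                      # rescale to keep gradient magnitudes comparable
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

student_logits = torch.randn(8, 50257)          # batch of 8, GPT-2 vocab size
teacher_logits = torch.randn(8, 50257)
labels = torch.randint(0, 50257, (8,))
print(distillation_loss(student_logits, teacher_logits, labels).item())
```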
## Installation
```bash
pip install tinyedgellm
```
## Quick Start
```python
from tinyedgellm import quantize_and_prune
from transformers import AutoModelForCausalLM, AutoTokenizer
# Load a pretrained model
model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
# Advanced compression pipeline - achieves ~3.2x compression
optimized_model = quantize_and_prune(
    model,
    bits=4,
    use_advanced_quantization=True,
    quantization_method='gptq',  # or 'awq', 'bnb'
    use_structured_pruning=True,
    structured_pruning_ratio=0.1,
    use_knowledge_distillation=True,
    tokenizer=tokenizer,
    target_platform='onnx'
)
# Result: ~152MB model (from 487MB) with <2% quality degradation
```
### Advanced Usage
```python
# Use individual components
from tinyedgellm import GPTQQuantizer, apply_structured_pruning, distill_model

# Placeholder corpora: small lists of raw text strings
# (see the docs for the exact calibration/training data format)
calibration_data = ["TinyEdgeLLM compresses language models for edge devices."] * 32
training_data = calibration_data

# Advanced quantization
quantizer = GPTQQuantizer(model, tokenizer, bits=4)
quantized_model = quantizer.quantize(calibration_data)

# Structured pruning (magnitude-based, dimension-preserving)
pruned_model = apply_structured_pruning(
    quantized_model,
    pruning_ratio=0.1,
    tokenizer=tokenizer
)

# Knowledge distillation
compressed_model = distill_model(
    teacher_model=model,
    student_model=pruned_model,
    tokenizer=tokenizer,
    train_texts=training_data
)
```
### Running the Demo
```bash
# Clone the repository
git clone https://github.com/krish567366/tinyedgellm.git
cd tinyedgellm
# Install dependencies
pip install -e .
# Run the advanced compression demo
python demo_advanced.py
# Or try the simpler example
python examples/simple_example.py
# Or run the comprehensive demo
python examples/demo_distilgpt2.py
```
Each script demonstrates the compression techniques and reproduces the performance results table above.
## Documentation
For comprehensive documentation including architecture details, reproducibility instructions, advanced examples, and performance results, see the [online documentation](https://krish567366.github.io/tinyedgellm/).
### Key Sections
- **Reproducibility**: Exact environment setup and benchmark reproduction
- **Architecture**: Detailed system design and component overview
- **Examples**: Multiple usage examples from basic to advanced
- **Performance Results**: Comprehensive benchmarks and comparisons
- **API Reference**: Complete function and class documentation
### Local Documentation
To build documentation locally:
```bash
pip install -e ".[docs]"
mkdocs serve
```
## Contributing
We welcome contributions! Please see our [contributing guide](CONTRIBUTING.md) for details.
## License
MIT License - see [LICENSE](LICENSE) for details.