# speculators

- **Name:** speculators
- **Version:** 0.1.0
- **Summary:** A unified library for creating, representing, and storing speculative decoding algorithms for LLM serving such as in vLLM.
- **Author:** Red Hat
- **Requires Python:** >=3.9
- **Uploaded:** 2025-08-08 01:22:17
- **Keywords:** speculative decoding, transformers, llm, inference, vllm, machine learning, deep learning, nlp, language models, serving, standard format, productization, decoding algorithms, inference server
            <div align="center">

<picture>
    <source media="(prefers-color-scheme: dark)" srcset="https://raw.githubusercontent.com/neuralmagic/speculators/main/docs/assets/branding/speculators-logo-white.svg" />
    <source media="(prefers-color-scheme: light)" srcset="https://raw.githubusercontent.com/neuralmagic/speculators/main/docs/assets/branding/speculators-logo-black.svg" />
    <img alt="Speculators logo" src="https://raw.githubusercontent.com/neuralmagic/speculators/main/docs/assets/branding/speculators-logo-black.svg" height="64" />
  </picture>

[![License](https://img.shields.io/github/license/neuralmagic/speculators.svg)](https://github.com/neuralmagic/speculators/blob/main/LICENSE) [![Python Versions](https://img.shields.io/badge/Python-3.9--3.13-orange)](https://pypi.python.org/pypi/speculators)

</div>

## Overview

**Speculators** is a unified library for building, evaluating, and storing speculative decoding algorithms for large language model (LLM) inference, including in frameworks like vLLM. Speculative decoding is a lossless technique that speeds up LLM inference by using a smaller, faster speculator model to propose tokens, which are then verified by the larger base model, reducing latency without compromising output quality. Speculators standardizes this process with reusable formats and tools, enabling easier integration and deployment of speculative decoding in production-grade inference servers.
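The propose-and-verify loop can be sketched in miniature. Below, toy integer "models" stand in for real draft and base LLMs; this is an illustration of the technique, not the speculators API, and all names are hypothetical:

```python
def speculative_step(draft_model, target_model, prefix, k=4):
    """One propose-and-verify round of greedy speculative decoding.

    draft_model / target_model: callables mapping a token sequence to the
    next token (toy stand-ins for real LLM argmax decoding).
    Returns the tokens accepted this round.
    """
    # 1. The draft (speculator) model proposes k tokens autoregressively.
    proposed = []
    ctx = list(prefix)
    for _ in range(k):
        tok = draft_model(ctx)
        proposed.append(tok)
        ctx.append(tok)

    # 2. The target (base) model verifies each proposal. In a real server,
    #    all k verifications happen in one batched forward pass, which is
    #    where the latency win comes from.
    accepted = []
    ctx = list(prefix)
    for tok in proposed:
        expected = target_model(ctx)
        if tok == expected:
            accepted.append(tok)
            ctx.append(tok)
        else:
            # First mismatch: keep the target's own token and stop, so the
            # output is identical to running the target model alone.
            accepted.append(expected)
            break
    else:
        # All proposals accepted: the target contributes one bonus token.
        accepted.append(target_model(ctx))
    return accepted


# Toy models: the target counts up by 1; the draft agrees until token 3.
target = lambda seq: seq[-1] + 1
draft = lambda seq: seq[-1] + 1 if seq[-1] < 3 else 0

print(speculative_step(draft, target, [0], k=4))  # → [1, 2, 3, 4]
```

Because every accepted token matches what the target model would have produced on its own, the procedure is lossless: only the number of target forward passes changes, not the output.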

<p align="center">
  <picture>
    <source media="(prefers-color-scheme: dark)" srcset="https://raw.githubusercontent.com/neuralmagic/speculators/main/docs/assets/branding/speculators-user-flow-dark.svg" />
    <source media="(prefers-color-scheme: light)" srcset="https://raw.githubusercontent.com/neuralmagic/speculators/main/docs/assets/branding/speculators-user-flow-light.svg" />
    <img alt="Speculators user flow diagram" src="https://raw.githubusercontent.com/neuralmagic/speculators/main/docs/assets/branding/speculators-user-flow-light.svg" />
  </picture>
</p>

### Key Features

- **Unified Speculative Decoding Toolkit:** Simplifies the development, evaluation, and representation of speculative decoding algorithms, supporting both research and production use cases for LLMs.
- **Standardized, Extensible Format:** Provides a Hugging Face-compatible format for defining speculative models, with tools to convert from external research repositories for easy adoption.
- **Seamless vLLM Integration:** Built for direct deployment into vLLM, enabling low-latency, production-grade inference with minimal overhead.

## Getting Started

### Installation

Before installing, ensure you have the following prerequisites:

- OS: Linux or macOS
- Python: 3.9 or higher

Install Speculators directly from source using pip:

```bash
pip install git+https://github.com/neuralmagic/speculators.git
```

## Resources

Below are links to our research implementations. These provide prototype code for early experimentation, with plans to productize them into the main package.

- [eagle3](https://github.com/neuralmagic/speculators/tree/main/research/eagle3): This implementation trains models similar to the EAGLE-3 architecture, using the training-time test (TTT) method.

- [hass](https://github.com/neuralmagic/speculators/tree/main/research/hass): This implementation trains models that are a variation on the EAGLE 1 architecture using the [HASS](https://github.com/HArmonizedSS/HASS) method.

## vLLM Inference

Once a model is in the speculators format, you can serve it with vLLM:

```bash
VLLM_USE_V1=1 vllm serve RedHatAI/Qwen3-8B-speculator.eagle3
```
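Once the server is up, it exposes vLLM's standard OpenAI-compatible API. A minimal client sketch, assuming the default endpoint `http://localhost:8000/v1` and a running `vllm serve` process (the helper names here are illustrative, not part of speculators or vLLM):

```python
import json
import urllib.request


def build_request(prompt: str,
                  model: str = "RedHatAI/Qwen3-8B-speculator.eagle3") -> dict:
    """Build an OpenAI-style /v1/completions payload for the served model."""
    return {
        "model": model,
        "prompt": prompt,
        "max_tokens": 64,
        # Speculative decoding is lossless at any temperature; 0.0 is greedy.
        "temperature": 0.0,
    }


def complete(prompt: str,
             url: str = "http://localhost:8000/v1/completions") -> str:
    """Send a completion request to the running vLLM server."""
    payload = json.dumps(build_request(prompt)).encode()
    req = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["choices"][0]["text"]
```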

Served models can then be benchmarked using [GuideLLM](https://github.com/vllm-project/guidellm). Below, we show sample benchmark results comparing our speculator with its dense counterpart. We also apply [quantization](https://github.com/vllm-project/llm-compressor) to explore further performance gains, swapping the dense verifier, `Qwen/Qwen3-8B`, for the quantized FP8 model, [RedHatAI/Qwen3-8B-FP8-dynamic](https://huggingface.co/RedHatAI/Qwen3-8B-FP8-dynamic), in the `speculator_config`.

<p align="center">
  <picture>
    <source media="(prefers-color-scheme: dark)" srcset="https://raw.githubusercontent.com/neuralmagic/speculators/main/docs/assets/qwen_quant_benchmark.png">
    <img alt="GuideLLM benchmark results for the Qwen3-8B speculator" src="https://raw.githubusercontent.com/neuralmagic/speculators/main/docs/assets/qwen_quant_benchmark.png" width="100%">
  </picture>
</p>

### License

Speculators is licensed under the [Apache License 2.0](https://github.com/neuralmagic/speculators/blob/main/LICENSE).

### Cite

If you find Speculators helpful in your research or projects, please consider citing it:

```bibtex
@misc{speculators2025,
  title={Speculators: A Unified Library for Speculative Decoding Algorithms in LLM Serving},
  author={Red Hat},
  year={2025},
  howpublished={\url{https://github.com/neuralmagic/speculators}},
}
```

            
