## SLOTHY: Assembly optimization via constraint solving
<p align="center">
<img src="https://raw.githubusercontent.com/slothy-optimizer/slothy/main/docs/slothy_logo.png" width=160>
</p>





## About SLOTHY
**SLOTHY** - **S**uper (**L**azy) **O**ptimization of **T**ricky **H**andwritten assembl**Y** - is an assembly-level superoptimizer
for:
1. Instruction scheduling
2. Register allocation
3. Software pipelining (= periodic loop interleaving)

SLOTHY is generic in the target architecture and microarchitecture. This repository provides instantiations for:
- Armv8.1-M+Helium: Cortex-M55, Cortex-M85
- AArch64: Cortex-A55, and experimentally Cortex-A72, Cortex-X/Neoverse-V, Apple M1 (Firestorm, Icestorm)

SLOTHY is discussed in [Fast and Clean: Auditable high-performance assembly via constraint solving](https://eprint.iacr.org/2022/1303).
### Goal
SLOTHY enables a development workflow where developers write 'clean' assembly by hand, emphasizing the logic of the computation, while SLOTHY automates microarchitecture-specific micro-optimizations. This accelerates development, keeps manually written code artifacts maintainable, and allows formal verification efforts to be split into separate verification of the clean code and of the micro-optimizations.
### How it works
SLOTHY is essentially a constraint solver frontend: it converts the input source into a data flow graph and
builds a constraint model capturing valid instruction schedulings, register renamings, and periodic loop
interleavings. The model is passed to an external constraint solver and, upon success,
a satisfying assignment is converted back into the final code. Currently, SLOTHY uses
[Google OR-Tools](https://developers.google.com/optimization) as its constraint solver backend.
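
The idea of searching over dependency-respecting reorderings can be illustrated with a toy sketch. This is not SLOTHY's implementation: it swaps the constraint solver for brute-force enumeration, assumes a made-up single-issue latency model, and tracks only read-after-write dependencies. All names (`INSTRUCTIONS`, `data_flow_edges`, `cycles`, `best_schedule`) and all latencies are hypothetical.

```python
from itertools import permutations

# Toy instructions: (text, destination register, source registers, latency).
# Latencies are illustration values only, NOT a real Cortex-A55 model.
INSTRUCTIONS = [
    ("ldr x0, [x1]", "x0", ["x1"], 3),
    ("add x4, x0, x0", "x4", ["x0"], 1),
    ("ldr x2, [x3]", "x2", ["x3"], 3),
    ("add x6, x2, x2", "x6", ["x2"], 1),
]


def data_flow_edges(instructions):
    """Return (producer, consumer) index pairs for read-after-write dependencies."""
    edges = []
    last_writer = {}
    for idx, (_, dst, srcs, _) in enumerate(instructions):
        for src in srcs:
            if src in last_writer:
                edges.append((last_writer[src], idx))
        if dst is not None:
            last_writer[dst] = idx
    return edges


def cycles(order, instructions, edges):
    """Cycle count for `order` on a toy in-order core issuing one instruction per cycle."""
    issue = {}
    cycle = 0
    for idx in order:
        ready = cycle
        for producer, consumer in edges:
            if consumer == idx:
                # Stall until the producer's result is available.
                ready = max(ready, issue[producer] + instructions[producer][3])
        issue[idx] = ready
        cycle = ready + 1
    return cycle


def best_schedule(instructions):
    """Enumerate all orderings, keep the valid ones, return (cost, order) of the cheapest."""
    edges = data_flow_edges(instructions)
    best = None
    for order in permutations(range(len(instructions))):
        position = {idx: pos for pos, idx in enumerate(order)}
        # An ordering is valid only if every producer precedes its consumer.
        if any(position[p] > position[c] for p, c in edges):
            continue
        cost = cycles(order, instructions, edges)
        if best is None or cost < best[0]:
            best = (cost, order)
    return best


if __name__ == "__main__":
    cost, order = best_schedule(INSTRUCTIONS)
    print(cost, [INSTRUCTIONS[i][0] for i in order])
```

In this toy model the input order takes 8 cycles, while interleaving the two loads ahead of the two adds takes 5. Brute force scales factorially with the number of instructions, which is one reason a real tool would hand the model to a constraint solver instead.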
### Performance
As a rough rule of thumb, SLOTHY typically optimizes workloads of fewer than 50 instructions in seconds to minutes and
workloads of up to 150 instructions in minutes to hours; larger kernels require some heuristics.
### Applications
SLOTHY has been used to provide the fastest known implementations of various cryptographic and DSP primitives.
For example, the [SLOTHY paper](https://eprint.iacr.org/2022/1303) discusses the NTTs underlying ML-KEM and ML-DSA for
Cortex-{A55, A72, M55, M85}, the FFT for Cortex-{M55, M85}, and the X25519 scalar multiplication for Cortex-A55. You can find
the clean and optimized source code for those examples in [`paper/`](paper).
## Getting started
Have a look at the [SLOTHY tutorial](docs/source/tutorial/README.md) for a hands-on and example-based introduction to SLOTHY.
## Real world uses
* [AWS libcrypto (AWS-LC)](https://github.com/aws/aws-lc): SLOTHY-optimized X25519 code based on our un-interleaved form of the [original code by Emil
Lenngren](https://github.com/Emill/X25519-AArch64) has been [formally verified and
included](https://github.com/awslabs/s2n-bignum/pull/108) in
[s2n-bignum](https://github.com/awslabs/s2n-bignum/) (the bignum component of AWS-LC) and [merged](https://github.com/aws/aws-lc/pull/1469) into
AWS-LC. This was the topic of a [Real World Crypto 2024
talk](https://iacr.org/submit/files/slides/2024/rwc/rwc2024/38/slides.pdf).
* [s2n-bignum](https://github.com/awslabs/s2n-bignum/) routinely employs SLOTHY for finding
further highly optimized ECC implementations (e.g., [P256](https://github.com/awslabs/s2n-bignum/pull/118),
[P384](https://github.com/awslabs/s2n-bignum/pull/122), [P521](https://github.com/awslabs/s2n-bignum/pull/130)) and
verifies them through automated equivalence-checking in [HOL Light](https://hol-light.github.io/).
* [Arm EndpointAI](https://github.com/ARM-software/EndpointAI): SLOTHY-optimized code has been deployed to the CMSIS DSP Library for the radix-4 CFFT routines as part
of the Arm EndpointAI project in [this
commit](https://github.com/ARM-software/EndpointAI/commit/817bb57d8a4a604538a04627851f5e9adb5f08fc).
* [mlkem-native](https://github.com/pq-code-package/mlkem-native): AArch64 assembly routines of ML-KEM are automatically optimized using SLOTHY.
* [pqm7](https://github.com/mupq/pqm7): A benchmarking framework for the Arm Cortex-M7, created as a case study for automated microarchitectural migrations of software libraries using SLOTHY. See [our paper](https://eprint.iacr.org/2025/366) for more details.
## Installation
### Option 1: Install from PyPI
The easiest way to install SLOTHY is via pip:
```bash
pip install slothy
```
#### Quick Start
Here's a minimal example of using SLOTHY to optimize assembly code:
```python
import sys
import logging
import slothy
import slothy.targets.aarch64.aarch64_neon as AArch64_Neon
import slothy.targets.aarch64.cortex_a55 as Target_CortexA55
logging.basicConfig(stream=sys.stdout, level=logging.INFO)
# Create SLOTHY instance for Arm Cortex-A55
s = slothy.Slothy(AArch64_Neon, Target_CortexA55)
# Load assembly code from file
s.load_source_from_file('example.s')
# Optimize the code
s.optimize(start='start_label', end='end_label')
# Write optimized assembly to file
s.write_source_to_file('example_optimized.s')
print("Optimization complete! Check example_optimized.s")
```
Example assembly file (`example.s`):
```assembly
start_label:
ldr x0, [x1]
ldr x2, [x3]
add x4, x0, x2
str x4, [x5]
end_label:
```
### Option 2: Development Installation
For development or to run the examples in this repository, first clone the SLOTHY development repository:
```bash
git clone https://github.com/slothy-optimizer/slothy.git
```
#### Requirements
SLOTHY has been successfully used on
- Ubuntu-21.10 and up (64-bit),
- macOS Monterey 12.6 and up.

SLOTHY supports Python 3.9 up to 3.13. For development, Python >= 3.10 is required.
See [requirements.txt](requirements.txt) for package requirements, and install via `pip install -r requirements.txt`.
**Note:** `requirements.txt` pins versions for reproducibility. If you already have newer versions of some dependencies
installed and don't want them downgraded, consider using a virtual environment:
```bash
python3 -m venv venv
./venv/bin/python3 -m pip install -r requirements.txt
```
Then, enter the virtual environment via `source venv/bin/activate` prior to running SLOTHY.
Finally, adjust your PATH environment variable to include the directories containing
the `slothy-cli` script and the LLVM `llvm-mca` tool.
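
The requirements above can be checked with a short standard-library script. This is a sketch rather than part of SLOTHY: the version bounds and tool names are taken from the text above, and the helper names (`python_supported`, `in_virtualenv`, `locate_tools`) are hypothetical.

```python
import shutil
import sys

# Bounds and tool names as stated in the requirements text.
MIN_PYTHON = (3, 9)
MAX_PYTHON = (3, 13)
REQUIRED_TOOLS = ("slothy-cli", "llvm-mca")


def python_supported(version=None):
    """Return True if (major, minor) lies within the supported range."""
    major_minor = tuple(version or sys.version_info)[:2]
    return MIN_PYTHON <= major_minor <= MAX_PYTHON


def in_virtualenv():
    """Return True when running inside a venv-style virtual environment."""
    return sys.prefix != sys.base_prefix


def locate_tools(tools=REQUIRED_TOOLS):
    """Map each tool name to its resolved path, or None if it is not on PATH."""
    return {tool: shutil.which(tool) for tool in tools}


if __name__ == "__main__":
    print("Python version supported:", python_supported())
    print("Inside virtual environment:", in_virtualenv())
    for tool, path in locate_tools().items():
        print(f"{tool}: {path or 'not found on PATH'}")
```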
#### Verify Development Installation
To check that your development setup is complete, try the following from the base directory:
```
% python3 test.py --tests aarch64_simple0_a55
```
You should see something like the following:
```
* Example: aarch64_simple0_a55...
INFO:aarch64_simple0_a55:SLOTHY version: 0.1.0
INFO:aarch64_simple0_a55:Instructions in body: 20
INFO:aarch64_simple0_a55.slothy:Perform internal binary search for minimal number of stalls...
INFO:aarch64_simple0_a55.slothy:Attempt optimization with max 32 stalls...
INFO:aarch64_simple0_a55.slothy:Objective: minimize number of stalls
INFO:aarch64_simple0_a55.slothy:Invoking external constraint solver (OR-Tools CP-SAT v9.7.2996) ...
INFO:aarch64_simple0_a55.slothy:[0.0721s]: Found 1 solutions so far... objective 19.0, bound 8.0 (minimize number of stalls)
INFO:aarch64_simple0_a55.slothy:[0.0765s]: Found 2 solutions so far... objective 18.0, bound 12.0 (minimize number of stalls)
INFO:aarch64_simple0_a55.slothy:OPTIMAL, wall time: 0.155224 s
INFO:aarch64_simple0_a55.slothy:Booleans in result: 509
INFO:aarch64_simple0_a55.slothy.selfcheck:OK!
INFO:aarch64_simple0_a55.slothy:Minimum number of stalls: 18
```
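
To check such a run from a script, the relevant lines can be picked out of the captured output. The sketch below assumes the log format shown above, which may differ between SLOTHY versions. The helper `summarize_log` is hypothetical.

```python
import re

# Final result line as it appears in the sample output above.
STALLS_PATTERN = re.compile(r"Minimum number of stalls: (\d+)")


def summarize_log(log_text):
    """Extract the self-check status and the final stall count from log text."""
    match = STALLS_PATTERN.search(log_text)
    return {
        "selfcheck_ok": ".selfcheck:OK!" in log_text,
        "min_stalls": int(match.group(1)) if match else None,
    }


# Excerpt of the sample output shown above.
SAMPLE = """\
INFO:aarch64_simple0_a55.slothy:OPTIMAL, wall time: 0.155224 s
INFO:aarch64_simple0_a55.slothy.selfcheck:OK!
INFO:aarch64_simple0_a55.slothy:Minimum number of stalls: 18
"""

if __name__ == "__main__":
    print(summarize_log(SAMPLE))
```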
### Option 3: Docker
A Dockerfile for an Ubuntu-22.04 based Docker image with all dependencies of SLOTHY and the PQMX+PQAX test
environments set up can be found in [paper/artifact/slothy.Dockerfile](paper/artifact/slothy.Dockerfile). See
[paper/artifact/README.md](paper/artifact/README.md) for instructions.
#### Examples
The [SLOTHY Tutorial](docs/source/tutorial/README.md) and the [examples](examples/naive) directory contain numerous example
assembly snippets. To try them, use `python3 example.py --examples={YOUR_EXAMPLE}`. See `python3 example.py --help` for
the list of all available examples.
The use of SLOTHY from the command line is illustrated in [scripts/](scripts/) supporting the real-world optimizations
for the NTT, FFT and X25519 discussed in [Fast and Clean: Auditable high-performance assembly via constraint
solving](https://eprint.iacr.org/2022/1303).
## Contributing
For information on how to contribute to SLOTHY, please see
[`CONTRIBUTING.md`](./CONTRIBUTING.md).
Raw data
{
"_id": null,
"home_page": null,
"name": "slothy",
"maintainer": "Hanno Becker, Matthias Kannwischer, Amin Abdulrahman",
"docs_url": null,
"requires_python": ">=3.9",
"maintainer_email": null,
"keywords": "superoptimization, constraint solving, assembly, cryptography, optimization, ortools",
"author": "Hanno Becker, Amin Abdulrahman, Matthias Kannwischer, Fabien Klein",
"author_email": null,
"download_url": "https://files.pythonhosted.org/packages/42/6e/4ecda0a130a065777bb73fc0c276c1f6b7f433685d08d4164d60e97a99d9/slothy-0.1.2.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "MIT",
"summary": "SLOTHY: Assembly superoptimization via constraint solving",
"version": "0.1.2",
"project_urls": {
"Documentation": "https://slothy-optimizer.github.io/slothy",
"Homepage": "https://slothy-optimizer.github.io/slothy",
"Issues": "https://github.com/slothy-optimizer/slothy/issues",
"Repository": "https://github.com/slothy-optimizer/slothy"
},
"split_keywords": [
"superoptimization",
" constraint solving",
" assembly",
" cryptography",
" optimization",
" ortools"
],
"urls": [
{
"comment_text": null,
"digests": {
"blake2b_256": "609fee3404fdd59e20de1ea65e95fb40c3c892c1701023aed1967ce08fefef2b",
"md5": "ee4d8b326b8bc8d5cc01e72eaba86bec",
"sha256": "2b5dfba0dad0ff6ce4cb6f2182c45a4f70e694d5430bf8d29acb9ec770297725"
},
"downloads": -1,
"filename": "slothy-0.1.2-py3-none-any.whl",
"has_sig": false,
"md5_digest": "ee4d8b326b8bc8d5cc01e72eaba86bec",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.9",
"size": 183607,
"upload_time": "2025-07-28T04:34:33",
"upload_time_iso_8601": "2025-07-28T04:34:33.010805Z",
"url": "https://files.pythonhosted.org/packages/60/9f/ee3404fdd59e20de1ea65e95fb40c3c892c1701023aed1967ce08fefef2b/slothy-0.1.2-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "426e4ecda0a130a065777bb73fc0c276c1f6b7f433685d08d4164d60e97a99d9",
"md5": "be134ca497ae416535f9ea6802db3687",
"sha256": "f3986e61a9a9e72b00f921ac4c1cde8518b625204af885ee11c369761028d5dd"
},
"downloads": -1,
"filename": "slothy-0.1.2.tar.gz",
"has_sig": false,
"md5_digest": "be134ca497ae416535f9ea6802db3687",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.9",
"size": 163671,
"upload_time": "2025-07-28T04:34:35",
"upload_time_iso_8601": "2025-07-28T04:34:35.429303Z",
"url": "https://files.pythonhosted.org/packages/42/6e/4ecda0a130a065777bb73fc0c276c1f6b7f433685d08d4164d60e97a99d9/slothy-0.1.2.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2025-07-28 04:34:35",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "slothy-optimizer",
"github_project": "slothy",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"requirements": [
{
"name": "ortools",
"specs": [
[
"==",
"9.7.2996"
]
]
},
{
"name": "ortools",
"specs": [
[
"==",
"9.12.4544"
]
]
},
{
"name": "pandas",
"specs": [
[
">=",
"2.0.3"
]
]
},
{
"name": "sympy",
"specs": [
[
"==",
"1.14.0"
]
]
},
{
"name": "unicorn",
"specs": [
[
"==",
"2.1.3"
]
]
},
{
"name": "black",
"specs": []
},
{
"name": "pydoclint",
"specs": [
[
"==",
"0.6.6"
]
]
},
{
"name": "flake8",
"specs": [
[
"==",
"7.3.0"
]
]
},
{
"name": "sphinx",
"specs": [
[
"==",
"8.2.3"
]
]
},
{
"name": "sphinx_rtd_theme",
"specs": [
[
"==",
"3.0.2"
]
]
},
{
"name": "sphinx_autoapi",
"specs": []
},
{
"name": "sphinx_mdinclude",
"specs": []
},
{
"name": "myst_parser",
"specs": []
},
{
"name": "sphinx-autobuild",
"specs": []
},
{
"name": "sphinx-autodoc2",
"specs": []
}
],
"lcname": "slothy"
}