<!-- markdownlint-disable MD041 -->

# CodableLLM
**CodableLLM** is a Python framework for creating and curating high-quality code datasets tailored for training and evaluating large language models (LLMs). It supports source code and decompiled code extraction, with a flexible architecture for handling multiple languages and integration with custom LLM prompts.
## Installation
### PyPI
Install CodableLLM directly from PyPI:
```bash
pip install codablellm
```
### Docker Compose (Recommended)
CodableLLM uses [Prefect](https://www.prefect.io/) for orchestration and parallel processing.
Because Prefect relies on a backend database, we recommend using the provided Docker Compose setup, which includes a configured PostgreSQL database.
**Run an example extraction using Docker Compose**:
```bash
docker compose run --rm app \
codablellm \
--url https://github.com/dmanuel64/codablellm/raw/refs/heads/main/examples/demo-c-repo.zip \
/tmp/demo-c-repo \
./demo-c-repo.csv \
/tmp/demo-c-repo \
--strip \
--transform my_transform.transform \
--generation-mode temp-append \
--build make
```
This command does the following:
- Downloads and extracts a compressed C project archive from the given `--url` to `/tmp/demo-c-repo`.
- Uses `/tmp/demo-c-repo` as both the source of extracted code and the location of compiled binaries.
- Outputs a dataset to `./demo-c-repo.csv` (relative to your host machine).
- Runs the build command (`make`) inside the extracted repo directory to generate binaries.
- Applies transformations using the function defined in `my_transform.py` (i.e., `my_transform.transform`).
- Uses `--generation-mode temp-append`, which appends transformed outputs to the original dataset, preserving both.
> **This uses the `app` service defined in `docker-compose.yml`, giving you access to the full environment including Prefect and PostgreSQL, which are required for managing flows and task state.**
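The `--transform my_transform.transform` argument above points at a function named `transform` in a module named `my_transform`. The following is a minimal sketch of what such a module might look like, not the definitive interface: `FunctionRecord` and its `name` and `definition` fields are hypothetical stand-ins for whatever object CodableLLM actually passes to a transform, so check the API Reference for the real type and signature before using it. The comment stripping is deliberately naive and would also remove `//` sequences inside string literals.

```python
# my_transform.py (illustrative sketch; the real CodableLLM types may differ)
import re
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class FunctionRecord:
    """Hypothetical stand-in for the function object handed to a transform."""

    name: str
    definition: str


# Naive patterns for C comments; they do not account for string literals.
_BLOCK_COMMENT = re.compile(r"/\*.*?\*/", re.DOTALL)
_LINE_COMMENT = re.compile(r"//[^\n]*")


def transform(function: FunctionRecord) -> FunctionRecord:
    """Return a copy of the function with C comments removed."""
    stripped = _BLOCK_COMMENT.sub("", function.definition)
    stripped = _LINE_COMMENT.sub("", stripped)
    # Trim trailing whitespace and drop lines that became empty.
    lines = [line.rstrip() for line in stripped.splitlines()]
    cleaned = "\n".join(line for line in lines if line.strip())
    # The input is left untouched; a modified copy is returned instead.
    return replace(function, definition=cleaned)
```

Returning a new record rather than mutating the input keeps the original definition available, which matches the idea of a generation mode that preserves both the original and the transformed entries.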
## Features
- Extracts functions and methods from source code repositories using [tree-sitter](https://github.com/tree-sitter/tree-sitter).
- Integrates easily with LLMs to refine or augment extracted code (e.g., renaming variables or inserting comments).
- Language-agnostic design with support for plugin-based extractor and decompiler extensions.
- Extensible API for building your own workflows and datasets.
- Fast and scalable, using Prefect to orchestrate and parallelize code extraction, transformation, and dataset generation across multiple processes and tasks.
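
To give a feel for what function-level extraction from a syntax tree involves, here is a small sketch that uses Python's standard-library `ast` module as a stand-in. It is not CodableLLM's implementation and does not use tree-sitter; the helper name `extract_functions` is made up for illustration, and it only handles Python source.

```python
import ast


def extract_functions(source: str) -> list[tuple[str, str]]:
    """Return (name, source text) pairs for every function or method found."""
    tree = ast.parse(source)
    functions = []
    # ast.walk visits every node, so nested functions and methods are included.
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            segment = ast.get_source_segment(source, node)
            if segment is not None:
                functions.append((node.name, segment))
    return functions


if __name__ == "__main__":
    sample = "def f(x):\n    return x + 1\n\nclass A:\n    def g(self):\n        return 2\n"
    for name, text in extract_functions(sample):
        print(f"--- {name} ---")
        print(text)
```

A parser such as tree-sitter follows the same general idea (parse, walk the tree, slice out the nodes of interest) while supporting many languages through separate grammars.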
## Documentation
Complete documentation is available on [Read the Docs](https://codablellm.readthedocs.io/):
- [User Guide](https://codablellm.readthedocs.io/en/latest/User%20Guide/)
- [Supported Languages & Decompilers](https://codablellm.readthedocs.io/en/latest/Built-In%20Support/)
- [API Reference](https://codablellm.readthedocs.io/en/latest/documentation/codablellm/)
## Citation
If you use this tool in your research, please cite [the paper](https://arxiv.org/abs/2507.22066) associated with it:
```bibtex
@misc{manuel2025codablellmautomatingdecompiledsource,
title={CodableLLM: Automating Decompiled and Source Code Mapping for LLM Dataset Generation},
author={Dylan Manuel and Paul Rad},
year={2025},
eprint={2507.22066},
archivePrefix={arXiv},
primaryClass={cs.SE},
url={https://arxiv.org/abs/2507.22066},
}
```
## Contributing
We welcome contributions from the community! See [CONTRIBUTING.md](https://github.com/dmanuel64/codablellm/blob/main/CONTRIBUTING.md) for guidelines, development setup, and how to get started.