codablellm

Name	codablellm
Version	1.3.2
Summary	A framework for creating and curating high-quality code datasets tailored for large language models
upload_time	2025-08-08 16:45:06
requires_python	<3.13,>=3.9
keywords	large language models, automation, reverse engineering, software security, dataset generation

<!-- markdownlint-disable MD041 -->
![Build Status](https://github.com/dmanuel64/codablellm/actions/workflows/test.yml/badge.svg?branch=main)
![Python Version](https://img.shields.io/pypi/pyversions/codablellm)
![PyPI](https://img.shields.io/pypi/v/codablellm)
![Downloads](https://img.shields.io/pypi/dm/codablellm)
![License](https://img.shields.io/github/license/dmanuel64/codablellm)
![Documentation Status](https://readthedocs.org/projects/codablellm/badge/?version=latest)

# CodableLLM

**CodableLLM** is a Python framework for creating and curating high-quality code datasets tailored for training and evaluating large language models (LLMs). It supports source code and decompiled code extraction, with a flexible architecture for handling multiple languages and integration with custom LLM prompts.

## Installation

### PyPI

Install CodableLLM directly from PyPI:

```bash
pip install codablellm
```
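
The package metadata declares `requires_python` as `<3.13,>=3.9`, so `pip` will refuse to install on other interpreters. A minimal sketch for checking the running interpreter before installing is shown here; the function name `is_supported` is illustrative and not part of CodableLLM.

```python
import sys

# Bounds taken from the package metadata: requires_python "<3.13,>=3.9".
MIN_VERSION = (3, 9)
MAX_EXCLUSIVE = (3, 13)


def is_supported(version=None):
    """Return True if the (major, minor) version falls in the supported range."""
    major_minor = tuple((version or sys.version_info)[:2])
    return MIN_VERSION <= major_minor < MAX_EXCLUSIVE


if __name__ == "__main__":
    status = "supported" if is_supported() else "not supported"
    print(f"Python {sys.version_info.major}.{sys.version_info.minor} is {status}")
```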

### Docker Compose (Recommended)

CodableLLM uses [Prefect](https://www.prefect.io/) for orchestration and parallel processing.
Because Prefect relies on a backend database, we recommend using the provided Docker Compose setup, which includes a configured PostgreSQL database.

**Run an example extraction using Docker Compose**:

```bash
docker compose run --rm app \
  codablellm \
  --url https://github.com/dmanuel64/codablellm/raw/refs/heads/main/examples/demo-c-repo.zip \
  /tmp/demo-c-repo \
  ./demo-c-repo.csv \
  /tmp/demo-c-repo \
  --strip \
  --transform my_transform.transform \
  --generation-mode temp-append \
  --build make
```

This command does the following:

- Downloads the compressed C project archive at the given `--url` and extracts it to `/tmp/demo-c-repo`.
- Uses `/tmp/demo-c-repo` as both the source of extracted code and the location of compiled binaries.
- Outputs a dataset to `./demo-c-repo.csv` (relative to your host machine).
- Runs the build command (`make`) inside the extracted repo directory to generate binaries.
- Applies transformations using the `transform` function defined in `my_transform.py` (i.e., `my_transform.transform`).
- Uses `--generation-mode temp-append`, which appends transformed outputs to the original dataset, preserving both.
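
The `--transform my_transform.transform` option names a Python function by module path. As a rough sketch of what such a `my_transform.py` might look like, the example below strips C comments from a function body. The `FunctionRecord` class and its `name` and `definition` fields are hypothetical stand-ins so the sketch runs on its own; the actual object type and signature CodableLLM passes to a transform should be taken from the User Guide.

```python
import re
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class FunctionRecord:
    """Hypothetical stand-in for the function object passed to a transform."""
    name: str
    definition: str


# Matches C block comments and line comments (naive: ignores string literals).
_COMMENT = re.compile(r"/\*.*?\*/|//[^\n]*", re.DOTALL)


def transform(function: FunctionRecord) -> FunctionRecord:
    """Return a copy of the record with C comments stripped from its definition."""
    stripped = _COMMENT.sub("", function.definition)
    # Drop lines that became empty after comment removal.
    lines = [line.rstrip() for line in stripped.splitlines() if line.strip()]
    return replace(function, definition="\n".join(lines))
```

Returning a new record rather than mutating the input keeps the original function available, which matters for a mode like `temp-append` that preserves both versions.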

> **This uses the `app` service defined in `docker-compose.yml`, giving you access to the full environment including Prefect and PostgreSQL, which are required for managing flows and task state.**
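
Once the run finishes, the resulting `./demo-c-repo.csv` can be inspected with the standard library. The sketch below makes no assumption about the dataset's column names; it only reports the header and the number of rows. The helper name `summarize_csv` is illustrative.

```python
import csv
from pathlib import Path


def summarize_csv(path):
    """Return (column_names, row_count) for a CSV file with a header row."""
    with Path(path).open(newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        row_count = sum(1 for _ in reader)
        return list(reader.fieldnames or []), row_count


if __name__ == "__main__":
    dataset = Path("demo-c-repo.csv")
    if dataset.exists():
        columns, rows = summarize_csv(dataset)
        print(f"{rows} rows; columns: {columns}")
```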

## Features

- Extracts functions and methods from source code repositories using [tree-sitter](https://github.com/tree-sitter/tree-sitter).
- Integrates easily with LLMs to refine or augment extracted code (e.g., renaming variables or inserting comments).
- Language-agnostic design with support for plugin-based extractor and decompiler extensions.
- Extensible API for building your own workflows and datasets.
- Fast and scalable, using Prefect to orchestrate and parallelize code extraction, transformation, and dataset generation across multiple processes and tasks.

## Documentation

Complete documentation is available on [Read the Docs](https://codablellm.readthedocs.io/):

- [User Guide](https://codablellm.readthedocs.io/en/latest/User%20Guide/)
- [Supported Languages & Decompilers](https://codablellm.readthedocs.io/en/latest/Built-In%20Support/)
- [API Reference](https://codablellm.readthedocs.io/en/latest/documentation/codablellm/)

## Citation

If you use this tool in your research, please cite [the paper](https://arxiv.org/abs/2507.22066) associated with it:

```bibtex
@misc{manuel2025codablellmautomatingdecompiledsource,
      title={CodableLLM: Automating Decompiled and Source Code Mapping for LLM Dataset Generation}, 
      author={Dylan Manuel and Paul Rad},
      year={2025},
      eprint={2507.22066},
      archivePrefix={arXiv},
      primaryClass={cs.SE},
      url={https://arxiv.org/abs/2507.22066}, 
}
```

## Contributing

We welcome contributions from the community! See [CONTRIBUTING.md](https://github.com/dmanuel64/codablellm/blob/main/CONTRIBUTING.md) for guidelines, development setup, and how to get started.

Raw data

{
    "_id": null,
    "home_page": null,
    "name": "codablellm",
    "maintainer": null,
    "docs_url": null,
    "requires_python": "<3.13,>=3.9",
    "maintainer_email": null,
    "keywords": "large language models, automation, reverse engineering, software security, dataset generation",
    "author": null,
    "author_email": "Dylan Manuel <dylan.manuel@my.utsa.edu>",
    "download_url": "https://files.pythonhosted.org/packages/b2/80/001706ae2157e26a4724856cd948a45be6d50f446f04df98e5e65829b3e4/codablellm-1.3.2.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": null,
    "summary": "A framework for creating and curating high-quality code datasets tailored for large language models",
    "version": "1.3.2",
    "project_urls": {
        "Bug Tracker": "https://github.com/dmanuel64/codablellm/issues",
        "Documentation": "https://codablellm.readthedocs.io",
        "GitHub": "https://github.com/dmanuel64/codablellm",
        "Homepage": "https://codablellm.readthedocs.io"
    },
    "split_keywords": [
        "large language models",
        " automation",
        " reverse engineering",
        " software security",
        " dataset generation"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "91dd2d17dfb146cd6cef4bc0d1d47da076769b1bfdf0bf9a36bfa0477ce87e16",
                "md5": "323d81192cdf76f4a0420eef3820ae88",
                "sha256": "df2bd28e326e666ec7d438916a6d505f17e1510d1c7bba2f83558b16d9e06dfe"
            },
            "downloads": -1,
            "filename": "codablellm-1.3.2-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "323d81192cdf76f4a0420eef3820ae88",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": "<3.13,>=3.9",
            "size": 54591,
            "upload_time": "2025-08-08T16:45:05",
            "upload_time_iso_8601": "2025-08-08T16:45:05.308251Z",
            "url": "https://files.pythonhosted.org/packages/91/dd/2d17dfb146cd6cef4bc0d1d47da076769b1bfdf0bf9a36bfa0477ce87e16/codablellm-1.3.2-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "b280001706ae2157e26a4724856cd948a45be6d50f446f04df98e5e65829b3e4",
                "md5": "4fc530969eff807f6e54b46c7177c4a4",
                "sha256": "6c15be1cca74bebd447e75919d7cfe9df6a2c051b090095a117e093aee05a542"
            },
            "downloads": -1,
            "filename": "codablellm-1.3.2.tar.gz",
            "has_sig": false,
            "md5_digest": "4fc530969eff807f6e54b46c7177c4a4",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": "<3.13,>=3.9",
            "size": 45041,
            "upload_time": "2025-08-08T16:45:06",
            "upload_time_iso_8601": "2025-08-08T16:45:06.720758Z",
            "url": "https://files.pythonhosted.org/packages/b2/80/001706ae2157e26a4724856cd948a45be6d50f446f04df98e5e65829b3e4/codablellm-1.3.2.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-08-08 16:45:06",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "dmanuel64",
    "github_project": "codablellm",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "lcname": "codablellm"
}
