cldk

Name	cldk
Version	0.4.0
home_page	https://github.com/IBM/codellm-devkit
Summary	codellm-devkit: A python library for seamless integration with LLMs.
upload_time	2024-11-13 20:09:24
maintainer	None
docs_url	None
author	Rahul Krishna
requires_python	>=3.11
license	Apache 2.0
keywords	ibm llm large language models code analyzer syntax tree

# CodeLLM-Devkit: A Python library for seamless interaction with CodeLLMs

![codellm-devkit logo](https://github.com/IBM/codellm-devkit/blob/main/docs/assets/cldk.png?raw=true)

[![arXiv](https://img.shields.io/badge/arXiv-2410.13007-b31b1b.svg)](https://arxiv.org/abs/2410.13007)
[![Python 3.11](https://img.shields.io/badge/python-3.11-blue.svg)](https://www.python.org/downloads/release/python-3110/)
[![License](https://img.shields.io/badge/License-Apache%202.0-green.svg)](https://opensource.org/licenses/Apache-2.0)
[![Documentation](https://img.shields.io/badge/GitHub%20Pages-Docs-blue)](https://ibm.github.io/codellm-devkit/)
[![PyPI version](https://badge.fury.io/py/cldk.svg)](https://badge.fury.io/py/cldk)


Codellm-devkit (CLDK) is a multilingual program analysis framework that bridges the gap between traditional static analysis tools and Large Language Models (LLMs) specialized for code (CodeLLMs). Codellm-devkit allows developers to streamline the process of transforming raw code into actionable insights by providing a unified interface for integrating outputs from various analysis tools and preparing them for effective use by CodeLLMs.

Codellm-devkit simplifies the complex process of analyzing codebases that span multiple programming languages, making it easier to extract meaningful insights and drive LLM-based code analysis. `CLDK` achieves this through an open-source Python library that abstracts the intricacies of program analysis and LLM interactions. The library gives developers a single interface for collecting the outputs of different analysis tools and preparing them as input for CodeLLMs.

**The purpose of Codellm-devkit is to enable the development of, and experimentation with, robust analysis pipelines that harness the power of both traditional program analysis tools and CodeLLMs.**
By providing a consistent and extensible framework, Codellm-devkit aims to reduce the friction associated with multi-language code analysis and ensure compatibility across different analysis tools and LLM platforms.

Codellm-devkit is designed to integrate seamlessly with a variety of popular analysis tools, such as WALA, Tree-sitter, LLVM, and CodeQL, each implemented in different languages. Codellm-devkit acts as a crucial intermediary layer, enabling efficient and consistent communication between these tools and the CodeLLMs.

Codellm-devkit is constantly evolving to include new tools and frameworks, ensuring it remains a versatile solution for code analysis and LLM integration.

Codellm-devkit is:

- **Unified**: Provides a single framework for integrating multiple analysis tools and CodeLLMs, regardless of the programming languages involved.
- **Extensible**: Designed to support new analysis tools and LLM platforms, making it adaptable to the evolving landscape of code analysis.
- **Streamlined**: Simplifies the process of transforming raw code into structured, LLM-ready inputs, reducing the overhead typically associated with multi-language analysis.

Codellm-devkit is an ongoing project, developed at IBM Research.

## Contact

For any questions, feedback, or suggestions, please contact the authors:

| Name | Email |
| ---- | ----- |
| Rahul Krishna | [i.m.ralk@gmail.com](mailto:imralk+oss@gmail.com) |
| Rangeet Pan | [rangeet.pan@ibm.com](mailto:rangeet.pan@gmail.com) |
| Saurabh Sinha | [sinhas@us.ibm.com](mailto:sinhas@us.ibm.com) |

## Table of Contents

- [CodeLLM-Devkit: A Python library for seamless interaction with CodeLLMs](#codellm-devkit-a-python-library-for-seamless-interaction-with-codellms)
  - [Contact](#contact)
  - [Table of Contents](#table-of-contents)
  - [Architectural and Design Overview](#architectural-and-design-overview)
  - [Quick Start: Example Walkthrough](#quick-start-example-walkthrough)
    - [Prerequisites](#prerequisites)
    - [Step 1:  Set up an Ollama server](#step-1--set-up-an-ollama-server)
      - [Pull the latest version of Granite 8b instruct model from ollama](#pull-the-latest-version-of-granite-8b-instruct-model-from-ollama)
    - [Step 2:  Install CLDK](#step-2--install-cldk)
    - [Step 3:  Build a code summarization pipeline](#step-3--build-a-code-summarization-pipeline)
    - [Publication (papers and blogs related to CLDK)](#publication-papers-and-blogs-related-to-cldk)

## Architectural and Design Overview

Below is a very high-level overview of the architecture of CLDK:


```mermaid
graph TD
User <--> A[CLDK]
    A --> 15[Retrieval ‡]
    A --> 16[Prompting ‡]
    A[CLDK] <--> B[Languages]
        B --> C[Java, Python, Go ‡, C ‡, JavaScript ‡, TypeScript ‡, Rust ‡]
            C --> D[Data Models]
                D --> 13{Pydantic}
            13 --> 7            
            C --> 7{backends}
                7 <--> 9[WALA]
                    9 <--> 14[Analysis]
                7 <--> 10[Tree-sitter] 
                    10 <--> 14[Analysis]
                7 <--> 11[LLVM ‡]
                    11 <--> 14[Analysis]
                7 <--> 12[CodeQL ‡]
                    12 <--> 14[Analysis]

    

X[‡ Yet to be implemented]
```

The user interacts by invoking the CLDK API. The CLDK API is responsible for handling the user requests and delegating them to the appropriate language-specific modules. 

Each language comprises two key components: data models and backends.

1. **Data Models:** These are high-level abstractions that represent the various language constructs and components in a structured format using Pydantic. This makes the models flexible and extensible, and allows easy access to individual data components via simple dot notation. In addition, the data models are designed to be easily serializable and deserializable, making it easy to store and retrieve data from various sources.

2. **Analysis Backends:** These are the components responsible for interfacing with the various program analysis tools. The core backends are Tree-sitter, Javaparser, WALA, LLVM, and CodeQL. The backends handle user requests and delegate them to the appropriate analysis tools. The analysis tools perform the requisite analysis and return the results to the user. The user merely calls one of several high-level API functions such as `get_method_body`, `get_method_signature`, `get_call_graph`, etc., and the backend takes care of the rest.

    Some languages may have multiple backends. For example, Java has WALA, Javaparser, Tree-sitter, and CodeQL backends. The user is free to choose the backend that best suits their needs.
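
The split between data models and backends can be illustrated with a small, self-contained sketch. It is not CLDK's actual implementation: the class names (`MethodModel`, `ClassModel`, `SymbolTableBackend`, `UpperCaseBackend`), the `BACKENDS` registry, and the backend keys are all hypothetical, and standard-library dataclasses stand in for Pydantic models so that the snippet runs without extra dependencies. Only the function name `get_method_body` is taken from the text above.

```python
from __future__ import annotations

import json
from dataclasses import asdict, dataclass, field
from typing import Protocol


@dataclass
class MethodModel:
    """Hypothetical data model for one method (stand-in for a Pydantic model)."""
    signature: str
    body: str


@dataclass
class ClassModel:
    """Hypothetical data model for one class and its methods."""
    name: str
    methods: dict[str, MethodModel] = field(default_factory=dict)


class Backend(Protocol):
    """Hypothetical interface that every analysis backend implements."""
    def get_method_body(self, cls: ClassModel, signature: str) -> str: ...


class SymbolTableBackend:
    """Toy backend that answers queries from the in-memory data model."""
    def get_method_body(self, cls: ClassModel, signature: str) -> str:
        return cls.methods[signature].body


class UpperCaseBackend:
    """Second toy backend, present only to show that backends are swappable."""
    def get_method_body(self, cls: ClassModel, signature: str) -> str:
        return cls.methods[signature].body.upper()


# Hypothetical registry mapping a backend name to an implementation.
BACKENDS: dict[str, Backend] = {
    "symbol-table": SymbolTableBackend(),
    "upper": UpperCaseBackend(),
}


def get_method_body(cls: ClassModel, signature: str, backend: str = "symbol-table") -> str:
    """High-level call: pick the requested backend and delegate to it."""
    return BACKENDS[backend].get_method_body(cls, signature)


demo = ClassModel(name="Greeter", methods={"hello()": MethodModel("hello()", "return 1;")})

# Dot-notation access to a nested field, then a JSON round trip of the model.
first_signature = demo.methods["hello()"].signature
serialized = json.dumps(asdict(demo))
restored_name = json.loads(serialized)["name"]
```

The caller only names a backend; the data model it passes in and gets back stays the same regardless of which tool did the work.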

We are currently working on implementing the retrieval and prompting components. The retrieval component will be responsible for retrieving the relevant code snippets from the codebase for RAG use cases. The prompting component will be responsible for generating the prompts for the CodeLLMs using popular prompting frameworks such as `PDL`, `Guidance`, or `LMQL`.

## Quick Start: Example Walkthrough

In this section, we will walk through a simple example to demonstrate how to use CLDK. We will:

* Set up a local Ollama server to interact with CodeLLMs
* Build a simple code summarization pipeline for a Java application.

### Prerequisites

Before we begin, make sure you have the following prerequisites installed:

  * Python 3.11 or later
  * Ollama v0.3.4 or later
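
If you are unsure which interpreter is active, a quick check along the following lines may help. This is a minimal sketch; the helper name `python_is_supported` is our own and not part of CLDK.

```python
import sys

# Minimum version stated in the prerequisites above.
MIN_PYTHON = (3, 11)


def python_is_supported(version_info=sys.version_info) -> bool:
    """Return True if the (major, minor) version meets the minimum."""
    return tuple(version_info[:2]) >= MIN_PYTHON


if __name__ == "__main__":
    print(f"Python {sys.version_info.major}.{sys.version_info.minor} supported: {python_is_supported()}")
```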

### Step 1:  Set up an Ollama server

If you don't already have Ollama, please download and install it from here: [Ollama](https://ollama.com/download).

Once you have ollama, start the server and make sure it is running.

If you're on Linux, or on WSL with systemd enabled, you can check that the server is running with the following command (macOS does not provide `systemctl`; there, launching the Ollama application starts the server):

```bash
sudo systemctl status ollama
```

You should see an output similar to the following:

```bash
➜ sudo systemctl status ollama
● ollama.service - Ollama Service
     Loaded: loaded (/etc/systemd/system/ollama.service; enabled; preset: enabled)
     Active: active (running) since Sat 2024-08-10 20:39:56 EDT; 17s ago
   Main PID: 23069 (ollama)
      Tasks: 19 (limit: 76802)
     Memory: 1.2G (peak: 1.2G)
        CPU: 6.745s
     CGroup: /system.slice/ollama.service
             └─23069 /usr/local/bin/ollama serve
```

If not, you may have to start the server manually. You can do this by running the following command:

```bash
sudo systemctl start ollama
```
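
On systems without `systemctl`, the same check can be done over HTTP. The sketch below assumes the server listens on Ollama's default address, `http://localhost:11434`; the helper name `ollama_is_running` is our own. It uses only the standard library.

```python
import http.client
import urllib.error
import urllib.request


def ollama_is_running(base_url: str = "http://localhost:11434", timeout: float = 2.0) -> bool:
    """Return True if an HTTP server answers at base_url with status 200."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Connection refused, timeout, DNS failure, or a malformed reply.
        return False


if __name__ == "__main__":
    print("Ollama reachable:", ollama_is_running())
```

Note that this only confirms that something answers on that port; it does not verify that the responder is Ollama.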

#### Pull the latest version of Granite 8b instruct model from ollama

To pull the latest version of the Granite 8b instruct model from ollama, run the following command:

```bash
ollama pull granite-code:8b-instruct
```

Check to make sure the model was successfully pulled by running the following command:

```bash
ollama run granite-code:8b-instruct 'Write a function to print hello world in python'
```

The output should be similar to the following:

```
➜ ollama run granite-code:8b-instruct 'Write a function to print hello world in python'

def say_hello():
    print("Hello World!")
```
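
The presence of the model can also be checked programmatically. The sketch below assumes the server's `/api/tags` endpoint returns a JSON object with a `models` list whose entries carry a `name` field; the helper names and the sample payload are our own, and the sample is hand-written rather than real server output.

```python
import json
import urllib.request


def model_names(tags_payload: dict) -> list[str]:
    """Extract model names from a payload shaped like the /api/tags response."""
    return [entry["name"] for entry in tags_payload.get("models", [])]


def fetch_tags(base_url: str = "http://localhost:11434") -> dict:
    """Fetch the list of locally available models from a running Ollama server."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as response:
        return json.load(response)


# Hand-written sample payload, used here so the check can be shown offline.
sample = {"models": [{"name": "granite-code:8b-instruct"}]}
has_granite = "granite-code:8b-instruct" in model_names(sample)
```

With a running server, `model_names(fetch_tags())` gives the list to check against.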

### Step 2:  Install CLDK

You may install the latest version of CLDK from [PyPI](https://pypi.org/project/cldk/):

```bash
pip install cldk
```

Once CLDK is installed, you can import it into your Python code:

```python
from cldk import CLDK
```
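
To confirm which version ended up in the active environment, the standard library's `importlib.metadata` can be queried. This is a small sketch; the helper name `installed_version` is our own.

```python
from __future__ import annotations

from importlib.metadata import PackageNotFoundError, version


def installed_version(package: str) -> str | None:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    print("cldk version:", installed_version("cldk"))
```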

### Step 3:  Build a code summarization pipeline

Now that we have set up the ollama server and installed CLDK, we can build a simple code summarization pipeline for a Java application.

1. Let's download a sample Java application (Apache Commons CLI):

    * Download and unzip the sample Java application:
        ```bash
        wget https://github.com/apache/commons-cli/archive/refs/tags/rel/commons-cli-1.7.0.zip -O commons-cli-1.7.0.zip && unzip commons-cli-1.7.0.zip
        ```
    * Record the path to the sample Java application:
        ```bash
        export JAVA_APP_PATH=/path/to/commons-cli-1.7.0
        ```
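
Before running any analysis, it can be worth checking that `JAVA_APP_PATH` points at a directory that actually contains Java sources. The following is a minimal sketch using only the standard library; the helper name `java_sources` is our own and not part of CLDK.

```python
from __future__ import annotations

import os
from pathlib import Path


def java_sources(project_root: str | os.PathLike) -> list[Path]:
    """Return all .java files under project_root, sorted for stable output."""
    root = Path(project_root)
    if not root.is_dir():
        raise NotADirectoryError(f"{root} is not a directory")
    return sorted(root.rglob("*.java"))


if __name__ == "__main__":
    app_path = os.getenv("JAVA_APP_PATH")
    if app_path is None or not Path(app_path).is_dir():
        print("JAVA_APP_PATH is unset or does not point at a directory")
    else:
        print(f"{len(java_sources(app_path))} Java files under {app_path}")
```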

Below is a simple code summarization pipeline for a Java application using CLDK. It does the following things:

* Creates a new instance of the CLDK class (see comment `# (1)`)
* Creates an analysis object over the Java application (see comment `# (2)`)
* Iterates over all the files in the project (see comment `# (3)`)
* Iterates over all the classes in the file (see comment `# (4)`)
* Iterates over all the methods in the class (see comment `# (5)`)
* Reads the source of the file that contains the method (see comment `# (6)`)
* Initializes the treesitter utils for the class file content (see comment `# (7)`)
* Sanitizes the class for analysis (see comment `# (8)`)
* Formats the instruction for the given focal method and class (see comment `# (9)`)
* Prompts the local model on Ollama (see comment `# (10)`)
* Prints the instruction and LLM output (see comment `# (11)`)

```python
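# code_summarization_for_java.py

import os
from pathlib import Path

import ollama  # Ollama Python client; install with `pip install ollama` if it is missing
from cldk import CLDK


def format_inst(code, focal_method, focal_class, language="java"):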
    """
    Format the instruction for the given focal method and class.
    """
    inst = f"Question: Can you write a brief summary for the method `{focal_method}` in the class `{focal_class}` below?\n"

    inst += "\n"
    inst += f"```{language}\n"
    inst += code
    inst += "```" if code.endswith("\n") else "\n```"
    inst += "\n"
    return inst

def prompt_ollama(message: str, model_id: str = "granite-code:8b-instruct") -> str:
    """Prompt local model on Ollama"""
    response_object = ollama.generate(model=model_id, prompt=message)
    return response_object["response"]


if __name__ == "__main__":
    # (1) Create a new instance of the CLDK class
    cldk = CLDK(language="java")

    # (2) Create an analysis object over the java application
    analysis = cldk.analysis(project_path=os.getenv("JAVA_APP_PATH"))

    # (3) Iterate over all the files in the project
    for file_path, class_file in analysis.get_symbol_table().items():
        class_file_path = Path(file_path).absolute().resolve()
        # (4) Iterate over all the classes in the file
        for type_name, type_declaration in class_file.type_declarations.items():
            # (5) Iterate over all the methods in the class
            for method in type_declaration.callable_declarations.values():
                
                # (6) Read the source of the file that contains the method
                code_body = class_file_path.read_text()
                
                # (7) Initialize the treesitter utils for the class file content
                tree_sitter_utils = cldk.tree_sitter_utils(source_code=code_body)
                
                # (8) Sanitize the class for analysis
                sanitized_class = tree_sitter_utils.sanitize_focal_class(method.declaration)

                # (9) Format the instruction for the given focal method and class
                instruction = format_inst(
                    code=sanitized_class,
                    focal_method=method.declaration,
                    focal_class=type_name,
                )

                # (10) Prompt the local model on Ollama
                llm_output = prompt_ollama(
                    message=instruction,
                    model_id="granite-code:8b-instruct",  # the model pulled in Step 1
                )

                # (11) Print the instruction and LLM output
                print(f"Instruction:\n{instruction}")
                print(f"LLM Output:\n{llm_output}")
```
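
To see what prompt the pipeline sends to the model, the formatting step can be exercised on its own, without CLDK or Ollama. The sketch below restates `format_inst` with an explicit `language` parameter (an assumption on our part, defaulting to `"java"`) and feeds it a made-up class; the sample code, method, and class names are purely illustrative.

```python
def format_inst(code: str, focal_method: str, focal_class: str, language: str = "java") -> str:
    """Format the instruction for the given focal method and class."""
    fence = "`" * 3  # Markdown code fence, built here to avoid nesting literal fences
    inst = f"Question: Can you write a brief summary for the method `{focal_method}` in the class `{focal_class}` below?\n"
    inst += "\n"
    inst += f"{fence}{language}\n"
    inst += code
    # Close the fence on its own line whether or not the code ends with a newline.
    inst += fence if code.endswith("\n") else f"\n{fence}"
    inst += "\n"
    return inst


# Made-up input, standing in for the sanitized class produced in step (8).
sample_code = 'public class Greeter {\n    public String greet() { return "hi"; }\n}'
prompt = format_inst(sample_code, "greet()", "Greeter")
print(prompt)
```

The resulting prompt is a one-line question followed by the class source inside a fenced block tagged with the language.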

### Publication (papers and blogs related to CLDK)
1. Krishna, Rahul, Rangeet Pan, Raju Pavuluri, Srikanth Tamilselvam, Maja Vukovic, and Saurabh Sinha. "[Codellm-Devkit: A Framework for Contextualizing Code LLMs with Program Analysis Insights.](https://arxiv.org/pdf/2410.13007)" arXiv preprint arXiv:2410.13007 (2024).
2. Pan, Rangeet, Myeongsoo Kim, Rahul Krishna, Raju Pavuluri, and Saurabh Sinha. "[Multi-language Unit Test Generation using LLMs.](https://arxiv.org/abs/2409.03093)" arXiv preprint arXiv:2409.03093 (2024).
3. Pan, Rangeet, Rahul Krishna, Raju Pavuluri, Saurabh Sinha, and Maja Vukovic. "[Simplify your Code LLM solutions using CodeLLM Dev Kit (CLDK).](https://www.linkedin.com/pulse/simplify-your-code-llm-solutions-using-codellm-dev-kit-rangeet-pan-vnnpe/?trackingId=kZ3U6d8GSDCs8S1oApXZgg%3D%3D)" Blog.
