salesforce-codetf

Name: salesforce-codetf
Version: 1.0.2.5
Home page: https://github.com/Salesforce/CodeTF
Summary: CodeTF: A Transformer-based Library for Code Intelligence
Upload time: 2024-02-28 06:01:14
Author: Nghi D. Q. Bui
Requires Python: >=3.8.0
License: Apache 2.0
Keywords: AI4Code, Code Intelligence, Generative AI, Deep Learning, Library, PyTorch, HuggingFace
Requirements: No requirements were recorded.
            
    
<p align="center">
    <br>
    <img src="assets/logo.png" width="500"/>
    <br>
</p>
<div align="center">
  <a href="https://opensource.org/license/apache-2-0/">
  <img alt="license" src="https://img.shields.io/badge/License-Apache%202.0-green.svg"/>
  </a>
   <a href="https://www.python.org/downloads/release/python-380/">
  <img alt="python" src="https://img.shields.io/badge/python-3.8+-yellow.svg"/>
  </a> 
   <a href="https://pypi.org/project/salesforce-codetf/">
  <img alt="downloads" src="https://static.pepy.tech/badge/salesforce-codetf"/>
  </a> 

<a href="https://arxiv.org/pdf/2306.00029.pdf">Technical Report</a>,
<a href="https://opensource.salesforce.com/CodeTF/latest/index.html">Documentation</a>,
<a href="https://github.com/salesforce/CodeTF/tree/main/test_inference">Examples</a>
    
# CodeTF - A One-stop Transformer Library for State-of-the-art Code LLM

<!-- 
[![Code License](https://img.shields.io/badge/Code%20License-Apache_2.0-green.svg)](https://github.com/bdqnghi/CodeTF_personal/blob/main/LICENSE)
[![Python 3.9+](https://img.shields.io/badge/python-3.9+-blue.svg)](https://www.python.org/downloads/release/python-390/)
[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black) -->
 </div>   
    
## Table of Contents
  - [Introduction](#introduction)
  - [Installation](#installation-guide)
  - [Getting Started](#getting-started)
    - [Inferencing Pipeline](#inferencing-pipeline)
    - [Model Zoo](#model-zoo)
    - [Fine-Tuning Your Own Model](#fine-tuning-pipeline)
    - [Evaluate On Well-Known Benchmarks](#evaluate-on-well-known-benchmarks)
    - [Utilities to Manipulate Source Code Based on AST](#code-utilities)
        - [AST Parser in Multiple Languages](#ast-parser-in-multiple-languages)
        - [Extract Code Attributes](#extract-code-attributes)
        - [Remove Comments](#remove-comments)
  - [Ethical and Responsible Use](#ethical-and-responsible-use) 
  - [License](#license)

## Introduction
CodeTF is a one-stop Python transformer-based library for ***code large language models (Code LLMs)*** and ***code intelligence***. It provides a seamless interface for training and inference on code intelligence tasks such as code summarization, translation, and code generation, and it aims to facilitate easy integration of SOTA Code LLMs into real-world applications.

In addition to the core Code LLM features, CodeTF offers utilities for code manipulation across various languages, including easy extraction of code attributes. Using tree-sitter as its core AST parser, it can parse out attributes such as function names, comments, and variable names. Pre-built parser libraries for numerous languages are provided, eliminating the need for complicated parser setup. CodeTF thus ensures a user-friendly and accessible environment for code intelligence tasks.

The current version of the library offers:

- **Fast Model Serving**: We support an easy-to-use interface for rapid inference with **pre-quantized models** (int8, int16, float16). CodeTF handles device management, so users do not have to worry about it. For large models, we offer advanced features such as weight sharding across GPUs to serve them more quickly.
- **Fine-Tuning Your Own Models**: We provide an API for quickly fine-tuning your own LLMs for code using SOTA techniques for **parameter-efficient fine-tuning** (HuggingFace PEFT) in distributed environments.
- **Supported Tasks**: nl2code, code summarization, code completion, code translation, code refinement, clone detection, defect prediction.
- **Datasets+**: We have preprocessed well-known benchmarks (**Human-Eval, MBPP, CodeXGLUE, APPS, etc.**) and offer an easy-to-load feature for these datasets.
- **Model Evaluator**: We provide an interface to evaluate models on well-known benchmarks (e.g., Human-Eval) with popular metrics (e.g., pass@k) with little effort (**~15 LOCs**).
- **Pretrained Models**: We supply pretrained checkpoints of state-of-the-art foundational language models of code (CodeBERT, CodeT5, CodeGen, CodeT5+, Incoder, StarCoder, etc.).
- **Fine-Tuned Models**: We furnish fine-tuned checkpoints for 8+ downstream tasks.
- **Utility to Manipulate Source Code**: We provide utilities to easily manipulate source code, such as user-friendly AST parsers (based on tree-sitter) for **15+ programming languages**, to extract important code features such as function names and identifiers.

The following table lists the supported models, their sizes, and the tasks they support. This is an ongoing effort, and we are working on growing the list further.
    
| Model        | Size                                                                                                                          | Tasks                                                                                                                                                                                                     |
|--------------|-------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------|
| CodeT5       | Base, Base-multi-sum, Base-translate-cs, Base-translate-java, Base-sum, Base-clone, Base-defect                              | Pretrained, NL to Code, Refine, Translation (CS to Java, Java to CS), Summarization (Python, Go, PHP, JavaScript, Java, Ruby), Clone detection, Defect prediction |
| CodeT5+      | Plus-instruct-16B, Plus-16B, Plus-6B, Plus-2B, Plus-770M-python, Plus-770M, Plus-220M                                      | Pretrained, NL to Code, Refine, Defect prediction |
| CodeGen      | Mono: 350M, 2B, 6B, 1B, 3.7B, 7B, 16B<br>Multi: 350M, 2B, 6B<br>NL: 350M, 2B                                           | Pretrained |
| StarCoder    | 15.5B                                                                                                                         | Pretrained |
| SantaCoder   | 1.1B                                                                                                                          | Pretrained |
| GPT-NeoX     | 20B                                                                                                                           | Pretrained |
| GPT-Neo      | 1.3B                                                                                                                          | Pretrained |
| GPT-J        | 6B                                                                                                                            | Pretrained |
| Incoder      | 6B                                                                                                                            | Pretrained |
| CodeParrot   | Small-python (110M), Small-multi(110M), 1.5B                                                                                   | Pretrained |
| CodeBERT     | CodeBERT-base, UnixCoder-base, CodeBERTa-small                                                                                 | Pretrained |


## Installation Guide

1. (Optional) Create a conda environment:

```bash
conda create -n codetf python=3.8
conda activate codetf
```

2. Install from [PyPI](https://pypi.org/project/salesforce-codetf/):
```bash
pip install salesforce-codetf
```
    
3. Alternatively, build CodeTF from source:

```bash
git clone https://github.com/salesforce/CodeTF.git
cd CodeTF
pip install -e .
```

Additionally, to make sure the quantization feature works well, also install these dependencies:
```bash
pip install -q -U git+https://github.com/huggingface/transformers.git
pip install -q -U git+https://github.com/huggingface/peft.git
pip install -q -U git+https://github.com/huggingface/accelerate.git
```

For some models, such as [StarCoder](https://github.com/bigcode-project/starcoder), you must be logged in to Hugging Face. Please obtain a Hugging Face token and log in:
```bash
huggingface-cli login
```

## Getting Started
### Inferencing Pipeline
    
Getting started with CodeTF is simple and quick with our model loading pipeline function ``load_model_pipeline()``. Here's an example showing how to load a CodeT5+ model and perform inference on a code generation task:
    
```python
from codetf.models import load_model_pipeline

code_generation_model = load_model_pipeline(model_name="codet5", task="pretrained",
            model_type="plus-770M-python", is_eval=True,
            load_in_8bit=True, load_in_4bit=False, weight_sharding=False)
            
result = code_generation_model.predict(["def print_hello_world():"])
print(result)
```
There are a few notable arguments to consider:
-  ``model_name``: the name of the model; ``codet5`` and ``causal-lm`` are currently supported.
-  ``model_type``: the type of model for each model name, e.g. ``base``, ``codegen-350M-mono``, ``j-6B``, etc.
-  ``load_in_8bit`` and ``load_in_4bit``: inherit the dynamic quantization feature from [Huggingface Quantization](https://huggingface.co/docs/transformers/main/main_classes/quantization).
-  ``weight_sharding``: our advanced feature that leverages [HuggingFace Sharded Checkpoint](https://huggingface.co/docs/accelerate/v0.19.0/en/package_reference/big_modeling#accelerate.load_checkpoint_and_dispatch) to split a large model into several smaller shards across different GPUs. Consider using this when dealing with large models; see the sketch below.
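
For instance, a large checkpoint such as ``codegen2-16B`` could be served with weight sharding enabled. This is a minimal sketch reusing the arguments above; substitute whichever large model you actually want to shard:

```python
from codetf.models import load_model_pipeline

# Weight sharding splits the checkpoint into smaller shards that are
# dispatched across the available GPUs (via Accelerate), which helps when
# a single device cannot hold the full model.
large_model = load_model_pipeline(model_name="causal-lm", task="pretrained",
            model_type="codegen2-16B", is_eval=True,
            load_in_8bit=False, weight_sharding=True)

print(large_model.predict(["# A function that reverses a linked list\n"]))
```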

### Model Zoo
You might want to view all of the supported models. To do this, you can print the ``model_zoo``:
```python
from codetf.models import model_zoo
print(model_zoo)
# ============================================================================================================
# Architectures                  Types                           Tasks
# ============================================================================================================
# causallm                       codegen-350M-mono              pretrained
#                                codegen-350M-multi             pretrained
#                                codegen-350M-nl                pretrained
#                                codegen-2B-mono                pretrained
#                                codegen-2B-multi               pretrained
#                                codegen-2B-nl                  pretrained
#                                codegen-6B-mono                pretrained
#                                codegen-6B-nl                  pretrained
#                                codegen-6B-multi               pretrained
#                                starcoder-15.5B                pretrained
#                                gpt-neox-20B                   pretrained
#                                gpt-neo-1.3B                   pretrained
#                                gpt-j-6B                       pretrained
#                                incoder-6B                     pretrained
#                                codegen2-1B                    pretrained
#                                codegen2-3.7B                  pretrained
#                                codegen2-7B                    pretrained
#                                codegen2-16B                   pretrained
# codet5                         base-multi-sum                 pretrained
#                                base                           nl2code
#                                base                           refine
#                                base                           translate_cs_java
#                                base                           translate_java_cs
#                                base                           sum_python
#                                base                           sum_go
#                                base                           sum_php
#                                base                           sum_javascript
#                                base                           sum_java
#                                base                           sum_ruby
#                                base                           clone
#                                base                           defect
#                                plus-instruct-16B              pretrained
#                                plus-16B                       pretrained
#                                plus-6B                        pretrained
#                                plus-2B                        pretrained
#                                plus-770M-python               pretrained
#                                plus-770M                      pretrained
#                                plus-220M                      pretrained
# bert                           codebert-base                  pretrained
#                                unixcoder-base                 pretrained
#                                codeberta-small                pretrained
```
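
Each row maps directly onto the ``load_model_pipeline()`` arguments (architecture → ``model_name``, type → ``model_type``, task → ``task``). As a minimal sketch following the examples above, the ``codet5`` / ``base`` / ``sum_python`` entry could be loaded like this:

```python
from codetf.models import load_model_pipeline

# Load the CodeT5-base checkpoint fine-tuned for Python summarization,
# matching the "codet5 / base / sum_python" row of the model zoo above.
summarizer = load_model_pipeline(model_name="codet5", task="sum_python",
            model_type="base", is_eval=True)

code = "def add(a, b):\n    return a + b"
print(summarizer.predict([code]))
```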

### Fine-Tuning Pipeline
Want to train a custom LLM for code? We've got you covered. The ``CodeT5Seq2SeqTrainer``, together with our dataset utilities, makes it easy to fine-tune a [CodeT5+ pretrained model](https://github.com/salesforce/CodeT5) on the CodeXGLUE dataset. Here's an example:
    
```python
from codetf.trainer.codet5_trainer import CodeT5Seq2SeqTrainer
from codetf.data_utility.codexglue_dataset import CodeXGLUEDataset
from codetf.models import load_model_pipeline
from codetf.performance.evaluation_metric import EvaluationMetric
from codetf.data_utility.base_dataset import CustomDataset

model_class = load_model_pipeline(model_name="codet5", task="pretrained",
            model_type="plus-220M", is_eval=True)

dataset = CodeXGLUEDataset(tokenizer=model_class.get_tokenizer())
train, test, validation = dataset.load(subset="text-to-code")

train_dataset= CustomDataset(train[0], train[1])
test_dataset= CustomDataset(test[0], test[1])
val_dataset= CustomDataset(validation[0], validation[1])

evaluator = EvaluationMetric(metric="bleu", tokenizer=model_class.tokenizer)

# peft can be in ["lora", "prefixtuning"]
trainer = CodeT5Seq2SeqTrainer(train_dataset=train_dataset, 
                                validation_dataset=val_dataset, 
                                peft="lora",
                                pretrained_model_or_path=model_class.get_model(),
                                tokenizer=model_class.tokenizer)
trainer.train()
```

Compared to [this script from StarCoder](https://github.com/bigcode-project/starcoder/blob/main/finetune/finetune.py), which requires ~300 LOCs to fine-tune a model, we only need about 14 LOCs to do the same!


### Evaluate on Well-Known Benchmarks
Planning to reproduce the results of well-known benchmarks like ``Human-Eval``, but struggling to match the numbers reported in the original papers? Worried about the complicated evaluation process? Don't worry, we've got you covered with an intuitive, easy-to-use interface. Here's a sample snippet demonstrating how to evaluate on Human-Eval using pass@k (k=[1,10,100]) as the metric:
```python
import os

from torch.utils.data import TensorDataset

from codetf.models import load_model_pipeline
from codetf.data_utility.human_eval_dataset import HumanEvalDataset
from codetf.performance.model_evaluator import ModelEvaluator

os.environ["HF_ALLOW_CODE_EVAL"] = "1"
os.environ["TOKENIZERS_PARALLELISM"] = "true"

model_class = load_model_pipeline(model_name="causal-lm", task="pretrained",
            model_type="codegen-350M-mono", is_eval=True,
            load_in_8bit=True, weight_sharding=False)

dataset = HumanEvalDataset(tokenizer=model_class.get_tokenizer())
prompt_token_ids, prompt_attention_masks, references= dataset.load()

problems = TensorDataset(prompt_token_ids, prompt_attention_masks)

evaluator = ModelEvaluator(model_class)
avg_pass_at_k = evaluator.evaluate_pass_k(problems=problems, unit_tests=references)
print("Pass@k: ", avg_pass_at_k)
```

Compared to [this script from HuggingFace](https://github.com/huggingface/transformers/blob/main/examples/research_projects/codeparrot/scripts/human_eval.py), which requires ~230 LOCs to evaluate on pass@k, we only need about 14 LOCs to do the same!

### Loading Preprocessed Data
CodeTF provides dataset utilities for several well-known benchmarks, such as CodeXGLUE, Human-Eval, MBPP, and APPS. The following is an example of how to load the CodeXGLUE dataset:

```python
from codetf.data_utility.codexglue_dataset import CodeXGLUEDataset
from transformers import RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("Salesforce/codet5-base", use_fast=True)
dataset = CodeXGLUEDataset(tokenizer=tokenizer)
train, test, validation = dataset.load(subset="text-to-code")
```

The ``train``, ``test``, and ``validation`` splits are returned as [PyTorch tensors](https://pytorch.org/docs/stable/tensors.html), giving users the flexibility to wrap them in higher-level constructs for their own use cases, as sketched below.
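
For example, a split can be fed to a standard PyTorch ``DataLoader``. This is a minimal sketch, assuming each split is indexable as a pair of input and target tensors, as in the fine-tuning example above:

```python
from torch.utils.data import TensorDataset, DataLoader

# train[0] holds the tokenized inputs and train[1] the targets,
# mirroring how CustomDataset is built in the fine-tuning example.
train_data = TensorDataset(train[0], train[1])
train_loader = DataLoader(train_data, batch_size=8, shuffle=True)

for input_ids, target_ids in train_loader:
    # Feed each mini-batch into your own training or analysis loop.
    print(input_ids.shape, target_ids.shape)
    break
```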

### Code Utilities
In addition to providing utilities for LLMs, CodeTF also equips users with tools for effective source code manipulation. This is crucial in the code intelligence pipeline, where operations like parsing code into an Abstract Syntax Tree (AST) or extracting code attributes (such as function names or identifiers) are often required (e.g., in CodeT5). These tasks can be challenging to execute, especially when setup and multi-language support are needed. Our code utility interface offers a streamlined solution, facilitating easy parsing and attribute extraction from code across 15+ languages.


#### AST Parser in Multiple Languages

CodeTF includes AST parsers compatible with numerous programming languages. Here's an example showcasing the parsing of Apex code into an AST:
```python
from codetf.code_utility.apex.apex_code_utility import ApexCodeUtility

apex_code_utility = ApexCodeUtility()

sample_code = """
    public class SampleClass {    
        public Integer myNumber;
        
        /**
        * This is a method that returns the value of myNumber.
        * @return An integer value
        */
        public Integer getMyNumber() {
            // Return the current value of myNumber
            return this.myNumber;
        }
    }
"""
ast = apex_code_utility.parse(sample_code)

# This will print the tree-sitter AST object
print(ast)
```

Then you can traverse the tree using the interface from [py-tree-sitter](https://github.com/tree-sitter/py-tree-sitter):
```python
root_node = ast.root_node
# Inspect the root node; the exact node type and span depend on the Apex grammar.
print(root_node.type)
print(root_node.start_point, root_node.end_point)
```

There are also other utilities for Java, Python, etc., that can perform the same operations.
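
As a small illustration, the returned tree can be walked recursively with the standard py-tree-sitter ``Node`` interface (``type``, ``children``, ``start_point``); the concrete node types you see depend on the underlying grammar:

```python
def walk(node, depth=0):
    # Print each node's type and starting position, indented by depth.
    print("  " * depth + f"{node.type} at {node.start_point}")
    for child in node.children:
        walk(child, depth + 1)

# `ast` is the tree returned by apex_code_utility.parse(sample_code) above.
walk(ast.root_node)
```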

#### Extract Code Attributes

CodeTF provides an interface to easily extract code attributes. The following example extracts attributes such as class names, method names, and variable names from an Apex snippet:

```python
code_attributes = apex_code_utility.get_code_attributes(sample_code)
print(code_attributes)
```

This prints a dictionary of the extracted attributes, for example:
```python
{'class_names': ['AccountWithContacts'], 'method_names': ['getAccountsWithContacts'], 'comments': [], 'variable_names': ['acc', 'accounts', 'con', 'System', 'debug', 'Contacts', 'Id', 'Name', 'Account', 'Email', 'LastName']}
```

#### Remove Comments
Other utilities are also available, such as removing comments from code:
```python
new_code_snippet = apex_code_utility.remove_comments(sample_code)
print(new_code_snippet)
```

This will print:
```java
public class SampleClass {    
        public Integer myNumber;
        public Integer getMyNumber() {
            return this.myNumber;
        }
    }
```

Note that this is an ongoing process; we will add more features for extracting complex code attributes in the future. More examples can be found [here](https://github.com/salesforce/CodeTF/tree/main/test_code_utilities).

## More Examples
You can find more examples for each use case:
- [Fine-tuning](https://github.com/salesforce/CodeTF/tree/main/test_trainer)
- [Inferencing](https://github.com/salesforce/CodeTF/tree/main/test_inference)
- [Model Evaluate](https://github.com/salesforce/CodeTF/tree/main/test_evaluation)
- [Code Utility](https://github.com/salesforce/CodeTF/tree/main/test_code_utilities)

## Notes
- CodeTF is designed to complement and enhance the capabilities of [HuggingFace Transformers](https://huggingface.co/docs/transformers/index), rather than replace it. It serves as a specialized layer specifically tailored for code intelligence tasks, such as fine-tuning language models with code-specific features and evaluating on well-known code intelligence benchmarks. If users require more customization, they are encouraged to write their own training code from scratch.
- CodeTF leverages the powerful functionality provided by [Accelerate](https://github.com/huggingface/accelerate) for both inference and training. With Accelerate, users do not need to manually manage GPUs or CPU devices for most operations, allowing for a streamlined and efficient workflow.

## Ethical and Responsible Use
CodeTF, while powerful, does not guarantee infallible code intelligence capabilities. Users may encounter inaccuracies or biases, possibly leading to misinterpretations or undesired behaviors. Risks include the generation of insecure code, propagation of poor coding practices, or inadvertent revelation of sensitive data. We strongly advise users to examine the pretrained models and system before practical adoption. CodeTF facilitates effective code analysis, prediction, and debugging, promoting reproducible research and development. We encourage its responsible use for enhancing software quality and developer productivity.

However, misuse can lead to unethical outcomes such as unauthorized code manipulation, privacy breaches, or insecure coding practices. Users should familiarize themselves with guidelines for responsible AI before using CodeTF. Our commitment is to continually refine the library by identifying and mitigating potential biases and inappropriate behaviors. Users should review the models and system before practical implementation, and contribute towards refining the library to ensure ethical usage.

## Technical Report and Citing CodeTF
You can find more details in our [technical report](https://arxiv.org/abs/2306.00029).

If you're using CodeTF in your research or applications, please cite using this BibTeX:
```bibtex
@misc{nghi2023codetf,
      title={CodeTF: A Transformer-based Library for CodeLLM \& Code Intelligence},
      author={Nghi D. Q. Bui and Henry Le and Yue Wang and Akhilesh Deepak Gotmare and Junnan Li and Steven Hoi},
      year={2023},
      eprint={2306.00029},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
```

## Contact us
If you have any questions, comments or suggestions, please do not hesitate to contact us at codetf@salesforce.com.

## License
[Apache License Version 2.0](LICENSE.txt)

            
