LEARNING PROGRAM EMBEDDING FROM UNLABELED SOURCE CODE

UC Davis Electronic Theses and Dissertations


Abstract

Machine-learning models can reach very high performance with supervised training, where they learn from labeled data. However, supervised training requires annotating data with the desired output labels, which can be a difficult and time-consuming task. Meanwhile, advances in deep learning models and hardware have made it possible to train very large models, which was not feasible a few years ago. Although training such large models requires a substantial amount of supervised data, models can overcome this limitation by first learning from unlabeled data. Pre-trained language models enable us to achieve state-of-the-art performance from large-scale models with limited supervised data. During the pre-training stage, models are exposed to unlabeled data, and their weights are adjusted using self-supervised tasks such as filling masked tokens or spans, denoising artificially injected noise, or simple auto-regressive token generation. These tasks help the model learn the token distribution and context of the programming language. After acquiring this knowledge, the model can be easily fine-tuned for a specific task, or even used without any fine-tuning.

This thesis began with work on foundation models, which are pre-trained on simple tasks like mask-filling and denoising and then fine-tuned for task-specific applications. We show how to effectively apply pre-trained language models to software engineering (SE) tasks, including traditional ones such as code correction and novel ones such as decompiling binaries. We also investigate the effectiveness of multilingual training and demonstrate how knowledge can be transferred from one language to another, thereby improving the model's performance on three tasks: code summarization, code search, and method name prediction. We further investigate what foundation models learn during the pre-training stage. It is evident that learning the syntactic distribution is relatively easy and can be done using tasks such as masked language modeling (MLM); however, it is less clear whether the model learns semantics and is robust to meaning-preserving transformations. In our work, we find that the model does learn semantics and is robust to meaning-preserving transformations.

More recently, large language models (LLMs), with a few billion to 540 billion parameters and trained on billions of tokens, have become available. They are proficient at few-shot learning, where just a few examples provided at query time are sufficient. These models have largely eliminated the need for a fine-tuning stage and can perform well with only a few training samples, or none at all. It is worth noting that during zero-shot and few-shot learning, we do not change the parameters of the model; we simply use the prompt to condition the model's text generation in a more desirable direction. However, the model's performance relies heavily on the prompt, and prompt engineering for different tasks has become a focus of attention for numerous researchers. In this work, we demonstrate how to automatically augment prompts with semantic information for code summarization, achieving state-of-the-art performance on this task.
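As a rough illustration of the few-shot setting described above, the sketch below assembles a prompt for code summarization from a handful of code–summary exemplars; no model parameters are updated, and the demonstrations only condition the model's generation at inference time. The exemplar pool, the prompt layout, and the function names are illustrative assumptions for this page, not the exact pipeline used in the thesis (which additionally augments each example with automatically derived semantic facts about the code).

```python
# Minimal sketch of few-shot prompting for code summarization.
# The exemplars, prompt format, and helper names are illustrative
# assumptions, not the thesis's actual prompt-construction pipeline.

FEW_SHOT_EXEMPLARS = [
    {
        "code": "def add(a, b):\n    return a + b",
        "summary": "Return the sum of two numbers.",
    },
    {
        "code": "def is_even(n):\n    return n % 2 == 0",
        "summary": "Check whether a number is even.",
    },
]


def build_prompt(target_code: str, exemplars=FEW_SHOT_EXEMPLARS) -> str:
    """Assemble a few-shot prompt: demonstrations followed by the query.

    The model is expected to continue the text after the final "Summary:",
    producing a natural-language description of `target_code`.
    """
    parts = []
    for ex in exemplars:
        parts.append(f"Code:\n{ex['code']}\nSummary: {ex['summary']}\n")
    parts.append(f"Code:\n{target_code}\nSummary:")
    return "\n".join(parts)


if __name__ == "__main__":
    query = "def reverse(xs):\n    return xs[::-1]"
    # The resulting string would be sent to an LLM completion endpoint.
    print(build_prompt(query))
```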
