- Main
Modeling Source Code For Developers
- Jesse, Kevin
- Advisor(s): Devanbu, Premkumar T
Abstract
Software Engineering practices are changing in an age of artificial intelligence. While the core activities of design, develop, maintain, test and evaluate remain, the methods used in these activities are evolving. The prevalence of generative programming models has the potential to reconstitute the duties of a software engineer. Widely adopted models like Copilot and Bard are IDE-based pair-programming assistants that create code from virtually any input: contextual code, natural language, specifications, input output pairs, etc. The way developers interact with these models will redefine some core ideas of software engineering. These models empower virtually anyone, of varying coding proficiency, to create software. Models with the capacity to code will surely manage to inherit software design and analysis capabilities [186], albeit for now, with specific training or prompting.
Naturally, one wonders how language modeling, or more specifically the modeling of source code and its features, will impact developers. Researchers often conjecture on the varying degree of influence these methods will have, but certainly, these tools will support developers in new and existing tasks: code completion, bug and vulnerability detection, code summarization, type annotation, and more are already prominent use cases. One can envision a world where software developers delegate portions of their work to machine learning pipelines, such as unit testing and vulnerability testing of their code; how much of that code they actually write is up for debate as well. Developers will likely automate portions of their work flow but simultaneously gain new tasks and responsibilities. These tasks might include passing automatic code reviews that detect code smells, place code comments automatically, and detect refactorings; maybe using models from [56], [133], [59]. These capabilities come from modeling source code and its features directly by distilling down meaningful representations for the task at hand.
This thesis explores learning meaningful representations from code through a variety of applications for developer supporting tools. The first application is a type-prediction model using representations learned with masked-language-modeling. While effective, we find that the off-the-shelf model fails at an aspect of modeling source code, namely the use of local user-defined types. The next application modifies the model learned representations with one characterized by an objective function capturing how developers actually use types. Along this body of work, the next two chapters present a type inference dataset for the community and a framework for new machine learning models with a Visual Studio plugin. This thesis concludes with a study of large language models on single statement bug introduction and proposes avoidance strategies. Finally I present some future work to improve these models. By reading this thesis, I hope the reader has a few takeaways:
(1) Machine learning is an essential tool for capturing code and its meta data. Models trained on code and its features are capable of generalizing and improving old and new processes.
(2) The data that models train on is not perfect, and the resulting models often inherit biases towards vulnerable and buggy code; researchers must evaluate the risk vs. reward with broadly trained models.
(3) The objectives optimized for software models may not align with our goals; models that incorporate human feedback may ultimately align better to our values and understanding of code.
(4) Large language models are powerful tools for software engineering, but they’re only part of the picture. Models that learn data and control flow, project and file meta data, local and global scope semantics, and information associated with code traces, are better informed on the source code it consumes and produces.
This thesis attempts to quantify the utility of off-the-shelf LLMs like BERT, the misalignment of LLM representations to human derived representations of coding constructs, and the present risks of using LLM predictions at face value. Hopefully, in each case, the chapters leave you optimistic that many of the aforementioned concerns can be minimized, or mitigated with just a bit of ingenuity.
Main Content
Enter the password to open this PDF file:
-
-
-
-
-
-
-
-
-
-
-
-
-
-