LLM Assisted Development

Castelberg, Dominik and Flury, Flury (2024) LLM Assisted Development. Other thesis, OST Ostschweizer Fachhochschule.

SA Castelberg-Flury.pdf - Supplemental Material

Download (3MB)

Abstract

A. Introduction
In recent years, Large Language Models (LLMs) have become valuable tools for developers, thanks to their ability to quickly scaffold code, and act as an impactful accelerator for development teams around the world. With code being heavily standardized on a global scale and an abundance of training data freely available on platforms such as GitHub, it is easy to intuit possible reasons for their performance. While these assumptions hold true for popular languages such as Java, Python or C#, there has been little research into the use of LLMs as development tools for less commonly used languages such as Haskell. With this project we aim to explore LLM-based development support for Haskell and to develop an environment in which such research can be conducted quickly and efficiently.
B. Approach
Four state-of-the-art models (Llama 2, Code Llama, GPT-3.5 and GPT-4) were evaluated based on their performance on tasks typically faced by an automated development support tool. The tasks were scoped and classified into three major categories: Code Generation, Debugging and Testing. For each of these categories, evaluation criteria were defined; each criterion had to be expressible as a quantifiable metric to allow a comparative analysis between the models. These metrics were then weighted according to their importance, based on the insights of experts in the field of Haskell development. Each task was executed with three sorting algorithms of varying cognitive complexity in sample implementations. Cognitive complexity was selected as the complexity measure after careful evaluation, to ensure that the ordering of the algorithms reflects how difficult they are to interpret. Utilizing cognitive complexity inspired us to frame LLMs as entities whose performance can be analysed through the lens of cognitive load theory. This enabled the differentiation between errors caused by the complexity of a provided algorithm (intrinsic load) and errors caused by unclear instructions (extraneous load). To accelerate the evaluation process, we created a development environment that enables both the automated testing of generated Haskell code using Jupyter notebooks and the use of cloud-hosted models with modern LLM tooling such as LangChain.
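
As an illustration of the evaluation setup, the following is a minimal Haskell sketch of what a sample task and its automated correctness check might look like: a low-complexity sorting implementation of the kind a model would be asked to generate, compared against the library sort on fixed inputs. The function names and test inputs are illustrative assumptions; the actual harness runs generated code inside Jupyter notebooks and is not reproduced here.

-- Hypothetical sketch of an evaluation task (names and inputs are
-- illustrative, not taken from the thesis): a candidate sorting
-- implementation plus a minimal automated correctness check.
module Main where

import Data.List (sort)

-- Candidate implementation a model might be asked to produce:
-- insertion sort, a low-cognitive-complexity example.
insertionSort :: Ord a => [a] -> [a]
insertionSort = foldr insert []
  where
    insert x []     = [x]
    insert x (y:ys)
      | x <= y      = x : y : ys
      | otherwise   = y : insert x ys

-- Automated check: compare the candidate against the library sort
-- on a handful of fixed inputs and report pass/fail per case.
main :: IO ()
main = mapM_ check testCases
  where
    testCases = [[], [1], [3,1,2], [5,5,4,1,2,3]] :: [[Int]]
    check xs =
      let ok = insertionSort xs == sort xs
      in putStrLn $ (if ok then "PASS " else "FAIL ") ++ show xs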
C. Conclusion
Our work has shown that there are large gaps in output quality between the models. We encountered outliers, but we are confident that these outliers were not caused by the intrinsic complexity of the algorithms provided and can instead be explained by extraneous complexity that led to the model not understanding the task. This problem can be resolved with further prompt engineering and fine-tuning, which is why we are optimistic about the viability of models such as GPT-4 or Code Llama as supporting tools for Haskell development. Using Chain-of-Thought prompting, which leads models to break a given task down into sequential subtasks, tends to increase output quality in general. However, its sequential nature led to inconsistencies in the composition of Haskell's higher-order functions: it caused the models to neglect critical nuances in function composition, resulting in erroneous code generation.
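
To make the composition issue concrete, the following is a hypothetical Haskell example (not taken from the thesis) of the kind of nuance a sequential, step-by-step decomposition can miss: two pipelines built from the same higher-order functions, differing only in composition order, both type-check but produce different results.

-- Hypothetical illustration of a composition nuance: the intended
-- pipeline filters on the original characters before transforming
-- them, while the mis-composed variant filters after transforming.
import Data.Char (toUpper, isLower)

-- Intended: keep lowercase characters, then upper-case the survivors.
keepAndShout :: String -> String
keepAndShout = map toUpper . filter isLower

-- A plausible mis-composition: no character is lowercase once
-- toUpper has run, so the filter discards everything.
shoutThenKeep :: String -> String
shoutThenKeep = filter isLower . map toUpper

main :: IO ()
main = do
  putStrLn (keepAndShout "Hello, World!")  -- "ELLOORLD"
  putStrLn (shoutThenKeep "Hello, World!") -- ""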

Item Type: Thesis (Other)
Subjects: Area of Application > Development Tools
Technologies > Programming Languages > Haskell
Metatags > IFS (Institute for Software)
Divisions: Research and Development > IFS - Institute for Software
Depositing User: Stud. I
Contributors: Purandere, Mitra (Thesis advisor; Email: UNSPECIFIED)
Date Deposited: 12 Feb 2024 16:02
Last Modified: 12 Feb 2024 16:05
URI: https://eprints.ost.ch/id/eprint/1157
