Text-to-SQL for DataGovernance Technologies and Education

Pichler, Fiona and Kern, Benjamin (2024) Text-to-SQL for DataGovernance Technologies and Education. Other thesis, OST Ostschweizer Fachhochschule.

[thumbnail of HS 2024 2025-SA-EP-Pichler-Text-to-SQL für DataGovernance Technologies und die Aus- und.pdf] Text
HS 2024 2025-SA-EP-Pichler-Text-to-SQL für DataGovernance Technologies und die Aus- und.pdf - Supplemental Material

Download (1MB)

Abstract

This project explores the development of a Natural Language to SQL (NL-to-SQL) system using
Large Language Models (LLMs). The primary objective is to research different approaches of
generating SQL from NL and evaluate their feasibility. Subsequently a PoC was implemented to
demonstrate real-world usefulness.

The research found four key approaches which were evaluated: pure LLM, in-context learning,
fine-tuning the LLM, and Retrieval-Augmented Generation (RAG). Fine-tuning was ruled out due
to an insufficient amount of training data and time. The remaining three approaches were imple-
mented in Python and relevant LLM APIs and thoroughly evaluated.

Key findings indicate that while pure LLM approaches and in context learning provide a base-
line, RAG significantly enhances the accuracy and reliability of the generated queries. The results
of the test queries were evaluated in terms of 1. similarity (how similar is an output compared to
the example solution?), 2. validity (is the output valid SQL?), 3. executability (can the output be
executed and are the generated column names correct?), 4. reliability (how similar is the output
to the same user prompt?).

Testsets were divided into two grades: basic and advanced, based on its complexity. For the
advanced test cases, on average 45% can be executed on the databases, with only 70% of the re-
quested columns extracted. Hallucinations could not be completely eliminated when extending
the scope of the context, resulting in the low number of executable SQL queries. In this thesis,
the llama3.2 model was found to have the most potential for further development.
The PoC application demonstrates the feasibility of using RAG with metadata about database
schemas as well as JSONL input of users to generate SQL from NL, offering a user-friendly inter-
face for both technical and non-technical users.

Item Type: Thesis (Other)
Subjects: Area of Application > Academic and Education
Technologies > Programming Languages > Python
Technologies > Databases > PostgreSQL
Technologies > Databases > SQL
Metatags > IFS (Institute for Software)
Divisions: Bachelor of Science FHO in Informatik > Bachelor Thesis
Depositing User: OST Deposit User
Contributors:
Contribution
Name
Email
Thesis advisor
Keller, Stefan
UNSPECIFIED
Date Deposited: 18 Feb 2025 12:29
Last Modified: 18 Feb 2025 12:29
URI: https://eprints.ost.ch/id/eprint/1271

Actions (login required)

View Item
View Item