Citation Recognition in Legal Documents

Eggmann, Mirio and Fritz, Jasmin and Glasl, Dario (2025) Citation Recognition in Legal Documents. Other thesis, OST Ostschweizer Fachhochschule.

HS 2024 2025-BA-EP-Eggmann-Glasl-Fitz-Erkennung von Zitatreferenzen.pdf - Supplemental Material

Abstract

This bachelor thesis develops a transformer-based system for citation recognition in legal documents. It is intended to replace an existing solution based on regular expression matching, which is hard to maintain and does not generalize to new citation formats.

The task is divided into two steps. A first model classifies citations in legal texts as CASELAW, LAW or LITERATURE. A second model identifies the parts of these citations, such as COURT, DATE and ARTICLE.
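The two-step output described above can be pictured as a small data structure. The following is a minimal sketch, not the thesis's actual data model: the class names, the character-offset representation and the example citation are assumptions made for illustration. Only the label names (CASELAW, LAW, LITERATURE, COURT, DATE, ARTICLE) come from the abstract.

```python
from dataclasses import dataclass, field
from enum import Enum


class CitationType(Enum):
    # Step 1 labels, as named in the abstract
    CASELAW = "CASELAW"
    LAW = "LAW"
    LITERATURE = "LITERATURE"


@dataclass
class CitationPart:
    label: str  # step 2 label, e.g. "COURT", "DATE", "ARTICLE"
    start: int  # character offset inside the citation text (assumed representation)
    end: int    # exclusive end offset


@dataclass
class Citation:
    kind: CitationType
    text: str
    parts: list[CitationPart] = field(default_factory=list)

    def part_text(self, label: str) -> list[str]:
        """Return the text of every part carrying the given label."""
        return [self.text[p.start:p.end] for p in self.parts if p.label == label]


def make_part(citation_text: str, label: str, value: str) -> CitationPart:
    """Hypothetical helper: locate `value` in the citation and wrap it as a part."""
    start = citation_text.index(value)
    return CitationPart(label, start, start + len(value))


# Invented example citation, not taken from the thesis data
text = "Federal Supreme Court, judgment of 12 March 2020"
example = Citation(
    CitationType.CASELAW,
    text,
    [
        make_part(text, "COURT", "Federal Supreme Court"),
        make_part(text, "DATE", "12 March 2020"),
    ],
)
```

Keeping the citation type and its parts in one record mirrors the split between the two models: the first fills in `kind` and `text`, the second fills in `parts`.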

The existing regex solution serves as both the data source and the benchmark for the first step, recognizing citations. Ten million text chunks containing legal citations are used to fine-tune a Google BERT base multilingual (uncased) model. The fine-tuned model achieves a test recall of 96.9%. On a manually annotated dataset, it reaches a recall of 74.33%, compared with 72.15% for the regex solution.
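Using a rule-based system as a data source usually means letting its matches label raw text automatically. The sketch below illustrates the idea under loud assumptions: `LAW_PATTERN` is an invented pattern, not one of the regexes of the existing solution, and the token-level BIO tagging scheme is assumed, since the abstract does not say how the training chunks are labelled.

```python
import re

# Hypothetical pattern standing in for the existing regex solution
LAW_PATTERN = re.compile(r"Art\.\s*\d+[a-z]?(?:\s+Abs\.\s*\d+)?\s+[A-Z][A-Za-z]{1,5}")


def weak_label(text: str) -> list[tuple[str, str]]:
    """Tag whitespace-separated tokens with BIO labels derived from regex matches."""
    spans = [m.span() for m in LAW_PATTERN.finditer(text)]
    labelled = []
    for tok in re.finditer(r"\S+", text):
        tag = "O"  # default: token is outside any citation
        for start, end in spans:
            if tok.start() >= start and tok.end() <= end:
                # First token of a match begins the citation, the rest continue it
                tag = "B-LAW" if tok.start() == start else "I-LAW"
                break
        labelled.append((tok.group(), tag))
    return labelled


# Invented example sentence
example = weak_label("See Art. 41 OR for liability.")
```

Labels produced this way inherit the blind spots of the rules, which is one reason to evaluate on a separate, manually annotated dataset as well.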

Sparse training data poses a challenge for recognizing the parts of a citation. Few-shot prompting with an LLM yields good results, but experiments show that it is prohibitively slow in practice. Therefore, knowledge distillation is used to fine-tune a DistilBERT base (uncased) model under the supervision of Llama 3.3 70B. Although DistilBERT is 1,044 times smaller, it performs similarly to Llama. The distilled model achieves a test recall of 99.11%. On a manually annotated dataset, it reaches a recall of 93.37%.
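One common way to distill an LLM into a small model is to let the teacher label unlabelled examples and train the student on those labels. The sketch below shows only the dataset-building part of that idea, and it rests on assumptions: the JSON answer format, the filtering rule and all function names are hypothetical, and `stub_teacher` is a stand-in for a real Llama call. The abstract does not specify how the teacher's supervision is passed to the student.

```python
import json


def parse_teacher_output(raw: str):
    """Parse the teacher's answer into (label, text) pairs, or None if unusable."""
    try:
        items = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(items, list):
        return None
    parts = []
    for item in items:
        if not isinstance(item, dict):
            return None
        label, text = item.get("label"), item.get("text")
        if not (isinstance(label, str) and isinstance(text, str)):
            return None
        parts.append((label, text))
    return parts


def build_student_dataset(citations, teacher):
    """Collect teacher-labelled examples for fine-tuning the student model."""
    dataset = []
    for citation in citations:
        parts = parse_teacher_output(teacher(citation))
        # Keep an example only if every labelled part occurs verbatim in the citation
        if parts and all(text in citation for _, text in parts):
            dataset.append({"citation": citation, "parts": parts})
    return dataset


def stub_teacher(citation: str) -> str:
    """Hypothetical stand-in for a few-shot prompted LLM; returns canned answers."""
    canned = {
        "Federal Supreme Court, judgment of 12 March 2020": (
            '[{"label": "COURT", "text": "Federal Supreme Court"},'
            ' {"label": "DATE", "text": "12 March 2020"}]'
        )
    }
    return canned.get(citation, "not valid JSON")


dataset = build_student_dataset(
    ["Federal Supreme Court, judgment of 12 March 2020", "unparseable citation"],
    stub_teacher,
)
```

Because the slow teacher runs only once, offline, the cost of prompting is paid while building the training set rather than at inference time.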

The final .NET-based solution allows bulk processing of documents and provides a web interface for user interaction. FastAPI is used to serve the fine-tuned models. The results are stored in an MS SQL database.

Beyond performing on par with the regex-based system in recognizing citations, the proposed solution is more robust: it handles deviations such as typing errors and new citation formats more easily. It also recognizes the parts of a citation more accurately and with more granular labels.

Item Type: Thesis (Other)
Subjects: Area of Application > Administration, Government
Area of Application > Web based
Technologies > Programming Languages > Python
Technologies > Databases > SQL
Technologies > Web > HTML
Metatags > IFS (Institute for Software)
Divisions: Bachelor of Science FHO in Informatik > Bachelor Thesis
Depositing User: OST Deposit User
Contributors: Thesis advisor: Purandare, Mitra (email: UNSPECIFIED)
Date Deposited: 18 Feb 2025 12:28
Last Modified: 18 Feb 2025 12:28
URI: https://eprints.ost.ch/id/eprint/1248
