Graftceva, Anastasiia (2024) Integration of Deep Computer Vision Foundation Models for Document Analysis. Other thesis, OST - Ostschweizer Fachhochschule.
Abstract
Integration of Deep Computer Vision Foundation Models for Document Analysis:
Enhancing Optical Character Recognition Using an OCR-free Transformer Model
This study explores the efficacy of a pre-trained transformer model from the open-source Hugging Face library applied in the domain of Optical Character Recognition (OCR), specifically to the task of extracting dates from scanned documents.
Early OCR technology concentrated on pattern recognition, using rule-based algorithms to identify letters and numbers by their distinct shapes. Deep learning greatly improved accuracy and the ability to handle more nuanced text and complex layouts, which, in combination with Large Language Models (LLMs), has made visual document understanding possible.
Approach: The conventional OCR approach follows two steps: first, a scanned document is OCRed with the help of an OCR engine such as Tesseract; the output is then processed using pattern matching and regular expressions, or, alternatively, an LLM trained for the specific field of application (see the sketch below). A major limitation of OCR engines, however, lies in their generic nature, which often brings challenges in accuracy and efficiency.
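To make the conventional two-step pipeline concrete, here is a minimal sketch using pytesseract and a regular expression. The input file name and the date pattern are illustrative assumptions, not artifacts from the thesis.

```python
import re
import pytesseract
from PIL import Image

# Step 1: OCR the scanned page with the Tesseract engine.
image = Image.open("scanned_document.png")  # hypothetical input file
text = pytesseract.image_to_string(image)

# Step 2: extract date-like strings with a regular expression.
# This pattern only covers a few common formats (e.g. 12.03.2024, 2024-03-12);
# real documents would need a broader set of patterns or an LLM instead.
date_pattern = re.compile(r"\b(\d{1,2}[./-]\d{1,2}[./-]\d{2,4}|\d{4}-\d{2}-\d{2})\b")
dates = date_pattern.findall(text)
print(dates)
```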
The OCR-free or pseudo-OCR approach instead relies on a single encoder-decoder transformer model that integrates the aforementioned two steps, making it an end-to-end solution that can be adjusted and fine-tuned for a specific field of application.
For this project I selected the OCR-free Document Understanding Transformer model (Donut), which was initially pre-trained on an extensive and varied collection of documents. I then fine-tuned it on targeted datasets of diverse sizes to assess the model's ability to read, understand and extract dates from images (a minimal inference sketch follows below). I evaluated the results based on accuracy and the model's adaptability to different document types and qualities, as well as different date formats.
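As an illustration of how such a Donut checkpoint is typically queried through the Hugging Face transformers library, a minimal inference sketch follows. The checkpoint name, the task prompt token "<s_dates>", the input file name and the output schema are assumptions made for illustration; the thesis' actual fine-tuned model and prompt may differ.

```python
import re
import torch
from PIL import Image
from transformers import DonutProcessor, VisionEncoderDecoderModel

# Load a Donut checkpoint; a fine-tuned date-extraction checkpoint would be
# loaded the same way (the base model name here is only a placeholder).
processor = DonutProcessor.from_pretrained("naver-clova-ix/donut-base")
model = VisionEncoderDecoderModel.from_pretrained("naver-clova-ix/donut-base")
model.eval()

# Encode the scanned page into pixel values for the vision encoder.
image = Image.open("scanned_document.png").convert("RGB")  # hypothetical input
pixel_values = processor(image, return_tensors="pt").pixel_values

# The task prompt token depends on how the model was fine-tuned;
# "<s_dates>" is an assumed token for a date-extraction task.
task_prompt = "<s_dates>"
decoder_input_ids = processor.tokenizer(
    task_prompt, add_special_tokens=False, return_tensors="pt"
).input_ids

# Autoregressively generate the structured output sequence.
with torch.no_grad():
    outputs = model.generate(
        pixel_values,
        decoder_input_ids=decoder_input_ids,
        max_length=512,
        pad_token_id=processor.tokenizer.pad_token_id,
        eos_token_id=processor.tokenizer.eos_token_id,
        use_cache=True,
    )

# Decode the token sequence and convert it into a JSON-like dict,
# e.g. {"date": "12.03.2024"} if the model was trained on such a schema.
sequence = processor.batch_decode(outputs)[0]
sequence = sequence.replace(processor.tokenizer.eos_token, "")
sequence = sequence.replace(processor.tokenizer.pad_token, "")
sequence = re.sub(r"<.*?>", "", sequence, count=1).strip()  # drop the task prompt
print(processor.token2json(sequence))
```

Fine-tuning for date extraction would follow the same pattern, with the decoder trained to emit the structured date annotations assembled for the project's datasets.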
Conclusion: The results of the study are encouraging, achieving an average accuracy of 75% on the somewhat limited training and test datasets meticulously assembled for fine-tuning. The OCR-free approach undoubtedly shows promise in performing atomic tasks on images such as extracting dates. However, its efficacy could be significantly enhanced by incorporating a wider variety of document types and date formats. Additionally, adapting it to manage scenarios with zero, one, or multiple dates in a single image is likely necessary. Data engineering has emerged as a crucial element, even in this proof-of-concept stage.
| Item Type | Thesis (Other) |
|---|---|
| Subjects | Area of Application > Image/Video Processing; Technologies > Programming Languages > Python |
| Divisions | Bachelor of Science FHO in Informatik > Student Research Project |
| Depositing User | OST Deposit User |
| Contributors | Thesis advisor: Lehmann, Marco |
| Date Deposited | 16 May 2024 11:56 |
| Last Modified | 16 May 2024 11:56 |
| URI | https://eprints.ost.ch/id/eprint/1167 |