Integration of Deep Computer Vision Foundation Models for Document Interpretation and Anonymisation

Havrilla, Marc (2024) Integration of Deep Computer Vision Foundation Models for Document Interpretation and Anonymisation. Other thesis, OST - Ostschweizer Fachhochschule.

HS 2023 2024-BA-EP-Havrilla-AI Integration of Deep Computer Vision Foundation Models fo.pdf - Supplemental Material


Abstract

Visual Document Understanding (VDU) models, whether combined with Optical Character Recognition (OCR) or OCR-free, offer businesses and institutions a great opportunity to digitalise their processes and improve workflows. Digitalisation is progressing. However, several challenges remain to be resolved: acquiring sufficient know-how to integrate VDU models, complying with data protection regulations, and identifying the processes where VDU models offer the most significant benefit.

The main goal of the work is to analyse and evaluate the practicality and appropriateness of available VDU models for processing documents (e.g. PDFs of scanned documents) and to demonstrate them in a Proof-of-Concept (POC) application. Even though some regulatory aspects, especially regarding anonymisation, are discussed in the work, the developed application does not aspire to be compliant with regulations.

During this work, two areas were identified in which a tool that extracts text from an image, identifies relevant entities of personal information and anonymises them is beneficial. First, the anonymisation of medical documents makes more data available for research and educational purposes. The second is data leakage prevention, where detecting client data in screenshots would lower the risk of data breaches.

Various tools exist to extract text from an image. In the scope of this project, three tools have been integrated, namely Tesseract, Amazon Textract and OpenAI GPT-4V(ision). The application extracts the text of uploaded documents or images and provides the user with the resulting text from all three tools. The user can then select the text with the best quality. Afterwards, a Named Entity Recognition (NER) Transformer model (i.e., the bert-base-NER model) is used to identify the names of persons in the extracted text. The last step is the pseudonymisation of the entities. A randomly generated unique string replaces each entity in the text, so that a person cannot be identified based on the name in the text.
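The pseudonymisation step described above can be illustrated with a minimal sketch. It is not the code of the thesis application: it assumes the person names have already been returned by the NER model as a list of strings, and the function name `pseudonymise` and the `PERSON_` prefix are hypothetical choices made for this example.

```python
import re
import secrets


def pseudonymise(text, person_names):
    """Replace each person name in text with a random, unique placeholder.

    Returns the pseudonymised text and the name-to-placeholder mapping.
    Assumption: person_names is the list of names found by an NER model.
    """
    mapping = {}
    used = set()
    # Longest names first, so "Anna Meier" is replaced before "Anna".
    for name in sorted(set(person_names), key=len, reverse=True):
        if not name:
            continue
        token = f"PERSON_{secrets.token_hex(4)}"
        while token in used:  # regenerate on the (unlikely) collision
            token = f"PERSON_{secrets.token_hex(4)}"
        used.add(token)
        mapping[name] = token
    if not mapping:
        return text, mapping
    # Match whole names only, never fragments inside longer words.
    alternatives = "|".join(re.escape(name) for name in mapping)
    pattern = re.compile(rf"(?<!\w)(?:{alternatives})(?!\w)")
    return pattern.sub(lambda m: mapping[m.group(0)], text), mapping


if __name__ == "__main__":
    sample = "Dr. Anna Meier met Anna and Peter Keller."
    result, table = pseudonymise(sample, ["Anna Meier", "Anna", "Peter Keller"])
    print(result)
```

The same name always maps to the same placeholder within one call, which keeps the text readable. Whether the mapping is stored or discarded afterwards determines if the replacement can be reversed; the abstract does not state which option the application uses.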

Another feature of the application is the evaluation of the OCR accuracy. The user can upload an additional ground truth file, which is then compared with the text extracted from the uploaded images. To calculate the OCR accuracy, the Jaro similarity string comparison algorithm is used. Furthermore, the NER model can also be tested by uploading the expected entities of the document in a separate file. The test then shows how many of the provided entities have been found in the extracted text.
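Both evaluation steps can be sketched in a few lines. The following is a plain-Python illustration of the standard Jaro similarity and of a simple entity check, not the implementation used in the thesis. The function names are hypothetical, and the case-insensitive substring match in `entity_recall` is an assumption, since the abstract does not specify how entities are matched.

```python
def jaro_similarity(s1, s2):
    """Standard Jaro similarity between two strings, in the range [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters count as matching only within this distance of each other.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        start = max(0, i - window)
        end = min(len2, i + window + 1)
        for j in range(start, end):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    out_of_order = 0
    k = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def entity_recall(extracted_text, expected_entities):
    """Return (found, total) for the expected entities in the text.

    Assumption: an entity counts as found if it occurs as a
    case-insensitive substring of the extracted text.
    """
    haystack = extracted_text.lower()
    found = sum(1 for entity in expected_entities if entity.lower() in haystack)
    return found, len(expected_entities)


if __name__ == "__main__":
    print(round(jaro_similarity("MARTHA", "MARHTA"), 4))
    print(entity_recall("Anna Meier visited Bern", ["Anna Meier", "Peter Keller"]))
```

Jaro similarity rewards matching characters that are close in position and penalises transpositions, so it tolerates small OCR errors. On long documents the quadratic matching loop becomes slow, so comparing line by line or using an optimised library may be preferable.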

It is impressive how powerful today's text extraction and NER models have become. However, during the work, it became clear that they are not yet ready to use off the shelf. No tool works perfectly, so errors propagate to subsequent processing steps, and the outputs of the tools are not standardised. To overcome such limitations, text extraction and entity recognition should be performed by a single model that is also fine-tuned on the specific document types.

Item Type: Thesis (Other)
Subjects: Topics > Cloud Computing
Area of Application > Image/Video Processing
Technologies > Programming Languages > Python
Technologies > Databases > PostgreSQL
Technologies > Security
Technologies > Web
Technologies > Virtualization > Docker
Divisions: Bachelor of Science FHO in Informatik > Bachelor Thesis
Depositing User: OST Deposit User
Contributors: Thesis advisor: Lehmann, Marco (Email: UNSPECIFIED)
Date Deposited: 16 May 2024 11:59
Last Modified: 16 May 2024 11:59
URI: https://eprints.ost.ch/id/eprint/1163
