Title

Steiner Goes Big Data: Validating Database Entries in Large Databases

Description

In medicine, patient lists are often maintained by hand and later serve as the basis for retrospective research projects. For such a study, the IT department is expected to export medical information for the listed patients automatically via SQL (for example, laboratory data before and after an operation). In practice, however, these lists turn out to contain transcription errors: because of transposed digits, for example, a patient ID may not belong to any patient, or no operation may have taken place on the stated date. An automated validity check of these lists, run as a preprocessing step before the data export, would save many manual checks and help answer the question: am I finding no lab data because my SQL query is wrong, or because the input data is erroneous?

The current implementation of such a tool can already fulfil this task. The user connects to the reference database and uploads the CSV file to be checked. A mapping must then be configured that specifies which columns of the CSV file correspond to which columns of the database tables. Using the Steiner algorithm, the tool calculates a join sequence over all tables involved, so that foreign key relationships can be used to check whether, for example, a case number really belongs to a patient ID. If no exact match is found, suggestions are made via Levenshtein distances (case number F123 does not belong to a patient, but there is a clavicle operation for a patient in case number F213 on 20.02.2022).
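The two steps described above can be illustrated with a minimal Python sketch. It is not the tool's actual implementation: the function names (`levenshtein`, `suggest`, `join_tables`) and the example schema are hypothetical, and the join step uses a simple greedy shortest-path heuristic on the foreign-key graph as a stand-in for whichever Steiner algorithm the tool really uses.

```python
from collections import deque


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # delete char_a
                current[j - 1] + 1,                    # insert char_b
                previous[j - 1] + (char_a != char_b),  # substitute or match
            ))
        previous = current
    return previous[-1]


def suggest(value: str, known_values: list[str], max_distance: int = 2) -> list[str]:
    """Return known values within max_distance of value, closest first."""
    scored = [(levenshtein(value, known), known) for known in known_values]
    return [known for distance, known in sorted(scored) if distance <= max_distance]


def join_tables(foreign_keys: list[tuple[str, str]], required: list[str]) -> set[str]:
    """Greedily grow a connected set of tables that covers all required tables.

    Assumption: foreign keys are treated as undirected, unweighted edges.
    """
    neighbours: dict[str, set[str]] = {}
    for left, right in foreign_keys:
        neighbours.setdefault(left, set()).add(right)
        neighbours.setdefault(right, set()).add(left)

    tree = {required[0]}
    remaining = set(required[1:]) - tree
    while remaining:
        # Breadth-first search starting from every table already in the tree.
        parent = {table: None for table in tree}
        queue = deque(sorted(tree))
        reached = None
        while queue:
            table = queue.popleft()
            if table in remaining:
                reached = table
                break
            for neighbour in sorted(neighbours.get(table, ())):
                if neighbour not in parent:
                    parent[neighbour] = table
                    queue.append(neighbour)
        if reached is None:
            raise ValueError(f"no join path to {sorted(remaining)}")
        # Walk back towards the tree, adding the intermediate tables.
        while reached is not None:
            tree.add(reached)
            reached = parent[reached]
        remaining -= tree
    return tree


# Hypothetical schema: each pair is a foreign key relationship between two tables.
schema = [("patient", "case"), ("case", "operation"),
          ("case", "lab_result"), ("patient", "address")]
print(suggest("F123", ["F213", "F999", "A777"]))
print(join_tables(schema, ["patient", "operation"]))
```

Note that plain Levenshtein distance counts the transposition F123 to F213 as two substitutions, which is why the sketch uses a default threshold of 2. Comparing every CSV value against every database value in this way is also what makes the approach expensive on large tables.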

For databases with fewer than 100 tables (or fewer than 10 tables to be joined) and approx. 10,000 entries, the processing time is reasonable. For the clinical database (approx. 5,000 tables, some with billions of rows), computing Levenshtein distances and possible join sequences is not practicable. The aim of the master's thesis is to implement an optimisation for large volumes of data.

Requirements

Databases, SQL

Person working on it

Daniel Preciado-Marquez

Category

Master's thesis