Exploiting the User Interface for Data Integration in Effectiveness Research

Research Focus

Health care effectiveness research asks which medical interventions work in real-life practice, rather than in the highly controlled environment of clinical trials. This type of research depends on health data captured during the course of clinical care. In our work supporting data analysts engaged in effectiveness research with a single data source, we observe that their crucial first step, selecting medical reports to include in a study, can be very difficult because the underlying databases are often structured in a generic manner where attribute names (and their corresponding attribute values) are stored as data. (Such a generic database design typically makes the software easy to extend, for example to accommodate additional fields for each customer.) Moreover, much of the semantics of the original data, such as the wording of the question on the user interface (UI) that captured the data, is not accessible by looking solely at the underlying database schema. Data analysts typically rely on a database expert to write complex SQL queries; this introduces a level of indirection that can interfere with the analysts’ ability to control which cases are selected and to understand precisely what the data values mean.

We propose to solve these problems by creating a query interface based directly on the UI, along with a database transformation mechanism that allows the underlying database to keep its generic structure. We believe that the UI is the pre-eminent source of semantics for the data and that domain experts are very likely to understand the details of the original UI, including the detailed medical information and terminology. In our approach, domain experts express their queries directly in terms of that UI. Thus, we provide detailed access to data semantics and we eliminate the reliance on database experts.
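To make the difficulty with generic database designs concrete, the following is a minimal sketch, not the project's actual schema or software. It assumes a hypothetical table `report_value` in which each row stores one attribute name and one value for a medical report; the attribute names, values, and helper functions are invented for illustration. Selecting reports that satisfy two conditions already requires grouping and counting, which is the kind of query analysts usually hand off to a database expert.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a hypothetical generic table."""
    conn = sqlite3.connect(":memory:")
    # Attribute names are stored as data, one row per (report, attribute).
    conn.execute(
        "CREATE TABLE report_value (report_id INTEGER, attribute TEXT, value TEXT)"
    )
    rows = [
        (1, "smoking_status", "current"),
        (1, "alcohol_use", "none"),
        (2, "smoking_status", "never"),
        (2, "alcohol_use", "heavy"),
        (3, "smoking_status", "current"),
        (3, "alcohol_use", "heavy"),
    ]
    conn.executemany("INSERT INTO report_value VALUES (?, ?, ?)", rows)
    return conn


def select_reports(conn, criteria):
    """Return ids of reports that match every attribute/value pair in criteria."""
    if not criteria:
        raise ValueError("at least one criterion is required")
    # One OR-ed clause per criterion; a report qualifies only if it
    # matched as many distinct attributes as there are criteria.
    clauses = " OR ".join("(attribute = ? AND value = ?)" for _ in criteria)
    params = [item for pair in criteria.items() for item in pair]
    sql = (
        "SELECT report_id FROM report_value "
        f"WHERE {clauses} "
        "GROUP BY report_id "
        "HAVING COUNT(DISTINCT attribute) = ? "
        "ORDER BY report_id"
    )
    return [row[0] for row in conn.execute(sql, params + [len(criteria)])]


if __name__ == "__main__":
    conn = build_demo_db()
    print(select_reports(conn, {"smoking_status": "current", "alcohol_use": "heavy"}))
```

In a conventional design with one column per attribute, the same selection would be a simple `WHERE` clause. A UI-based query interface would, under these assumptions, let the analyst state the criteria in terms of the original form fields and generate a query of this shape behind the scenes.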

Integrating data from multiple sources allows effectiveness research to better represent the diverse spectrum of clinical practice. Analysts can then ask questions about subpopulations or evaluate rare disorders, and the answers are more generalizable when they come from these larger and more diverse datasets. The challenge is to provide analysts with data integrated from multiple, diverse database sources. We focus on the case where the underlying databases have not (yet) been integrated. We believe that data analysts engaged in effectiveness research must see the original data in the medical reports and must understand the semantics of that data. Further, each study essentially involves its own series of data integration decisions: the analyst must look at the original data and decide the best way to classify it (often in an intentionally lossy manner), and different studies may call for different classifications. Our goal is to provide a UI-based query interface for each source and to allow analysts to describe their classification decisions for each attribute of interest in the study, for each contributing data source.
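As an illustration of per-study, per-source classification decisions, here is a minimal sketch under stated assumptions. The source names, attribute names, values, and study categories are all hypothetical, and the mapping shown is only one choice an analyst might make for one study; it is not taken from the project's prototype software.

```python
from typing import Dict, Optional

# Hypothetical classification for one study: for each source and each
# attribute of interest, map original values to study categories.
# The mapping is intentionally lossy (several values share a category).
CLASSIFICATION: Dict[str, Dict[str, Dict[str, str]]] = {
    "source_a": {
        "smoking_status": {
            "current": "smoker",
            "former": "non-smoker",
            "never": "non-smoker",
        }
    },
    "source_b": {
        "tobacco_use": {
            "daily": "smoker",
            "occasional": "smoker",
            "quit": "non-smoker",
            "none": "non-smoker",
        }
    },
}


def classify(
    source: str,
    attribute: str,
    value: str,
    rules: Dict[str, Dict[str, Dict[str, str]]] = CLASSIFICATION,
) -> Optional[str]:
    """Return the study category for a value, or None if no decision exists."""
    return rules.get(source, {}).get(attribute, {}).get(value)


if __name__ == "__main__":
    print(classify("source_a", "smoking_status", "former"))
    print(classify("source_b", "tobacco_use", "occasional"))
```

Returning `None` for an unmapped value, rather than guessing a category, keeps the decision with the analyst, who can inspect the original data and extend the mapping. A different study could supply a different `rules` dictionary over the same sources.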


Prototype software is available for each of the different aspects of this project.




This project is funded in part by the National Institutes of Health, National Library of Medicine grant no. R21LM009550. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Institutes of Health.