Raw Data Is Impossible to Analyze in Its Native Format

Research using observational data (“real-world data”) has grown substantially over the last decade. New data sources have become available, and regulatory agencies have increasingly supported real-world evidence as part of their decision-making. However, the raw data available from data providers is virtually impossible to analyze in its original form. As a result, the enormous challenge of reviewing and manipulating the raw data creates a substantial barrier to conducting quality observational research.

The Real Miracle in Research

To borrow a thought from Sidney Harris, the figure below shows how most researchers view the process of structuring their data for research.

Data miracle graphic

Breaking it Down

At a simple level, the manipulation of data can be broken into two tasks. The first is to organize the raw data by cataloging its contents and the relationships among the tables, examining it for completeness and correctness, and mapping it to a structure suitable for research. (For regulatory purposes, the organization process itself needs to be documented as well.) The second task is to identify the relevant cohort to study, extract the relevant records, and create analysis-ready datasets. Once this two-step data pipeline is implemented, researchers can conduct the analyses of interest.
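
As a rough illustration of the two tasks, here is a minimal Python sketch. Everything in it is an assumption made for the example: the raw field names (pat_id, dx_cd, svc_dt), the Diagnosis structure, and the functions organize and build_analysis_dataset are hypothetical and do not come from any particular data provider or toolkit.

```python
from dataclasses import dataclass
from datetime import date

# Hypothetical raw rows, shaped the way a data provider might deliver them.
RAW_CLAIMS = [
    {"pat_id": "A1", "dx_cd": "E11.9", "svc_dt": "2020-03-01"},
    {"pat_id": "A1", "dx_cd": "I10", "svc_dt": "2020-04-15"},
    {"pat_id": "B2", "dx_cd": "I10", "svc_dt": "2021-01-10"},
    {"pat_id": "", "dx_cd": "E11.9", "svc_dt": "2021-02-02"},  # missing patient id
]


@dataclass(frozen=True)
class Diagnosis:
    """Hypothetical research-ready record for one diagnosis event."""
    person_id: str
    code: str
    service_date: date


def organize(raw_rows):
    """Task 1: check rows for completeness and map them to a research structure."""
    organized, rejected = [], []
    for row in raw_rows:
        if not row["pat_id"] or not row["dx_cd"]:
            rejected.append(row)  # kept so the cleaning step can be documented
            continue
        organized.append(
            Diagnosis(row["pat_id"], row["dx_cd"], date.fromisoformat(row["svc_dt"]))
        )
    return organized, rejected


def build_analysis_dataset(diagnoses, code_prefix):
    """Task 2: identify the cohort and extract each member's earliest qualifying date."""
    first_seen = {}
    for dx in diagnoses:
        if dx.code.startswith(code_prefix):
            current = first_seen.get(dx.person_id)
            if current is None or dx.service_date < current:
                first_seen[dx.person_id] = dx.service_date
    return [{"person_id": p, "index_date": d} for p, d in sorted(first_seen.items())]


diagnoses, rejected = organize(RAW_CLAIMS)
dataset = build_analysis_dataset(diagnoses, "E11")
```

Keeping the rejected rows, rather than silently dropping them, is one simple way to support the documentation requirement mentioned above.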

The Starting Point

For the first task, the only option for most researchers is to create an ad hoc data structure specific to the research project at hand. While adequate for one-off studies, this approach is prone to error, difficult to modify, and impossible to scale. More flexible approaches to data organization involve either writing a query translation layer or moving data into a common data model structure. Both approaches map the original data structure into a new, more efficient structure, but data held in a common data model can be manipulated more efficiently, reliably, and repeatably by modern database platforms.
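
To make the common data model idea concrete, the sketch below loads two differently shaped sources into one shared table and then runs a single cohort query against it. It is only an illustration under stated assumptions: the source layouts, the MAPPINGS dictionary, the diagnosis table, and the load_common_model function are all hypothetical, and the table is not taken from any published common data model. SQLite (from the Python standard library) stands in for a database platform.

```python
import sqlite3

# Hypothetical source layouts: two providers describe the same facts differently.
SOURCE_A = [{"pat_id": "A1", "dx_cd": "E11.9", "svc_dt": "2020-03-01"}]
SOURCE_B = [{"member": "B2", "icd10": "E11.65", "date_of_service": "2021-01-10"}]

# Per-source column mappings into one shared (hypothetical) target structure.
MAPPINGS = {
    "source_a": {"person_id": "pat_id", "code": "dx_cd", "event_date": "svc_dt"},
    "source_b": {"person_id": "member", "code": "icd10", "event_date": "date_of_service"},
}


def load_common_model(conn, source_name, rows):
    """Map one source's rows into the shared diagnosis table."""
    mapping = MAPPINGS[source_name]
    conn.executemany(
        "INSERT INTO diagnosis (person_id, code, event_date, source) VALUES (?, ?, ?, ?)",
        [
            (r[mapping["person_id"]], r[mapping["code"]], r[mapping["event_date"]], source_name)
            for r in rows
        ],
    )


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE diagnosis (person_id TEXT, code TEXT, event_date TEXT, source TEXT)"
)
load_common_model(conn, "source_a", SOURCE_A)
load_common_model(conn, "source_b", SOURCE_B)

# One query now works across both sources because they share a structure.
cohort = conn.execute(
    "SELECT person_id, MIN(event_date) FROM diagnosis "
    "WHERE code LIKE 'E11%' GROUP BY person_id ORDER BY person_id"
).fetchall()
```

Once the mapping is written for a source, every later study can reuse the same query logic, which is where the repeatability comes from.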

The Problem and the Solution

This highlights the fundamental problem for all researchers: there are no useful off-the-shelf, open-source toolkits that help researchers process raw healthcare data into a usable form for research. Hence, we are working to build such a toolkit and to make it available for others.