What is the Problem?

With Medicare data moving to the Virtual Research Data Center (VRDC) exclusively, researchers are being forced into a new way of conducting research. Some have expressed concerns that research could take longer, or that quality control could suffer. But these concerns simply highlight the long-standing struggle to write effective code for conducting observational research.

Part of the problem is that observational research is an insular process. To ensure patient privacy, data cannot easily be shared, which makes collaboration difficult. And because 60-80% of a research project is simply wrangling raw data to create analysis-ready datasets, data management is labor-intensive.

To complicate matters further, most data management practices are still stuck in the technology of the 1990s. One need look no further than the ubiquity of SAS or Stata for large-scale data management, the minimal use of relational databases and SQL, and the use of double data programming instead of modern approaches for code testing and version control. Lastly, many researchers conflate cohort creation with data analysis by trying to use the same software and coding practices for both parts of a project.

In short, the limited adoption of appropriate data management practices has led to a dead end in the era of data enclaves like the VRDC.

Where Do We Go?

With the advent of better tools for working with raw data, observational research needs to take advantage of fit-for-purpose technologies and apply them in appropriate ways. Standard data manipulation languages (e.g., SQL), relational databases (e.g., PostgreSQL), and data platforms (e.g., Databricks) make it efficient to extract analysis-ready datasets from terabytes of raw data.

Although tools exist to improve the efficiency of observational research, the solution is not as simple as changing software. The entire process needs to be redefined. And therein lies the primary issue: most observational researchers are not well-versed in the software development best practices that are required to move the field forward. Even more recent approaches to rethinking observational research (e.g., OMOP/OHDSI, FDA Sentinel, PCORnet, i2b2) still fail to address many of these challenges properly.

Software Development – Really?

By necessity, observational researchers dabble in software development as a byproduct of writing the code for a project. But code development is limited by the narrow scope of the project at hand. The iterative process of adapting code for each new project and evaluating the output to ensure correctness is second nature to researchers. But the prospect of working on the VRDC exposes the inefficiencies and limitations of this approach.

The root cause is that researchers are increasingly out of their element when using modern software (e.g., Databricks on the VRDC). Writing reusable and testable code for modern data platforms requires an understanding of software libraries, programming paradigms, and testing frameworks. However, the fundamentals of these tools are not part of the training for most researchers and statistical programmers.

To move research forward, researchers need to reconsider their approach and adopt a software development mindset. Viewed through this lens, cohort-building can be reduced to a set of repetitive tasks that are identical across different observational study designs. In other words, researchers need to think in terms of building a “data pipeline”.

A data pipeline requires clear specifications for the inputs and outputs of each part of the process. This approach has many benefits. Importantly, it includes creating specifications for how the raw data are organized, which can be accomplished in a variety of ways, including data models and database views. The key point related to the VRDC is that, once the structure of the raw data is defined, there is no need to touch the data to specify how to build the cohort and to create the analysis-ready datasets.
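
To make the idea of specifications concrete, here is a minimal sketch in Python. It is an illustration only, not our actual software: the names TableSpec, Step, and validate_pipeline, as well as the table and column names, are hypothetical. It shows that once the structure of the raw data is written down, the wiring of a pipeline can be checked without reading a single row of data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TableSpec:
    """Declared structure of a table: its name and column names (hypothetical)."""
    name: str
    columns: frozenset


@dataclass(frozen=True)
class Step:
    """One pipeline step, described only by what it reads and what it produces."""
    name: str
    source: str          # name of the table the step reads
    requires: frozenset  # columns the step needs from that table
    output: TableSpec    # table the step produces


def validate_pipeline(raw_tables, steps):
    """Check the pipeline against specifications only; return a list of problems."""
    known = {table.name: table for table in raw_tables}
    problems = []
    for step in steps:
        table = known.get(step.source)
        if table is None:
            problems.append(f"{step.name}: unknown table {step.source}")
        else:
            missing = step.requires - table.columns
            if missing:
                problems.append(f"{step.name}: {step.source} lacks {sorted(missing)}")
        # Later steps may read the table this step declares as its output.
        known[step.output.name] = step.output
    return problems


# Hypothetical raw-data specification and a two-step cohort pipeline.
claims = TableSpec("claims", frozenset({"patient_id", "diagnosis_code", "service_date"}))
steps = [
    Step("select_diagnoses", "claims",
         frozenset({"patient_id", "diagnosis_code", "service_date"}),
         TableSpec("qualifying_claims", frozenset({"patient_id", "service_date"}))),
    Step("assign_index_date", "qualifying_claims",
         frozenset({"patient_id", "service_date"}),
         TableSpec("cohort", frozenset({"patient_id", "index_date"}))),
]

problems = validate_pipeline([claims], steps)
print(problems)  # an empty list means every step's inputs are accounted for
```

A step that asks for a column no earlier step provides is reported as a problem before any code ever runs against the data, which is the kind of check that matters when access to the data is limited.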

By analogy, it is like using Google or Apple Maps. Once we have a database of GPS coordinates for all roads and addresses, and an understanding of basic driving rules, we can write software to create detailed, optimized directions without ever having to get in the car.

How Will That Help with Platforms like the VRDC?

Working on the VRDC means that researchers need to develop their cohort-creation code outside the VRDC (see this blog post). Once researchers can generate the code to create their analysis-ready datasets without needing simultaneous access to the raw data, working on an enclave like the VRDC becomes easier, faster, and cheaper. In fact, this kind of “offline” approach enables researchers to work on data in a consistent fashion anywhere it is stored – a powerful idea.
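
As a rough sketch of what "offline" can look like, the Python below turns a small cohort specification into SQL text without connecting to any database. It is not ConceptQL or Jigsaw; CohortSpec, render_sql, and the table and column names (including patient_id) are hypothetical and chosen only for illustration. Identifiers are inserted as-is, so a real implementation would need to validate them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CohortSpec:
    """A hypothetical, minimal cohort definition: codes observed in a date window."""
    table: str
    code_column: str
    codes: tuple
    date_column: str
    start: str  # ISO date, e.g. "2019-01-01"
    end: str


def render_sql(spec):
    """Build the SQL text for a cohort; no database connection is needed."""
    # Quote each code as a SQL string literal, doubling any embedded single quotes.
    code_list = ", ".join("'" + code.replace("'", "''") + "'" for code in spec.codes)
    return (
        f"SELECT patient_id, MIN({spec.date_column}) AS index_date\n"
        f"FROM {spec.table}\n"
        f"WHERE {spec.code_column} IN ({code_list})\n"
        f"  AND {spec.date_column} BETWEEN DATE '{spec.start}' AND DATE '{spec.end}'\n"
        f"GROUP BY patient_id"
    )


spec = CohortSpec(
    table="claims",
    code_column="diagnosis_code",
    codes=("E11.9", "E11.65"),
    date_column="service_date",
    start="2019-01-01",
    end="2019-12-31",
)
sql = render_sql(spec)
print(sql)
```

The generated text can be reviewed, versioned, and tested outside the enclave, then run wherever the data live.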

That Is Not Realistic. Or Is It?

At this point it would be reasonable to say, “but surely everyone can’t build their own validated software stack for the few studies they might do in a year; that is neither possible nor efficient.” I can tell you that, as a very small company, we have built exactly this system so we know it is possible. I can tell you that, having learned how to work “offline”, we are excited to work on platforms like the VRDC. And I can tell you that, as I write this, we are working on some VRDC-related demonstration projects to prove that this approach works. So, yes, it can be done. But I can also tell you that it involved a lot of hard lessons and about 10 years of work.

That’s Nice. But How Does that Help Anyone Else?

One of our core principles is to share our work with others. The key components of our research process have been publicly available via GitHub since they were created. This includes the Generalized Data Model, which defines how we organize raw data, and ConceptQL, the language we built to create, store, and share algorithms. We also made our software, Jigsaw, and our algorithm library publicly readable as of last year (see this blog post). And we are actively searching for a way to make Jigsaw usable by others to solve problems like working on the VRDC (see this blog post).


Observational researchers need to think outside the box that has enclosed observational research for the last 40 years. The tools are available, and we have laid much of the groundwork. We will continue to share our progress and help others conduct quality research in an evolving data landscape. If you have any questions, or if you want to collaborate, feel free to contact us.