Bringing Code to the Data

If an organization can store all the data it needs on its own servers, it can choose almost any software package or platform to manipulate the data for research purposes. But what happens if the organization isn’t allowed to store the data on its own servers? For example, most people who want to access the full Medicare data must do so within the CMS Virtual Research Data Center (VRDC) using either SAS or Databricks. Similarly, some countries restrict their data to servers that are accessible only from within the country, and some commercial data providers limit access to their own virtual data centers. While these restrictions can improve data security and reduce patient privacy risks, they create logistical challenges for researchers.

The Solution?

The solution is to bring project-specific software code to the data behind the firewall. But how does one write code for a project without touching the actual data? There are at least two options for solving the problem: synthetic data and code-generation software.

Synthetic Data

In theory, researchers can craft their project code against a synthetic, but realistic, copy of the data. This presupposes that someone will create and maintain an accessible, usable, privacy-protecting synthetic dataset for each data source of interest. Even then, working with large, unwieldy data can be a challenge. Imagine working with a synthetic version of the full Medicare data; that isn’t a job for anyone without access to suitable computing resources. All in all, synthetic data is a possible solution, but it isn’t particularly efficient because it still relies on hand-writing code to manipulate potentially large datasets.
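To make this concrete, here is a minimal sketch of drafting code against a stand-in dataset, assuming a simple claims-like table. The column names, code values, and distributions are hypothetical and are not meant to mirror any real data layout.

    import numpy as np
    import pandas as pd

    # Hypothetical, schema-matched stand-in for a claims table. A real synthetic
    # dataset would need far more care to mimic realistic distributions and to
    # protect privacy; this is only enough to exercise project code.
    rng = np.random.default_rng(42)
    n_rows = 10_000  # tiny compared with the real data

    synthetic_claims = pd.DataFrame({
        "patient_id": rng.integers(1, 2_000, size=n_rows),
        "claim_date": pd.to_datetime("2020-01-01")
                      + pd.to_timedelta(rng.integers(0, 365, size=n_rows), unit="D"),
        "diagnosis_code": rng.choice(["E11.9", "I10", "J45.909"], size=n_rows),
        "paid_amount": rng.gamma(shape=2.0, scale=150.0, size=n_rows).round(2),
    })

    # Project code can be drafted against this stand-in and later pointed at the
    # real tables behind the firewall.
    print(synthetic_claims.head())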

Code-Generation Software: Jigsaw

In our opinion, the better option is to use Jigsaw. It may not be obvious, but Jigsaw doesn’t need to touch the actual raw data to do its job. How is that possible? Jigsaw’s job is to write all the SQL queries for creating an analysis-ready dataset from the raw, organized data. As long as the data is organized using a known data model, Jigsaw can write a script containing the required SQL queries. A user can then run the script on the server containing the organized data, and the script saves the analysis-ready datasets on that same server in a location specified by the user.
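As an illustration of this workflow, here is a minimal sketch in Python, using hypothetical table, column, and file names (this is not Jigsaw’s actual output). The point is only that the generated artifact is an ordinary SQL script that can be carried behind the firewall and run where the organized data lives.

    from pathlib import Path

    # Hypothetical example of a generated query: build an analysis-ready cohort
    # table from a fact-style clinical table, without touching the data here.
    generated_sql = """
    CREATE TABLE analysis_ready.example_cohort AS
    SELECT f.patient_id,
           MIN(f.start_date) AS index_date
    FROM   organized.clinical_fact AS f
    WHERE  f.concept_code IN ('E11.9', 'E11.65')   -- cohort-defining codes
    GROUP  BY f.patient_id;
    """

    # The script is saved, reviewed, and then executed on the secure server
    # (for example, through the platform's SQL client) by someone with access.
    Path("build_analysis_ready_dataset.sql").write_text(generated_sql)

Because the output is plain SQL, nothing beyond a SQL client should be needed inside the secure environment to run it.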

Because Jigsaw itself can be cloud-based, researchers can collaborate on the specification of the analysis-ready data from anywhere. The protocol and its algorithms remain publicly available and shareable. As before, all algorithms are explicitly documented in the protocol summary document, which can be generated and shared.

Or Both?

In some scenarios, it could make sense to use both approaches. One could use Jigsaw to create an analysis-ready dataset, and then create a synthetic version of it for developing the analyses themselves.

Challenges

For this to work, the data on the server needs to be reorganized into a data model. We strongly prefer the Generalized Data Model because it brings together some of the best ideas of the i2b2 and OMOP/OHDSI data models. By this, we mean combining the i2b2 idea of storing all clinical information in one central “fact” table with the OMOP/OHDSI vocabularies.
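Here is a simplified sketch of the central-fact-table idea, using hypothetical table and column names rather than the actual Generalized Data Model definitions: every clinical event sits in one fact table, and its meaning comes from joining to a standard vocabulary (concept) table.

    # Hypothetical query against a simplified fact table plus vocabulary table;
    # the real Generalized Data Model has more tables and columns than this.
    fact_table_query = """
    SELECT f.patient_id,
           f.start_date,
           c.vocabulary_id,     -- e.g., ICD10CM, NDC, CPT4
           c.concept_code,
           c.concept_name
    FROM   clinical_fact AS f
    JOIN   concept       AS c
           ON c.concept_id = f.concept_id
    WHERE  c.concept_code = 'E11.9';   -- single example diagnosis lookup
    """

    print(fact_table_query)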

Once a transformation to a data model is created, it is relatively easy to share the code so that others can transform their own raw data into the same data model. As of early 2024, we have code for the current SEER-Medicare linked data. Next, we will create the transformation code for the Medicare data on the VRDC. After that, we are open to suggestions.

Conclusion

Working on other data platforms doesn’t have to be challenging. We think Jigsaw can be a powerful tool for working with observational data wherever it resides, without compromising data security or patient privacy.