<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://outcomesinsights.github.io/feed.xml" rel="self" type="application/atom+xml" /><link href="https://outcomesinsights.github.io/" rel="alternate" type="text/html" /><updated>2026-03-06T23:40:59+00:00</updated><id>https://outcomesinsights.github.io/feed.xml</id><title type="html">Jigsaw by Outcomes Insights, Inc.</title><subtitle>Jigsaw is software for creating analysis-ready datasets from healthcare data.  This blog covers topics related to developing software for the generation of  real-world evidence from real-world data.  </subtitle><entry><title type="html">A Better Way to Build Code Sets</title><link href="https://outcomesinsights.github.io/data/2026/03/06/a-better-way-to-build-code-sets.html" rel="alternate" type="text/html" title="A Better Way to Build Code Sets" /><published>2026-03-06T00:00:00+00:00</published><updated>2026-03-06T00:00:00+00:00</updated><id>https://outcomesinsights.github.io/data/2026/03/06/a-better-way-to-build-code-sets</id><content type="html" xml:base="https://outcomesinsights.github.io/data/2026/03/06/a-better-way-to-build-code-sets.html"><![CDATA[<p>If you work with healthcare claims data, you’ve built code sets. Maybe it was a list of ICD-10-CM codes for diabetes, or heart failure, or chronic kidney disease. And if you have, you know that <em>there has to be a better way to do this.</em>  All of the tools that exist are for billing purposes, not for research.</p>

<p>The traditional approach is some combination of clinical knowledge, keyword searches through code descriptions, and peer-reviewed literature. You open a reference table, search for “diabetes,” scroll through hundreds of results, and try to decide which codes belong and which don’t. It’s tedious, it’s error-prone, and two analysts working on the same condition can produce different code sets.  That divergence matters — it can affect who ends up in the study cohort and the inferences drawn from the results.</p>

<h2 id="the-problem-is-bigger-than-it-looks">The Problem Is Bigger Than It Looks</h2>

<p>ICD-10-CM contains more than 70,000 codes. They’re organized hierarchically, which helps, but the hierarchy is deep and full of clinical nuance. Take type 2 diabetes: the E11 family alone contains codes for ophthalmic, neurological, circulatory, renal, and dermatological complications, each with multiple subcategories. A keyword search for “kidney” won’t find codes described as “renal” or “nephropathy.” And even when you find the right family, you still have to decide which subcategories are relevant to your specific research question.</p>

<p>There are workarounds — reusing code sets from prior studies, borrowing from published literature, or building from institutional templates. They work, but they’re hard to audit, hard to reproduce, and tend to drift over time as analysts make small, undocumented adjustments.</p>

<h2 id="semantic-search-changes-the-game">Semantic Search Changes the Game</h2>

<p>We’ve been building a tool that takes a fundamentally different approach. Instead of searching code <em>descriptions</em> with keywords, it searches code <em>meaning</em> with natural language.</p>

<p>The system encodes every ICD-10-CM code as a high-dimensional vector — a mathematical representation of its clinical meaning, informed by the code’s description, its position in the ICD-10 hierarchy, and LLM-generated clinical context. A query like “type 2 diabetes with kidney complications” returns codes that are <em>semantically close</em> to the intended meaning, not just codes that happen to contain matching words.</p>
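<p>The retrieval step can be illustrated with a toy example. The embeddings below are made-up three-dimensional values chosen purely for illustration; the real system uses high-dimensional vectors produced by an embedding model, but the ranking logic — cosine similarity between a query vector and each code vector — is the same idea.</p>

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Made-up 3-d embeddings for illustration only; real vectors are
# high-dimensional and come from an embedding model.
code_vectors = {
    "E11.21": [0.9, 0.8, 0.1],  # type 2 diabetes with nephropathy
    "E11.31": [0.9, 0.1, 0.8],  # type 2 diabetes with ophthalmic complication
    "N18.30": [0.2, 0.9, 0.1],  # chronic kidney disease, stage 3, unspecified
}
query = [0.8, 0.7, 0.1]  # "type 2 diabetes with kidney complications"

# Rank codes by semantic closeness to the query.
ranked = sorted(code_vectors, key=lambda c: cosine(query, code_vectors[c]),
                reverse=True)
```

<p>In this toy setup, E11.21 (diabetic nephropathy) ranks first even though its description contains neither “kidney” nor the query’s exact wording — precisely the behavior a keyword search cannot provide.</p>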

<p>This is the same class of technology behind modern search engines and recommendation systems, applied to a very specific and important problem: identifying the right ICD-10-CM codes for a research project.</p>

<h2 id="not-just-search--a-feedback-loop">Not Just Search — a Feedback Loop</h2>

<p>Semantic search is a good starting point, but a starting point isn’t a finished code set. The real value comes from what happens next.</p>

<p>The tool groups results into ICD-10-CM code families defined by their three-character category (e.g., E11), scores each family by relevance (a measure of how close it is to the original query), and identifies a natural cutoff between clearly relevant and probably irrelevant families.  From there, the user reviews individual codes and marks them as include, exclude, or “unsure.”  Each include or exclude decision feeds back into the algorithm through a process called Rocchio relevance feedback: the system adjusts its internal representation of the query based on those choices, pushing away from excluded codes and toward included ones, then re-searches with a refined understanding of what the user is looking for.</p>
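<p>The feedback step follows the classic Rocchio update: the new query vector is the old one, plus a weighted centroid of the included codes, minus a weighted centroid of the excluded codes. The weights below (alpha, beta, gamma) are conventional textbook defaults, not necessarily the parameters our tool uses.</p>

```python
def rocchio(query, included, excluded, alpha=1.0, beta=0.75, gamma=0.15):
    """One round of Rocchio relevance feedback: move the query vector
    toward the centroid of included codes and away from the centroid of
    excluded ones.  alpha/beta/gamma are textbook defaults, used here as
    placeholders for whatever weights the tool actually tunes."""
    new_q = [alpha * q for q in query]
    for i in range(len(query)):
        if included:
            new_q[i] += beta * sum(v[i] for v in included) / len(included)
        if excluded:
            new_q[i] -= gamma * sum(v[i] for v in excluded) / len(excluded)
    return new_q

# Hypothetical 2-d example: one included code, one excluded code.
refined = rocchio([1.0, 0.0], included=[[0.0, 1.0]], excluded=[[1.0, -1.0]])
```

<p>Re-running the similarity search with the refined vector is what makes each review round more precise than the last.</p>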

<p>This creates an iterative refinement loop. With each round, the results get more precise. New families surface that the user might not have considered. Irrelevant ones drop away. The process is reproducible and auditable — a researcher can explain to a reviewer exactly how and why each code ended up in the set.</p>

<p>There is also a graphical view that shows how close the codes are to one another in meaning, giving visual feedback on what is happening.</p>

<h2 id="ai-assistance-where-it-helps">AI Assistance Where It Helps</h2>

<p>We’ve layered optional LLM features on top of this core workflow. Before the user starts reviewing codes, an AI pre-check can classify subcategories as likely relevant, likely irrelevant, or uncertain — giving a head start on the manual review. A query advisor can flag unexpected code families and ask clarifying questions about intent. And an explain feature can break down what a specific code represents and why it might or might not belong.</p>

<p>These features are genuinely optional. The semantic search and Rocchio refinement work without any LLM. But when available, they reduce the cognitive load of sorting through hundreds of subcategories.</p>

<h2 id="harmonization-comparing-code-sets">Harmonization: Comparing Code Sets</h2>

<p>We also built a harmonization tool for a problem that sometimes arises in practice: there are two code sets for the same condition, built by different analysts or drawn from different sources, and there is a need to reconcile them. The tool shows the user what’s in both sets, what’s unique to each, and lets the user build a merged set with full visibility into the differences.</p>
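<p>At its core, the comparison is straightforward set arithmetic. A minimal sketch, using hypothetical code lists:</p>

```python
# Hypothetical dementia code sets from two sources (illustrative only).
quan_2005 = {"F01", "F02", "F03", "G30", "G31.1"}
internal  = {"F01", "F02", "F03", "G30", "G31.0", "G31.83"}

shared        = quan_2005 & internal   # in both sets
only_quan     = quan_2005 - internal   # unique to the first set
only_internal = internal - quan_2005   # unique to the second set
merged        = quan_2005 | internal   # candidate harmonized set
```

<p>The tool’s value is less in the set operations themselves than in surfacing the differences for review, so the merged set reflects deliberate decisions rather than a blind union.</p>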

<h2 id="case-study-dementia-in-the-charlson-comorbidity-index">Case Study: Dementia in the Charlson Comorbidity Index</h2>

<p>To make this concrete, consider a code set that thousands of researchers use routinely: the dementia algorithm from the Charlson Comorbidity Index.</p>

<p>The ICD-10 version of the Charlson index comes from Quan et al. (2005), a carefully conducted study that translated the original ICD-9-CM algorithms into ICD-10 using a multi-step consensus process across research groups in three countries. The dementia algorithm they published includes codes from four ICD-10 families: F00 (dementia in Alzheimer’s disease), F01–F03 (vascular, other, and unspecified dementia), G30 (Alzheimer’s disease), and G31.1 (senile degeneration of brain). It’s a reasonable list — and it’s been cited thousands of times.</p>

<p>But for people working with US claims data coded in ICD-10-CM, this list has problems.</p>

<p>First, <strong>F00 doesn’t exist in ICD-10-CM</strong>. The US clinical modification never adopted that code. In ICD-10-CM, Alzheimer’s disease with documented dementia is typically coded with a G30.- code for the underlying Alzheimer’s disease plus an F02.80/F02.81- code for the dementia manifestation. A researcher who takes the Quan codes at face value and searches for F00 in US claims data will find zero patients — not because there are no Alzheimer’s patients, but because the code doesn’t exist in the system they’re searching.</p>

<p>Second, and more consequentially, the algorithm misses entire categories of dementia that are now explicitly coded in ICD-10-CM:</p>

<ul>
  <li><strong>G31.0x — Frontotemporal dementia</strong>, including Pick’s disease and other frontotemporal variants. This is a clinically significant dementia subtype with its own family of codes.</li>
  <li><strong>G31.83 — Neurocognitive disorder with Lewy bodies</strong>, which captures dementia with Lewy bodies as a distinct neurodegenerative disorder, and was introduced in ICD-10-CM after the Quan paper was published.</li>
</ul>

<p>These aren’t obscure edge cases. Lewy body dementia accounts for an estimated <a href="https://www.cambridge.org/core/journals/canadian-journal-of-neurological-sciences/article/prevalence-and-incidence-of-dementia-with-lewy-bodies-a-systematic-review/5A720B4E79E47546545FCC3B7612A771">3–7% of all dementia cases</a>. Frontotemporal dementia is <a href="https://memory.ucsf.edu/dementia/ftd">a leading cause of dementia in people under age 60</a>. Missing them means missing patients — exactly the kind of systematic gap that introduces bias into observational studies.</p>

<h3 id="what-the-tool-finds">What the Tool Finds</h3>

<p>When we run the query “dementia” through our Code Set Builder, semantic search surfaces exactly the families you’d expect: F01, F02, F03, and G30 — the core of the Quan algorithm. But it also identifies G31, scored highly because codes like G31.0x (frontotemporal dementia) and G31.83 (neurocognitive disorder with Lewy bodies) are semantically close to the query. The tool doesn’t just match the word “dementia” in code descriptions — it understands that these conditions <em>are</em> dementias, even when the description reads “frontotemporal disease” or uses other clinical terminology.</p>

<p>From there, the AI pre-check step can help sort out which G31 subcategories belong (G31.0x, G31.83 — yes; G31.2, spinocerebellar degeneration — probably not), and iterative refinement allows fine-tuning based on the specific research question. The result is a code set that’s more comprehensive than Quan, specifically adapted to ICD-10-CM, and documented through every step.</p>

<p>This isn’t a criticism of Quan et al. — their work was rigorous and remains foundational. The point is that code sets built by manual translation twenty years ago inevitably have gaps, especially when the underlying coding system has continued to evolve. A tool that starts from semantic meaning rather than code-to-code translation can identify what manual processes miss.</p>

<h2 id="why-this-matters">Why This Matters</h2>

<p>Code set construction is foundational work in observational research. It determines who’s in your study population and who’s not. Getting it wrong — in either direction — compromises everything downstream. Yet we’ve been treating it as an artisanal process, relying on individual expertise and ad hoc methods.</p>

<p>This tool doesn’t replace clinical judgment. It augments it with semantic understanding, algorithmic feedback, and AI assistance, producing code sets that are more comprehensive, more precise, and more defensible than what most of us can build by hand.</p>

<p>We’re using it internally at Outcomes Insights, and we’re excited about where it’s headed.  In fact, it has inspired us to start similar projects to identify medications (NDC and HCPCS codes) and procedures (CPT and HCPCS codes) based on disease areas.</p>

<h2 id="preview">Preview</h2>

<p><img src="/images/code-set-builder-dementia.jpg" alt="Code Set Builder showing dementia search results with G31 family expanded" /></p>]]></content><author><name>Mark Danese</name></author><category term="data" /><summary type="html"><![CDATA[If you work with healthcare claims data, you’ve built code sets. Maybe it was a list of ICD-10-CM codes for diabetes, or heart failure, or chronic kidney disease. And if you have, you know that there has to be a better way to do this. All of the tools that exist are for billing purposes, not for research.]]></summary></entry><entry><title type="html">Observational Research in a Box</title><link href="https://outcomesinsights.github.io/data/2025/04/07/observational-research-in-a-box.html" rel="alternate" type="text/html" title="Observational Research in a Box" /><published>2025-04-07T00:00:00+00:00</published><updated>2025-04-07T00:00:00+00:00</updated><id>https://outcomesinsights.github.io/data/2025/04/07/observational-research-in-a-box</id><content type="html" xml:base="https://outcomesinsights.github.io/data/2025/04/07/observational-research-in-a-box.html"><![CDATA[<p><img src="/images/research_in_box.png" alt="Research in box graphic" /></p>

<p>Welcome to the new way to access data – the data enclave.  This is a private data platform, designed and controlled by the data provider, on which researchers must do all their work.  In this post we share some initial impressions from our experience with three separate data enclaves, and how we are moving forward in this new world.</p>

<h3 id="high-level-observations">High-level observations</h3>
<p>One project is based on the Medicare Virtual Research Data Center (VRDC) and the other two are based on commercial data enclaves.  The good news is that R and Python are available in all three.  The bad news is that SAS, Stata, and RStudio are not universally available.  Also, one provider offers Databricks, one uses Snowflake, and the last uses Athena - three different underlying data platforms.</p>

<p>The data documentation has been better than the documentation from providers who allow us to have the data on our servers.  However, no provider supplies sufficiently detailed information about each column (variable) in the data.  We still have to do a lot of data discovery on the raw data before we can use it (e.g., identifying missing data, discovering unexpected values, and determining relationships among the tables).</p>

<p>For privacy reasons, all three enclaves have tight restrictions on what can be taken off the system.  This makes it challenging for the research team to review the data, and adds a delay since all output has to be reviewed.</p>

<h3 id="opportunity">Opportunity</h3>
<p>As many know, we have our Jigsaw software which can generate the SQL to create analysis-ready datasets from data in a common data model.  We can generate SQL in most dialects used by data platforms, so we can readily adapt our research process to data enclaves.</p>

<p>But before we can construct cohorts using Jigsaw, we need to organize the data into a data model.  This process is time-consuming even on an on-premises system.  To allow us to support multiple data platforms in a reasonable timeframe, we started building tools to shorten the process from 6-12 weeks to something more like 1-3 weeks.</p>

<p>Side note:  We’d like to say we can do it in hours, but we are being realistic given the heterogeneity of data, the platform architectures, and the time it takes to process large amounts of data, even with powerful hardware.  The good news is that once the code is written for a data source, it is substantially faster to process another instance of the same data.</p>

<h3 id="progress">Progress</h3>
<p>We have just completed our first version of tools to facilitate the process.  At a high level these tools do the following:</p>

<ol>
  <li>Systematically explore every table and column in the raw data and characterize it in a Shiny app.
    <ul>
      <li>This includes missing values, ranges, typical values, unusual values (which are often data errors), and the identifiers (keys) that are used to connect records in different tables.</li>
      <li>We also incorporate variable definitions from the data dictionary and show all relevant information in one interface.</li>
    </ul>
  </li>
  <li>
    <p>Process the output of step 1 to create the mapping details required to move the raw data into the Generalized Data Model.  This includes automating as much of the process as possible, and conducting logic checks to ensure that the mappings make sense.</p>
  </li>
  <li>Use the mappings in step 2 to automatically generate an SQL script to implement the complete transformation process on the target data platform.</li>
</ol>
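<p>As an illustration of step 1, the per-column characterization can be sketched in a few lines. This uses plain Python for portability; the actual tooling is written in R (with a Shiny interface) and runs against the enclave’s database, so treat this purely as a sketch of the idea.</p>

```python
def profile_column(values):
    """Characterize one column: counts, missingness, distinct values,
    and a numeric range when the column is numeric.  Plain-Python sketch
    of the profiling idea, not the production R tooling."""
    present = [v for v in values if v is not None]
    profile = {
        "n": len(values),
        "n_missing": len(values) - len(present),
        "n_distinct": len(set(present)),
    }
    if present and all(isinstance(v, (int, float)) for v in present):
        profile["min"] = min(present)
        profile["max"] = max(present)
    return profile

# Example: a raw claims column with one missing value.
summary = profile_column([1, 2, 2, None])
```

<p>Profiles like this, joined with data-dictionary definitions, are what feed the mapping logic in steps 2 and 3.</p>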

<p>Importantly, these tools are being built in R as much as possible so they can be brought into any data enclave.</p>

<h3 id="why-go-through-all-of-this">Why go through all of this?</h3>
<p>In terms of working on data enclaves, the following are some of the key benefits of a set of flexible tools for data management and cohort building:</p>

<ol>
  <li>The raw data stays on the data platform at all times, as required by data providers.</li>
  <li>The data provider’s schema remains private, as required by some data providers.</li>
  <li>The raw data is fully documented and can be compared to the reorganized data as part of a quality control process.  This is important for research done for regulatory purposes, as well as being a good idea generally.</li>
</ol>

<h3 id="implications">Implications</h3>
<p>The most important implication is that we can use the same process for every project.  We can organize the data into one data model.  We can use one library of algorithms and one protocol builder for all research (Jigsaw).  And because Jigsaw has its own data model for analysis-ready data, we can write efficient analysis code and standardize our analyses.  In other words, we have the same path from start to finish regardless of whether the data is on premises or in a data enclave.</p>

<p>Another benefit is that all this infrastructure can be made open-source and available to anybody to use – competitors, data providers, commercial organizations, government researchers, etc.  We are not ready for that yet, but that is where we are headed.  If you want to know more, or to collaborate with us, don’t hesitate to contact us. </p>]]></content><author><name>Mark Danese</name></author><category term="data" /><summary type="html"><![CDATA[]]></summary></entry><entry><title type="html">The Miracle of Data</title><link href="https://outcomesinsights.github.io/data/2024/08/10/the_miracle-of-data.html" rel="alternate" type="text/html" title="The Miracle of Data" /><published>2024-08-10T00:00:00+00:00</published><updated>2024-08-10T00:00:00+00:00</updated><id>https://outcomesinsights.github.io/data/2024/08/10/the_miracle-of-data</id><content type="html" xml:base="https://outcomesinsights.github.io/data/2024/08/10/the_miracle-of-data.html"><![CDATA[<h3 id="raw-data-is-impossible-in-its-native-format">Raw Data is Impossible in its Native Format</h3>

<p>Research using observational data (“real-world data”) has grown substantially over the last decade.  New data sources have become available, and regulatory agencies have increasingly supported real-world evidence as part of their decision-making.  However, the raw data available from data providers is virtually impossible to analyze in its original form.  As a result, the enormous challenge of reviewing and manipulating the raw data creates a substantial barrier to conducting quality observational research.</p>

<h3 id="the-real-miracle-in-research">The Real Miracle in Research</h3>
<p>Borrowing a thought from <a href="https://www.researchgate.net/figure/Then-a-Miracle-Occurs-Copyrighted-artwork-by-Sydney-Harris-Inc-All-materials-used-with_fig2_302632920">Sydney Harris</a>, the figure below is how most researchers view the process of structuring their data for research.</p>

<p><img src="/images/a_data_miracle_occurs.png" alt="Data miracle graphic" /></p>

<h3 id="breaking-it-down">Breaking it Down</h3>
<p>At a simple level, the manipulation of data can be broken into two tasks.  The first is to organize the raw data by cataloging its contents and the relationships among the tables, examining it for completeness and correctness, and mapping it to a structure suitable to research.  (For regulatory purposes, the organization process itself needs to be documented as well.)  The second task is identifying the relevant cohort to study, extracting the relevant records, and creating analysis-ready datasets.  Once this two-step data pipeline is implemented, researchers can then conduct the analyses of interest.</p>

<h3 id="the-starting-point">The Starting Point</h3>
<p>For the first task, the only option for most researchers is to create an ad hoc data structure specific to the research project at hand.  While adequate for one-off studies, this approach is prone to error, difficult to modify, and impossible to scale.  More flexible approaches to data organization involve either writing a query translation layer or moving data into a common data model structure.  Both approaches map the original data structure into a new, more efficient structure, but the common data model approach can be manipulated more efficiently, reliably, and repeatably by modern database platforms.</p>

<h3 id="the-problem-and-the-solution">The Problem and the Solution</h3>
<p>This highlights the fundamental problem for all researchers:  there are no useful off-the-shelf, open-source toolkits that help researchers process raw healthcare data into a usable form for research. Hence, we are working to build such a toolkit and to make it available for others.</p>]]></content><author><name>Mark Danese</name></author><category term="data" /><summary type="html"><![CDATA[Raw Data is Impossible in its Native Format]]></summary></entry><entry><title type="html">The Challenge of Working on the VRDC</title><link href="https://outcomesinsights.github.io/data/2024/04/16/the-challenge-of-working-on-the-vrdc.html" rel="alternate" type="text/html" title="The Challenge of Working on the VRDC" /><published>2024-04-16T00:00:00+00:00</published><updated>2024-04-16T00:00:00+00:00</updated><id>https://outcomesinsights.github.io/data/2024/04/16/the-challenge-of-working-on-the-vrdc</id><content type="html" xml:base="https://outcomesinsights.github.io/data/2024/04/16/the-challenge-of-working-on-the-vrdc.html"><![CDATA[<h3 id="what-is-the-problem">What is the Problem?</h3>

<p>With Medicare data moving to the Virtual Research Data Center (VRDC) exclusively, researchers are being forced into a new way of conducting research.  Some have expressed concerns that research could take longer, or that quality control could suffer.  But these concerns simply highlight the long-standing struggle to write effective code for conducting observational research.</p>

<p>Part of the problem is that observational research is an insular process.  To ensure patient privacy, data cannot easily be shared.  This makes collaboration difficult.  And because 60-80% of a research project is simply wrangling raw data to create analysis-ready datasets, data management is labor-intensive.</p>

<p>To complicate matters further, most data management practices are still stuck in the technology of the 1990s.  One need look no further than the ubiquity of SAS or Stata for large-scale data management, the minimal use of relational databases and SQL, and the use of double data programming instead of modern approaches to code testing and version control.  Lastly, many researchers conflate cohort creation with data analysis by trying to use the same software and coding practices for both parts of a project.</p>

<p>In short, the limited adoption of appropriate data management practices has led to a dead-end in the era of data enclaves like the VRDC.</p>

<h3 id="where-do-we-go">Where Do We Go?</h3>

<p>With the advent of better tools for working with raw data, observational research needs to take advantage of fit-for-purpose technologies and apply them in appropriate ways.  Standard data manipulation languages (e.g., SQL), relational databases (e.g., PostgreSQL), and data platforms (e.g., Databricks) make it efficient to extract analysis-ready datasets from terabytes of raw data.</p>

<p>Although tools exist to improve the efficiency of observational research, the solution is not as simple as changing software.  The entire process needs to be redefined.  And therein lies the primary issue – most observational researchers are not well-versed in the software development best practices that are required to move the field forward.  Even more recent approaches to rethinking observational research (e.g., OMOP/OHDSI, FDA Sentinel, PCORnet, i2b2, etc.) still fail to address many of these challenges properly.</p>

<h3 id="software-development--really">Software Development – Really?</h3>

<p>By necessity, observational researchers dabble in software development as a byproduct of writing the code for a project.  But code development is limited by the narrow scope of the project at hand.  The iterative process of adapting code for each new project and evaluating the output to ensure correctness is second-nature to researchers.  But the prospect of working on the VRDC exposes the inefficiencies and limitations of this approach.</p>

<p>The root cause is that researchers are increasingly out of their element when using modern software (e.g., Databricks on the VRDC).  Writing reusable and testable code for modern data platforms requires an understanding of software libraries, programming paradigms, and testing frameworks.  However, the fundamentals of these tools are not part of the training for most researchers and statistical programmers.</p>

<p>To move research forward, researchers need to reconsider their approach and adopt a software development mindset.  Viewed through this lens, cohort-building can be reduced to a set of repetitive tasks that are identical across different observational study designs.  In other words, researchers need to think in terms of building a “data pipeline”.</p>

<p>A data pipeline requires clear specifications for the inputs and outputs for each part of the process.  There are many benefits to this approach.  Importantly, this includes creating specifications for organizing the raw data.  This can be accomplished in a variety of ways including using data models, database views, or other approaches.  The key point related to the VRDC is that, once the structure of the raw data is defined, there is no need to touch the data to specify how to build the cohort and to create the analysis-ready datasets.</p>

<p>By analogy, it is like using Google or Apple Maps.  Once we have a database of GPS coordinates for all roads and addresses, and an understanding of basic driving rules, we can write software to create detailed, optimized directions without ever having to get in the car.</p>

<h3 id="how-will-that-help-with-platforms-like-the-vrdc">How Will That Help with Platforms like the VRDC?</h3>

<p>Working on the VRDC means that researchers need to develop their cohort-creation code outside the VRDC (see <a href="https://jigsaw.io/data/2024/01/21/bringing-code-to-the-data.html">this blog post</a>).  Once researchers can generate the code to create their analysis-ready datasets without needing simultaneous access to the raw data, working on an enclave like the VRDC becomes easier, faster, and cheaper.  In fact, this kind of “offline” approach enables researchers to work on data in a consistent fashion anywhere it is stored – a powerful idea.</p>

<h3 id="that-is-not-realistic-or-is-it">That Is Not Realistic. Or Is It?</h3>

<p>At this point it would be reasonable to say, “but surely everyone can’t build their own validated software stack for the few studies they might do in a year; that is neither possible nor efficient.”  I can tell you that, as a very small company, we have built exactly this system so we know it is possible.  I can tell you that, having learned how to work “offline”, we are excited to work on platforms like the VRDC.  And I can tell you that, as I write this, we are working on some VRDC-related demonstration projects to prove that this approach works.  So, yes, it can be done.  But I can also tell you that it involved a lot of hard lessons and about 10 years of work.</p>

<h3 id="thats-nice--but-how-does-that-help-anyone-else">That’s Nice.  But How Does that Help Anyone Else?</h3>

<p>One of our core principles is to share our work with others.  The key components of our research process have been publicly available via GitHub since they were created.  This includes the <a href="https://github.com/outcomesinsights/generalized_data_model">Generalized Data Model</a>, which defines how we organize raw data, and <a href="https://github.com/outcomesinsights/conceptql">ConceptQL</a>, the language we built to create, store, and share algorithms.  We also made our software, <a href="https://public.jigsaw.io">Jigsaw</a>, and our algorithm library publicly readable as of last year (see <a href="https://jigsaw.io/algorithms/2023/07/11/algorithm-library.html">this blog post</a>).  And we are actively searching for a way to make Jigsaw usable by others to solve problems like working on the VRDC (see <a href="https://jigsaw.io/data/2024/01/30/moving-jigsaw-to-the-real-world.html">this blog post</a>).</p>

<h3 id="conclusion">Conclusion</h3>

<p>Observational researchers need to think outside the box that has enclosed observational research for the last 40 years.  The tools are available, and we have laid much of the groundwork.  We will continue to share our progress and help others conduct quality research in an evolving data landscape.  If you have any questions, or if you want to collaborate, feel free to contact us.</p>]]></content><author><name>Mark Danese</name></author><category term="data" /><summary type="html"><![CDATA[What is the Problem?]]></summary></entry><entry><title type="html">Moving Jigsaw to the Real World</title><link href="https://outcomesinsights.github.io/data/2024/01/30/moving-jigsaw-to-the-real-world.html" rel="alternate" type="text/html" title="Moving Jigsaw to the Real World" /><published>2024-01-30T00:00:00+00:00</published><updated>2024-01-30T00:00:00+00:00</updated><id>https://outcomesinsights.github.io/data/2024/01/30/moving-jigsaw-to-the-real-world</id><content type="html" xml:base="https://outcomesinsights.github.io/data/2024/01/30/moving-jigsaw-to-the-real-world.html"><![CDATA[<h1 id="moving-jigsaw-to-the-real-world">Moving Jigsaw to the Real World</h1>

<p>Last year <a href="https://jigsaw.io/algorithms/2023/07/11/algorithm-library.html">we made our algorithm library</a> <a href="https://public.jigsaw.io">publicly available</a>.  In 2024, we are working to make our entire Jigsaw application freely accessible for creating and sharing protocols, creating and sharing algorithms, and generating the code to create analysis-ready datasets from observational data.</p>

<h3 id="doesnt-this-exist-already">Doesn’t This Exist Already?</h3>

<p>Solutions for streamlining observational research fall into one of the following approaches:</p>

<ul>
  <li>Building an internal repository of implementation code</li>
  <li>Implementing and supporting an open-source software system internally</li>
  <li>Licensing access to a commercial platform</li>
</ul>

<p>However, these options also have important limitations:</p>

<ul>
  <li>In-house solutions are generally software-specific and have limited version control</li>
  <li>Open-source platforms require staff for installation, support, and updates</li>
  <li>Commercial platforms can be expensive black boxes with little visibility into the underlying processes</li>
  <li>No solution allows researchers to collaborate across institutions that use different approaches.</li>
</ul>

<p>In short, despite improvements in software capabilities, researchers still reside in their own silos, unable to collaborate efficiently with one another.</p>

<h3 id="what-is-the-alternative">What is the Alternative?</h3>

<p>The ideal solution is a freely accessible, cloud-based software application.</p>

<p>As an analogy, think about GitHub.  For anyone unfamiliar with GitHub, the following is the <a href="https://en.wikipedia.org/wiki/GitHub">one-sentence Wikipedia description</a>:</p>

<blockquote>
  <p>GitHub, Inc. is an AI-powered developer platform that allows developers to create, store, manage and share their code.</p>
</blockquote>

<p>So, imagine if we alter that slightly for Jigsaw to read as follows:</p>

<blockquote>
  <p>Jigsaw is a cloud-based platform that allows observational researchers to create, store, manage and share their research protocols and to generate the code for implementing them against data.</p>
</blockquote>

<h3 id="conclusion">Conclusion</h3>

<p>Despite thousands of hours and millions of dollars being invested in potential solutions, the fundamental problem remains – research methods are still too fragmented.  It is about time we solved this problem.  We think Jigsaw is the solution.</p>]]></content><author><name>Mark Danese</name></author><category term="data" /><summary type="html"><![CDATA[Moving Jigsaw to the Real World]]></summary></entry><entry><title type="html">Bringing Code to the Data</title><link href="https://outcomesinsights.github.io/data/2024/01/21/bringing-code-to-the-data.html" rel="alternate" type="text/html" title="Bringing Code to the Data" /><published>2024-01-21T00:00:00+00:00</published><updated>2024-01-21T00:00:00+00:00</updated><id>https://outcomesinsights.github.io/data/2024/01/21/bringing-code-to-the-data</id><content type="html" xml:base="https://outcomesinsights.github.io/data/2024/01/21/bringing-code-to-the-data.html"><![CDATA[<h2 id="bringing-code-to-the-data">Bringing Code to the Data</h2>

<p>If an organization can store all the data it needs on its own servers, it can choose almost any software package or platform to manipulate the data for research purposes.  But what happens if the organization isn’t allowed to store the data on its own servers?  For example, most people who want to access the full Medicare data must do so within the <a href="https://resdac.org/cms-virtual-research-data-center-vrdc">CMS Virtual Research Data Center (VRDC)</a> using either SAS or <a href="https://www.databricks.com/">Databricks</a>.  Similarly, some countries limit their data to servers only accessible within their own country, and some commercial data providers limit access to their own virtual data centers.  While this can improve data security and reduce patient privacy risks, it creates logistical challenges for researchers.</p>

<h2 id="the-solution">The Solution?</h2>

<p>The solution is to bring project-specific software code to the data behind the firewall.  But how does one write code for a project without touching the actual data?  There are at least two options for solving the problem: synthetic data and code-generation software.</p>

<h3 id="synthetic-data">Synthetic Data</h3>

<p>In theory, researchers can craft their project code against a synthetic, but realistic, copy of the data. This presupposes that someone will create and maintain an accessible, usable, privacy-protecting synthetic dataset for each data source of interest.  Even then, working with large, unwieldy data can be a challenge.  Imagine working with a synthetic version of the full Medicare data – that isn’t a job for anyone without access to suitable computing resources.  All in all, synthetic data represents a possible solution, but it isn’t particularly efficient because it still relies on hand-writing code for manipulating potentially large datasets.</p>

<h3 id="code-generation-software-jigsaw">Code-Generation Software: Jigsaw</h3>

<p>In our opinion, the better option is to use <a href="https://public.jigsaw.io">Jigsaw</a>.  It may not be obvious, but Jigsaw doesn’t need to touch the actual raw data to do its job.  How is that possible?  Jigsaw’s job is to write all the SQL queries for creating an analysis-ready dataset from the raw, organized data.  As long as the data is organized using a known data model, Jigsaw can write a script containing the required SQL queries.  A user can then run the script on a server containing the organized data, and the script can save the analysis-ready data sets on the same server in a location specified by the user.</p>

<p>Because Jigsaw itself can be cloud-based, researchers can collaborate on the specification of the analysis-ready data from anywhere.  The protocol and its algorithms remain publicly available and shareable.  As before, all <a href="https://public.jigsaw.io/algorithms">algorithms</a> are explicitly documented in the protocol summary document that can be generated (and also shared).</p>

<h3 id="or-both">Or Both?</h3>

<p>In some scenarios, it could make sense to use both approaches.  One could use Jigsaw to create an analysis-ready dataset, and then create a synthetic version for developing the analyses themselves.</p>

<h3 id="challenges">Challenges</h3>

<p>In order for this to work, the data on the server needs to be reorganized into a data model.  We strongly prefer the <a href="https://github.com/outcomesinsights/generalized_data_model">Generalized Data Model</a> because it brings together some of the best ideas of the <a href="https://www.i2b2.org/">i2b2</a> and <a href="https://ohdsi.org">OMOP/OHDSI</a> data models.  By this, we are referring to combining the i2b2 idea of storing all clinical information in one central “fact” table with the OMOP/OHDSI vocabularies.</p>

<p>Once a transformation to a data model is created, it is relatively easy to share the code for others to transform their raw data into a data model.  As of early 2024, we have code for the current <a href="https://healthcaredelivery.cancer.gov/seermedicare/">SEER-Medicare</a> linked data.  Next, we will create the transformation code for the Medicare data on the VRDC.  After that, we are open to suggestions.</p>

<h2 id="conclusion">Conclusion</h2>

<p>Working on other data platforms doesn’t have to be challenging.  We think Jigsaw can be a powerful tool for working with observational data wherever it is without compromising data security or patient privacy.</p>]]></content><author><name>Mark Danese</name></author><category term="data" /><summary type="html"><![CDATA[Bringing Code to the Data]]></summary></entry><entry><title type="html">Spark of Genius</title><link href="https://outcomesinsights.github.io/data/2023/10/16/spark-of-genius.html" rel="alternate" type="text/html" title="Spark of Genius" /><published>2023-10-16T00:00:00+00:00</published><updated>2023-10-16T00:00:00+00:00</updated><id>https://outcomesinsights.github.io/data/2023/10/16/spark-of-genius</id><content type="html" xml:base="https://outcomesinsights.github.io/data/2023/10/16/spark-of-genius.html"><![CDATA[<p>File this post under “late to the game”, but I just completed a project where I used <a href="https://spark.apache.org/">Apache Spark</a> for the first time and I’m blown away.  Here’s my experience.</p>

<h2 id="no-cluster-needed">No Cluster Needed</h2>

<p>Perhaps it was my bias from working with <a href="https://impala.apache.org/">Apache Impala</a> a few years back, but I just assumed that Spark was going to need <a href="https://hadoop.apache.org/">Hadoop</a> set up on a cluster of servers.  I didn’t want to spend my time getting all that set up just to play around with Spark, so I never bothered with it before.</p>

<p>Turns out, Spark has a rather robust single-machine setup.  Even better, there’s an R library that took care of all the setup for me.</p>

<h2 id="sparklyr-makes-spark-simple"><code class="language-plaintext highlighter-rouge">sparklyr</code> Makes Spark Simple</h2>

<p>The R package <a href="https://cran.r-project.org/package=sparklyr"><code class="language-plaintext highlighter-rouge">sparklyr</code></a> made my foray into Spark dead simple.  The package happily installed Spark for me and provided functions to easily start and stop a Spark instance from within my R scripts.</p>

<p>Pro tip: by default <code class="language-plaintext highlighter-rouge">sparklyr</code> limits Spark to a single core when it starts up an instance.  You can <a href="">change to multiple cores</a> pretty easily and it makes a world of difference in terms of performance.</p>

<h2 id="dplyr-and-spark-is-a-powerful-combination"><code class="language-plaintext highlighter-rouge">dplyr</code> and Spark Is a Powerful Combination</h2>

<p><code class="language-plaintext highlighter-rouge">sparklyr</code> gave me access to the tables I loaded into Spark.  <a href="https://dplyr.tidyverse.org/"><code class="language-plaintext highlighter-rouge">dplyr</code></a> gave me the ability to manipulate and query those tables via <a href="https://dbplyr.tidyverse.org/"><code class="language-plaintext highlighter-rouge">dbplyr</code></a>.</p>

<p><code class="language-plaintext highlighter-rouge">dplyr</code> is amazing.  Rather than hand-writing <a href="https://spark.apache.org/sql/">Spark SQL</a>, <code class="language-plaintext highlighter-rouge">dplyr</code> provides a set of functions that let me join tables, add where clauses, and manipulate the columns returned from Spark.</p>

<h2 id="great-performance">Great Performance</h2>

<p>My project was to explore replacing an existing part of our data pipeline.  Using Spark, our processing time went from days to hours.</p>

<h2 id="more-spark-in-the-future">More Spark in the Future</h2>

<p>After this successful venture into Spark territory, I’m pretty sure I’ll be employing Spark in future projects.</p>]]></content><author><name>Ryan Duryea</name></author><category term="data" /><summary type="html"><![CDATA[File this post under “late to the game”, but I just completed a project where I used Apache Spark for the first time and I’m blown away. Here’s my experience.]]></summary></entry><entry><title type="html">Parquet Makes My Day</title><link href="https://outcomesinsights.github.io/data/2023/07/31/parquet-makes-my-day.html" rel="alternate" type="text/html" title="Parquet Makes My Day" /><published>2023-07-31T00:00:00+00:00</published><updated>2023-07-31T00:00:00+00:00</updated><id>https://outcomesinsights.github.io/data/2023/07/31/parquet-makes-my-day</id><content type="html" xml:base="https://outcomesinsights.github.io/data/2023/07/31/parquet-makes-my-day.html"><![CDATA[<p>I was introduced to <a href="https://en.wikipedia.org/wiki/Apache_Parquet">Parquet</a> format back in 2015.  At the time, I was tasked with working with an <a href="https://impala.apache.org/">Impala</a>-based system and it was using Parquet to store its data.  My impression was Parquet was some technology built upon <a href="https://en.wikipedia.org/wiki/Apache_Hadoop#HDFS">HDFS</a> and required some sort of distributed, <a href="https://en.wikipedia.org/wiki/Apache_Hadoop">Hadoop</a>-based system to work with it.  That impression was <em>not</em> accurate.</p>

<p>A few years later, the <a href="https://arrow.apache.org/">Arrow</a> package for <a href="https://www.r-project.org/">R</a> came out and it had support for, much to my surprise, Parquet.  Suddenly Parquet seemed to be freed from HDFS and could plunk huge swaths of data into cute little folders directly in my computer’s filesystem.  What a powerful tool I suddenly had.  It was a great alternative to all the other <a href="/data/2023/05/23/data-is-dirty.html">dirty data formats I dealt with in the past</a>.</p>

<p>Since then, we’ve standardized our data pipeline tools on Parquet.  It’s a great format for data storage and transfer.  It has a true standard for storing data and their types, unlike <a href="https://en.wikipedia.org/wiki/Comma-separated_values">CSV</a>.  It compresses data down very nicely, leaving the files smaller than most other proprietary or CSV formats we’ve seen.  It is fast to read and write, making it a great intermediate storage for our data pipelines.  And, thanks to Arrow, all the languages we use in our tools have first-class support for working with Parquet.  Also, shout out to <a href="https://duckdb.org/">DuckDB</a> for being the <a href="https://sqlite.org/index.html">SQLite</a> of Parquet.</p>

<p>This last winter, one of our data vendors actually offered to send us data in Parquet format.  It was <em>amazing</em>.  <strong>AMAZING</strong>.  We downloaded the files from them and within a minute I was able to query the data, get counts of rows, types of columns, and confirm that we had, indeed, received all the records we expected.  It was unlike <a href="/data/2023/05/23/data-is-dirty.html">any other data ingestion experience I’ve ever had</a>.  No unzipping, no CSV tools, no proprietary formats.  These pre-ETL steps were almost completely unnecessary and we could move right into transformation of the data into <a href="https://github.com/outcomesinsights/generalized_data_model">GDM</a>.</p>

<p>When I first encountered it, I never thought I’d be such a fan of Parquet, but I now sincerely hope that it continues to become <em>the</em> standard for transferring claims data between vendors and researchers.</p>]]></content><author><name>Ryan Duryea</name></author><category term="data" /><summary type="html"><![CDATA[I was introduced to Parquet format back in 2015. At the time, I was tasked with working with an Impala-based system and it was using Parquet to store its data. My impression was Parquet was some technology built upon HDFS and required some sort of distributed, Hadoop-based system to work with it. That impression was not accurate.]]></summary></entry><entry><title type="html">Sharing Is a Good Thing</title><link href="https://outcomesinsights.github.io/algorithms/2023/07/21/sharing-is-a-good-thing.html" rel="alternate" type="text/html" title="Sharing Is a Good Thing" /><published>2023-07-21T00:00:00+00:00</published><updated>2023-07-21T00:00:00+00:00</updated><id>https://outcomesinsights.github.io/algorithms/2023/07/21/sharing-is-a-good-thing</id><content type="html" xml:base="https://outcomesinsights.github.io/algorithms/2023/07/21/sharing-is-a-good-thing.html"><![CDATA[<p>There was a lot to discuss in last week’s post on the public version of Jigsaw algorithm library, so I will try to make this one short.  It is about creating and sharing algorithm (and protocol) summaries with Jigsaw.</p>

<h2 id="background">Background</h2>
<p>From inside any algorithm in the library, people can create a stand-alone algorithm summary.  This is a <strong>stand-alone webpage</strong> that can be downloaded and saved like any other document (click the green “Download” button at the top).  Importantly, the webpage also contains a <strong>CSV file</strong> with all algorithm codes.  The CSV file can be accessed via a link at the bottom of the summary.</p>

<p>To create the summary from inside any algorithm in the library, simply click the “Action” button at the top, and select “Create Algorithm Summary”.</p>

<h2 id="two-algorithm-sharing-options">Two Algorithm Sharing Options</h2>

<h3 id="link-to-the-algorithm-within-the-jigsaw-algorithm-library">Link to the algorithm within the Jigsaw algorithm library</h3>

<p>The first option is a link to the algorithm in the Jigsaw library itself.  For example, below is a link to an algorithm in the library:</p>

<p><a href="https://public.jigsaw.io/algorithms/646c44ce-be1a-4f04-b4e6-ecc5d7b44e13">https://public.jigsaw.io/algorithms/646c44ce-be1a-4f04-b4e6-ecc5d7b44e13</a></p>

<p>From that link, people can view algorithm attributes, a full-screen algorithm diagram, and the ConceptQL statement that stores all the algorithm specifications.  People can even generate an SQL implementation of the algorithm from within the diagram by choosing “Source” at the top of the screen.</p>

<h3 id="link-directly-to-the-stand-alone-algorithm-summary">Link directly to the stand-alone algorithm summary</h3>

<p>The second option is a direct link to the stand-alone algorithm summary page.  It is the same URL as the algorithm in the Jigsaw library except it ends in “/summary”.  See the link below for an example:</p>

<p><a href="https://public.jigsaw.io/algorithms/646c44ce-be1a-4f04-b4e6-ecc5d7b44e13/summary">https://public.jigsaw.io/algorithms/646c44ce-be1a-4f04-b4e6-ecc5d7b44e13/summary</a></p>

<h2 id="bonus-points--protocols-can-also-be-shared">Bonus points – Protocols Can Also Be Shared!</h2>

<p>Pretty much everything above applies to protocols too.  Not only can an entire protocol be shared from within Jigsaw’s protocol builder, but a stand-alone summary document can also be created, shared, and downloaded.  The entire protocol specification, including the enrollment criteria, algorithms and how they were defined and used (e.g., inclusion, exclusion, outcome, etc.) can be created as a stand-alone, downloadable web page.  For example, see the link below:</p>

<p><a href="https://public.jigsaw.io/studies/b522827a-0ea3-4d22-add7-849c072a5717/summary">https://public.jigsaw.io/studies/b522827a-0ea3-4d22-add7-849c072a5717/summary</a></p>

<h2 id="why-is-this-useful">Why is this useful?</h2>

<p>The reason we designed these features was to allow for sharing and for documentation purposes.  Imagine being able to see someone else’s algorithms and protocols – the exact steps they followed to create cohorts and analysis data sets.  Government agencies, academic researchers, and commercial companies could share their methods without having to share the actual code or data.</p>

<p>Also, simply having all the detailed specifications in one place allows us to retain them as an appendix to technical reports, and saves time and energy in writing study publications.</p>]]></content><author><name>Mark Danese</name></author><category term="algorithms" /><summary type="html"><![CDATA[There was a lot to discuss in last week’s post on the public version of Jigsaw algorithm library, so I will try to make this one short. It is about creating and sharing algorithm (and protocol) summaries with Jigsaw.]]></summary></entry><entry><title type="html">Jigsaw Algorithm Library is Publicly Available</title><link href="https://outcomesinsights.github.io/algorithms/2023/07/11/algorithm-library.html" rel="alternate" type="text/html" title="Jigsaw Algorithm Library is Publicly Available" /><published>2023-07-11T00:00:00+00:00</published><updated>2023-07-11T00:00:00+00:00</updated><id>https://outcomesinsights.github.io/algorithms/2023/07/11/algorithm-library</id><content type="html" xml:base="https://outcomesinsights.github.io/algorithms/2023/07/11/algorithm-library.html"><![CDATA[<p>We are pleased to announce that we have released our algorithm library to the public at <a href="https://public.jigsaw.io">https://public.jigsaw.io</a>.  This is something we have wanted to do for years, but there was always something else that we wanted to include or change.  In the spirit of “perfect is the enemy of good”, we decided that the algorithm library was finally “good enough” to make it available to others.  It is still a work in progress.  We hope researchers who conduct studies using healthcare data find it useful.</p>

<h3 id="why-put-this-out-there">Why put this out there?</h3>

<p>We have spent many years struggling with the same issues that all researchers face when trying to extract information from healthcare data.  Algorithms are hard to find, poorly documented and difficult to implement consistently.  It is time to solve the problem.</p>

<h3 id="who-is-funding-this">Who is funding this?</h3>

<p>This project is completely self-funded.  Please keep in mind that we have limited resources, and we may not be able to accommodate requests.  It turns out that it is surprisingly challenging to make something publicly and freely available, so we appreciate your understanding.</p>

<h3 id="collaboration">Collaboration</h3>

<p>If any organization is interested in collaborating with us on algorithms, we welcome it – even from organizations that might be considered “competitors”.  Algorithms should be freely available and compatible with systems other than our own.  We are willing to see if we can make them work with other systems, even proprietary ones.</p>

<h3 id="feedback">Feedback</h3>

<p>We welcome suggestions.  We are happy to consider algorithm modifications if an algorithm can be improved in any way.  We would like our algorithms to be usable by anyone. Along these lines, we are willing to consider adding other algorithms to our library, as long as they can be made public.</p>

<h3 id="browser-requirements">Browser requirements</h3>

<p>The algorithm library is designed and tested against Google Chrome because of its popularity.  Other browsers may work, but we don’t provide cross-browser support at this time.</p>

<h3 id="but-wait-theres-more">But wait, there’s more!</h3>

<p>In addition to the high-level details above, we also provide more detailed technical information and limitations below, as well as an example algorithm.</p>

<p><a href="#technical-details">Technical details</a></p>

<p><a href="#limitations">Limitations</a></p>

<p><a href="#example-algorithm">Example algorithm</a></p>

<h2 id="technical-details">Technical Details</h2>

<h3 id="what-is-an-algorithm-library">What is an algorithm library?</h3>

<p>Like most libraries, the Jigsaw algorithm library stores algorithms, facilitates searching for algorithms, and contains tools for creating, editing, and versioning algorithms.  The library enables algorithm sharing in two ways: as a stand-alone HTML file downloaded from the website, or via the algorithm’s unique URL.</p>

<h3 id="how-to-use-the-jigsaw-algorithm-library">How to use the Jigsaw algorithm library</h3>

<p>Anyone interested can simply navigate to the <a href="https://public.jigsaw.io/algorithms">algorithm section of the Jigsaw application</a> to find algorithms.  Users can then search for algorithms by different characteristics.  Clicking on any algorithm’s name opens it, and the user is presented with algorithm attributes as well as a diagram of the algorithm specification.  Users can click on the diagram to get more detailed information about the specifications.</p>

<p>We cover some of the <a href="#what-is-an-algorithm">basics below</a>, and we will provide more granular instructions in the future.</p>

<h3 id="what-is-an-algorithm">What is an algorithm?</h3>

<p>An algorithm has two parts – attributes and specifications.</p>

<p>Algorithm attributes include all of the information that makes algorithms friendly to researchers.  This includes titles, tags, documentation, variable names, and even evidence of validity (e.g., sensitivity and specificity).</p>

<p>Algorithm specifications consist of the code(s) and operator(s) that are needed to identify information for an analysis dataset.  The most basic operation of an algorithm is to “select” records with the code(s) of interest.  This means that the simplest possible algorithm specification is simply to select all records using a single code (see <a href="/gdm/2023/05/08/standardization.html">Standardization</a> for more on codes).  However, some algorithms can have thousands of codes. Algorithms can also require operations related to the context in which the code was used (provenance), have temporal specifications (before, after, within, etc.), require relationships to other codes (co-reported), use filters (first, last, etc.), or specify other operations.</p>

<p>Sometimes algorithm specifications require many operations in sequence.  For these use cases, we create special operators to make them easier to implement.  The best example is an operator that looks for a code that occurs at least 1 time in the inpatient setting or at least 2 times in the outpatient setting within specified time intervals (sometimes referred to as a “one inpatient or two outpatient” operation).  These operators make creating algorithm specifications much simpler.</p>
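<p>As an illustration, the core logic of that operator can be sketched in a few lines of Python (a toy version for a single person’s records, not Jigsaw’s implementation; the record layout is invented for this example):</p>

```python
from datetime import date

def one_ip_or_two_op(records, window_days=365):
    """Toy '1 inpatient or 2 outpatient' check for one person's records.

    records: list of (setting, service_date) tuples.
    Returns True given at least one inpatient record, or at least two
    outpatient records dated within `window_days` of each other.
    """
    if any(setting == "inpatient" for setting, _ in records):
        return True
    op_dates = sorted(d for setting, d in records if setting == "outpatient")
    return any((later - earlier).days <= window_days
               for earlier, later in zip(op_dates, op_dates[1:]))

# A single outpatient claim is not enough; two within a year qualify.
claims = [("outpatient", date(2020, 1, 5)), ("outpatient", date(2020, 6, 1))]
print(one_ip_or_two_op(claims))  # True
```
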

<h3 id="what-does-an-algorithm-specification-produce">What does an algorithm specification produce?</h3>

<p>An algorithm specification is a representation of a table that is created and, potentially, modified.  When implemented against data, every algorithm starts with a selection operator to create the first table based on the codes in the specification.  This table can then be passed to subsequent operators for additional modification.  The output of an algorithm is a table that includes all of the relevant records for each person as modified by the specified operations.  As an example, a specification could first select all records with at least one diabetes code based on a list of diabetes codes, and then filter these records to retain only the first code for each person.</p>
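<p>For example, the “select, then keep each person’s first record” flow described above can be sketched in plain Python (toy rows and field names invented for illustration; real output tables carry more columns):</p>

```python
# Toy records: (person_id, code, start_date).
rows = [
    (1, "E11.9", "2020-03-01"),
    (1, "E11.9", "2019-07-15"),
    (2, "I10",   "2020-01-10"),
    (2, "E11.9", "2020-05-02"),
]

diabetes_codes = {"E11.9"}

# Step 1: the selection operator keeps records matching the code set.
selected = [r for r in rows if r[1] in diabetes_codes]

# Step 2: a "first" filter keeps each person's earliest matching record.
first_per_person = {}
for pid, code, start in sorted(selected, key=lambda r: r[2]):
    first_per_person.setdefault(pid, (pid, code, start))

print(sorted(first_per_person.values()))
# [(1, 'E11.9', '2019-07-15'), (2, 'E11.9', '2020-05-02')]
```
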

<p>The core of each table includes a person id, details about the exact table and record that was returned, the start and end dates, the vocabulary (e.g., ICD-10-CM), and the specific code.  Additional columns are dynamically added for selected types of records (e.g., prescription records will add additional columns specific to prescriptions, like dosing information).</p>

<h3 id="how-algorithms-operate">How algorithms operate</h3>

<p>We have defined all algorithm operations that are commonly used for research using ConceptQL (see <a href="/conceptql/2023/04/04/conceptql-crash-course.html">ConceptQL blog post</a>).  Within the algorithm library, there is a hyperlink from each operator to our <a href="https://github.com/outcomesinsights/conceptql_spec#conceptql-specification">ConceptQL specification in Github</a> that provides a more detailed description of how it functions.</p>

<p>Implicitly, all algorithm operations are conducted by person since almost all research is conducted on cohorts of people.  All algorithms are read left to right, and only one table is ever produced at the end of any operation.  This is true even if multiple tables are used as inputs to an operation.  For algorithms with two sides to them, the left side takes the input of the previous operation and modifies it using information from the right side.  Tables on the right side are never passed through and included in the output.</p>
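<p>As a toy illustration of those left/right semantics (invented record shapes, not ConceptQL itself), a two-sided “before” operator might look like this:</p>

```python
from datetime import date, timedelta

def before(left, right, at_least_days=0):
    """Keep left-side records dated at least `at_least_days` before some
    right-side record for the same person.  Right-side records shape the
    result but are never passed through to the output."""
    return [(pid, d) for pid, d in left
            if any(r_pid == pid and d <= r_d - timedelta(days=at_least_days)
                   for r_pid, r_d in right)]

diabetes = [(1, date(2020, 1, 1)), (2, date(2020, 6, 1))]
mi       = [(1, date(2020, 3, 1))]

# Person 1's diabetes record survives (60 days before their MI);
# person 2 has no MI record, so nothing is kept for them.
print(before(diabetes, mi, at_least_days=30))  # [(1, datetime.date(2020, 1, 1))]
```
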

<p>If the above doesn’t make any sense, don’t worry.  We will provide more details about how algorithms operate in a future post.</p>

<h3 id="visual-representation">Visual representation</h3>

<p>We have found that a visual representation of an algorithm specification is extremely helpful for communicating it to others.  As a result, an algorithm is specified using a visual algorithm builder that generates ConceptQL statements and encapsulates the algorithm specification.  <a href="#algorithm-diagram">See below</a> for an example.</p>

<h3 id="algorithm-scope">Algorithm scope</h3>

<p>Do we have every possible operation that might ever be needed?  Probably not.  Hence, it is possible to extend Jigsaw with new operators as the need arises.  However, we expect the need for new operators to be relatively rare.  Many times, the desired output requires some creativity using existing operations.</p>

<p>Alternatively, it may be that the desired operation is really a <strong>protocol specification</strong>, and not an <strong>algorithm</strong> specification.  Defining the difference between an algorithm specification and a protocol specification precisely is beyond the scope of this blog post.  However, in general, an algorithm specifies the variable of interest, and the protocol defines the context in which the algorithm operates.  The separation between an algorithm and a protocol ensures that an algorithm is as reusable as possible.</p>

<p>To make this a bit more concrete, an algorithm might define a myocardial infarction.  Or it might define diabetes diagnosed at least 30 days before a myocardial infarction.  However, a myocardial infarction within 30 days of the index date, or as an inclusion or exclusion criterion, or as a censoring variable – those are ways of using a myocardial infarction algorithm in the context of a protocol.  The same idea applies with clean periods or lookback periods – those are protocol-specific ideas that are not part of an algorithm’s scope.</p>

<p>As a side note, users may see some algorithms in our library that violate this principle.  This is because our thinking about the line between algorithms and protocols has evolved over the years.  As a result, we have added features to our protocol builder to help keep the distinction as clear as possible.  But since we don’t readily remove algorithms we have used, these algorithms remain in the algorithm library.</p>

<h3 id="it-isnt-clear-how-algorithms-are-used-to-create-analysis-data-sets">It isn’t clear how algorithms are used to create analysis data sets</h3>

<p>Protocols, not algorithms, specify the analysis datasets that need to be created.  Protocols specify the index date, inclusion and exclusion criteria, baseline variables, lookback periods, clean periods, date restrictions, and outcome variables.  Protocols use algorithms to specify each variable within the context of a protocol.  It is much like assembling a jigsaw puzzle (protocol) using pieces (algorithms).  Hence, our software is named “Jigsaw”.</p>

<p>As a simple example, the specification for myocardial infarction can be identical regardless of whether it is needed as a baseline variable, index event, or outcome.  However, different records will be selected depending on the context in which it is used.</p>

<p>To illustrate how algorithms fit into protocols, we have made our entire Jigsaw application available, including our protocol builder, albeit in read-only form.  There is a <a href="https://public.jigsaw.io/studies/b522827a-0ea3-4d22-add7-849c072a5717">demonstration protocol</a> available within the application.</p>

<h2 id="limitations">Limitations</h2>

<h3 id="disclaimer">Disclaimer</h3>

<p>While we built Jigsaw and the algorithms carefully, we make no guarantees of correctness or suitability for a specific study.  Jigsaw is a tool to make research easier and more transparent; however, it still requires that users fully understand all aspects of the research they are conducting, their data source(s), and the limitations of real-world data.</p>

<h3 id="is-every-algorithm-ready-for-research">Is every algorithm ready for research?</h3>

<p>The algorithm library contains over 1,000 algorithms and spans many years of development and growth.  Therefore, not every algorithm is ready to be used in its current state.  Some were created as examples, some were created prior to ICD-10-CM data becoming available, and some were created or updated very recently.  Most algorithms are classified as “Draft” even though they have been extensively used.  Some “Finalized” algorithms should probably be updated.  As mentioned above, this is a work in progress and we only update algorithms as we reuse them.  We suggest looking at the date of the most recent update to get a sense of how current an algorithm is.  Users should do their own due diligence before using any algorithm.</p>

<h3 id="is-an-algorithm-specific-to-a-data-model-or-software-system">Is an algorithm specific to a data model or software system?</h3>

<p>The algorithm library is agnostic to both the data structure and the software used for implementation.  The algorithm simply captures the specifications for an algorithm.  For example, an algorithm can encapsulate the idea that we are looking for a “diabetes diagnosis at least 30 days before a myocardial infarction”.</p>

<p>As such, it is not written in SQL, SAS, R, Python, or any language that can be used directly on data.  Instead, it is specified in JSON using the <a href="https://github.com/outcomesinsights/conceptql_spec#conceptql-specification">ConceptQL language</a>.  Jigsaw uses the specification to dynamically generate SQL and implement the algorithm against data.  In theory, ConceptQL could generate something other than SQL, but practically, SQL is so widely available there has been no reason to consider other options.  This approach allows us to generate SQL that is specific to the idiosyncrasies of different database implementations of SQL.  For example, we can generate SQL specific to PostgreSQL, Impala, SQLite or most other database systems that follow SQL standards.</p>

<p>Furthermore, we can generate SQL that works directly against different data models, like the <a href="https://github.com/outcomesinsights/generalized_data_model">Generalized Data Model (GDM)</a>, <a href="https://ohdsi.github.io/CommonDataModel/index.html">OMOP</a>, <a href="https://pcornet.org/data/">PCORnet</a>, and <a href="https://www.sentinelinitiative.org/methods-data-tools/sentinel-common-data-model">Sentinel</a>.  (Note that support for OMOP, PCORnet, and Sentinel is not currently implemented because we have not had a use case for them.)</p>

<p>To see what the SQL code looks like for an algorithm, users can dynamically generate the necessary SQL from inside the algorithm diagram.  The generated SQL targets the Generalized Data Model and produces PostgreSQL queries.  <a href="#example-algorithm">See below</a> for a complete example of an algorithm that includes the diagram, attributes, ConceptQL JSON statement, and the resulting SQL.</p>

<h3 id="why-not-make-the-entire-application-available">Why not make the entire application available?</h3>

<p>While Jigsaw is copyrighted and remains our intellectual property, our hope is that we can make its full functionality available to other researchers.  To do that, we must make ETL easier first, which is <a href="/data/2023/06/28/etl-is-hell.html">not a trivial issue</a>.  We have made progress on that front as well, and are willing to collaborate with organizations interested in adopting the Generalized Data Model and/or Jigsaw.</p>

<h2 id="example-algorithm">Example Algorithm</h2>

<p>Below is an example algorithm that identifies diabetes diagnoses occurring at least 30 days prior to a myocardial infarction.  Both ICD-9-CM and ICD-10-CM codes are used to identify diabetes and myocardial infarction.  Note that this is an <strong>example</strong> and may not represent the correct codes or a clinically meaningful event.</p>
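<p>The temporal requirement at the heart of this example can be illustrated with a small, self-contained sketch.  This is a toy illustration of the “at least 30 days before” check, not how Jigsaw implements it.</p>

```python
from datetime import date
from operator import ge  # ge(a, b) reads as "a is at least b"

def at_least_days_before(diagnosis_date, event_date, days=30):
    """True when the diagnosis precedes the event by at least `days` days."""
    return ge((event_date - diagnosis_date).days, days)

# Toy dates, not real patient data
qualifies = at_least_days_before(date(2020, 1, 1), date(2020, 3, 1))   # 60 days apart
too_close = at_least_days_before(date(2020, 1, 1), date(2020, 1, 15))  # 14 days apart
```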

<h3 id="algorithm-diagram">Algorithm diagram</h3>

<p><img src="/images/diabetes_30d_before_mi.png" alt="Diagram of the example algorithm: diabetes diagnosis codes required to occur at least 30 days before myocardial infarction codes." /></p>

<h3 id="algorithm-attributes">Algorithm attributes</h3>

<p><img src="/images/diabetes_mi_metadata.png" alt="Attributes for the example algorithm." /></p>

<h3 id="conceptql">ConceptQL</h3>

<p>Below is the ConceptQL statement that corresponds to the diagram above.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[
  "before",
  {
    "left": [
      "union",
      [
        "icd9cm",
        "250.00",
        "250.01",
        "250.02",
        "250.03",
        "250.10",
        "250.11",
        "250.12",
        "250.13",
        "250.20",
        "250.21",
        "250.22",
        "250.23",
        "250.30",
        "250.31",
        "250.32",
        "250.33",
        "250.80",
        "250.81",
        "250.82",
        "250.83",
        "250.90",
        "250.91",
        "250.92",
        "250.93"
      ],
      [
        "icd10cm",
        "E10.10",
        "E10.11",
        "E10.61",
        "E10.61",
        "E10.62",
        "E10.62",
        "E10.62",
        "E10.62",
        "E10.63",
        "E10.63",
        "E10.64",
        "E10.64",
        "E10.65",
        "E10.69",
        "E10.8",
        "E10.9",
        "E11.00",
        "E11.01",
        "E11.10",
        "E11.11",
        "E11.61",
        "E11.61",
        "E11.62",
        "E11.62",
        "E11.62",
        "E11.62",
        "E11.63",
        "E11.63",
        "E11.64",
        "E11.64",
        "E11.65",
        "E11.69",
        "E11.8",
        "E11.9",
        "E13.00",
        "E13.01",
        "E13.10",
        "E13.11",
        "E13.61",
        "E13.61",
        "E13.62",
        "E13.62",
        "E13.62",
        "E13.62",
        "E13.63",
        "E13.63",
        "E13.64",
        "E13.64",
        "E13.65",
        "E13.69",
        "E13.8",
        "E13.9"
      ]
    ],
    "right": [
      "union",
      [
        "icd9cm",
        "410.01",
        "410.11",
        "410.21",
        "410.31",
        "410.41",
        "410.51",
        "410.61",
        "410.71",
        "410.81",
        "410.91"
      ],
      [
        "icd10cm",
        "I21.01",
        "I21.02",
        "I21.09",
        "I21.11",
        "I21.19",
        "I21.21",
        "I21.29",
        "I21.3",
        "I21.4",
        "I21.9",
        "I21.A1",
        "I21.A9",
        "I22.0",
        "I22.1",
        "I22.2",
        "I22.8",
        "I22.9"
      ]
    ],
    "at_least": "30d"
  }
]
</code></pre></div></div>
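<p>Because the statement is plain JSON, it is easy to inspect programmatically.  The sketch below is a hypothetical walker (not part of ConceptQL) that collects the codes for each vocabulary operator from a statement with the same shape as the one above, using a shortened code list for brevity.</p>

```python
import json

# Hypothetical helper: walks a ConceptQL-style nested structure and groups
# codes by vocabulary operator.  The statement is abbreviated for illustration.
statement = json.loads('''
["before", {"left":  ["union", ["icd9cm", "250.00", "250.01"],
                               ["icd10cm", "E11.9"]],
            "right": ["union", ["icd9cm", "410.01"]],
            "at_least": "30d"}]
''')

def collect_codes(node, out):
    """Recursively gather codes from vocabulary operators into `out`."""
    if isinstance(node, list) and node and isinstance(node[0], str):
        if node[0] in ("icd9cm", "icd10cm"):
            out.setdefault(node[0], []).extend(node[1:])
        else:
            for child in node[1:]:
                collect_codes(child, out)
    elif isinstance(node, dict):
        for value in node.values():
            collect_codes(value, out)

codes = {}
collect_codes(statement, codes)
```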

<h3 id="sql">SQL</h3>

<p>The code below is generated from the ConceptQL specification.  This particular SQL creates an output table from data stored in a PostgreSQL relational database that follows the Generalized Data Model schema.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>SELECT
    *
FROM ( WITH "before_22_3_24a2b6fb9b0f37c3735c654d1ffd3cab" AS MATERIALIZED (
        SELECT
            *
        FROM (
            SELECT
                "person_id",
                "criterion_id",
                "criterion_table",
                "criterion_domain",
                "start_date",
                "end_date",
                "source_value",
                "source_vocabulary_id",
                "label"
            FROM (
                SELECT
                    *
                FROM (
                    SELECT
                        *
                    FROM (
                        SELECT
                            *
                        FROM (
                            WITH "union_22_2_64ac7676c653ae02d50931398ee95903" AS MATERIALIZED (
                                SELECT
                                    *
                                FROM (
                                    SELECT
                                        "person_id",
                                        "criterion_id",
                                        "criterion_table",
                                        "criterion_domain",
                                        "start_date",
                                        "end_date",
                                        "source_value",
                                        "source_vocabulary_id",
                                        "label"
                                    FROM (
                                        SELECT
                                            "person_id",
                                            "criterion_id",
                                            "criterion_table",
                                            "criterion_domain",
                                            "start_date",
                                            "end_date",
                                            "source_value",
                                            "source_vocabulary_id",
                                            "label"
                                        FROM (
                                            SELECT
                                                *
                                            FROM (
                                                SELECT
                                                    "patient_id" AS "person_id",
                                                    "id" AS "criterion_id",
                                                    CAST(
                                                        'clinical_codes' AS text
) AS "criterion_table",
                                                    CAST(
                                                        'condition_occurrence' AS text
) AS "criterion_domain",
                                                    "start_date",
                                                    "end_date",
                                                    CAST(
                                                        "clinical_code_source_value" AS text
) AS "source_value",
                                                    CAST(
                                                        "clinical_code_vocabulary_id" AS text
) AS "source_vocabulary_id",
                                                    CAST(
                                                        NULL AS text
) AS "label"
                                                FROM
                                                    "clinical_codes"
                                                WHERE ((
                                                        "clinical_code_concept_id" IN (
                                                            SELECT
                                                                "id"
                                                            FROM
                                                                "concepts"
                                                            WHERE ((
                                                                    "vocabulary_id" = 'ICD9CM'
)
                                                                AND (
                                                                    "concept_code" IN (
                                                                        '250.00', '250.01', '250.02', '250.03', '250.10', '250.11', '250.12', '250.13', '250.20', '250.21', '250.22', '250.23', '250.30', '250.31', '250.32', '250.33', '250.80', '250.81', '250.82', '250.83', '250.90', '250.91', '250.92', '250.93'
)
)
)
)
)
                                                    OR (
                                                        "clinical_code_concept_id" IN (
                                                            SELECT
                                                                "id"
                                                            FROM
                                                                "concepts"
                                                            WHERE ((
                                                                    "vocabulary_id" = 'ICD10CM'
)
                                                                AND (
                                                                    "concept_code" IN (
                                                                        'E10.10', 'E10.11', 'E10.61', 'E10.61', 'E10.62', 'E10.62', 'E10.62', 'E10.62', 'E10.63', 'E10.63', 'E10.64', 'E10.64', 'E10.65', 'E10.69', 'E10.8', 'E10.9', 'E11.00', 'E11.01', 'E11.10', 'E11.11', 'E11.61', 'E11.61', 'E11.62', 'E11.62', 'E11.62', 'E11.62', 'E11.63', 'E11.63', 'E11.64', 'E11.64', 'E11.65', 'E11.69', 'E11.8', 'E11.9', 'E13.00', 'E13.01', 'E13.10', 'E13.11', 'E13.61', 'E13.61', 'E13.62', 'E13.62', 'E13.62', 'E13.62', 'E13.63', 'E13.63', 'E13.64', 'E13.64', 'E13.65', 'E13.69', 'E13.8', 'E13.9'
)
)
)
)
)
) -- #&lt;ConceptQL::Operator::icd9cm @arguments=["250.00", "250.01", "250.02", "250.03", "250.10", "250.11", "250.12", "250.13", "250.20", "250.21", "250.22", "250.23", "250.30", "250.31", "250.32", "250.33", "250.80", "250.81", "250.82", "250.83", "250.90", "250.91", "250.92", "250.93"]&gt;
) AS "t1"
) AS "t1"
) AS "t1" -- #&lt;ConceptQL::Operators::Union @upstreams=[#&lt;ConceptQL::Operator::icd9cm @arguments=["250.00", "250.01", "250.02", "250.03", "250.10", "250.11", "250.12", "250.13", "250.20", "250.21", "250.22", "250.23", "250.30", "250.31", "250.32", "250.33", "250.80", "250.81", "250.82", "250.83", "250.90", "250.91", "250.92", "250.93"]&gt;, #&lt;ConceptQL::Operator::icd10cm @arguments=["E10.10", "E10.11", "E10.61", "E10.61", "E10.62", "E10.62", "E10.62", "E10.62", "E10.63", "E10.63", "E10.64", "E10.64", "E10.65", "E10.69", "E10.8", "E10.9", "E11.00", "E11.01", "E11.10", "E11.11", "E11.61", "E11.61", "E11.62", "E11.62", "E11.62", "E11.62", "E11.63", "E11.63", "E11.64", "E11.64", "E11.65", "E11.69", "E11.8", "E11.9", "E13.00", "E13.01", "E13.10", "E13.11", "E13.61", "E13.61", "E13.62", "E13.62", "E13.62", "E13.62", "E13.63", "E13.63", "E13.64", "E13.64", "E13.65", "E13.69", "E13.8", "E13.9"]&gt;]&gt;
) AS "t1"
)
                            SELECT
                                *
                            FROM
                                "union_22_2_64ac7676c653ae02d50931398ee95903"
) AS "l"
) AS "l"
                    WHERE (
                        EXISTS (
                            SELECT
                                1
                            FROM (
                                SELECT
                                    *
                                FROM (
                                    SELECT
                                        "person_id",
                                        max(
                                            "start_date"
) AS "start_date"
                                    FROM (
                                        WITH "union_22_1_d3d7bbd424bca0fe3b3f966d4ee80692" AS MATERIALIZED (
                                            SELECT
                                                *
                                            FROM (
                                                SELECT
                                                    "person_id",
                                                    "criterion_id",
                                                    "criterion_table",
                                                    "criterion_domain",
                                                    "start_date",
                                                    "end_date",
                                                    "source_value",
                                                    "source_vocabulary_id",
                                                    "label"
                                                FROM (
                                                    SELECT
                                                        "person_id",
                                                        "criterion_id",
                                                        "criterion_table",
                                                        "criterion_domain",
                                                        "start_date",
                                                        "end_date",
                                                        "source_value",
                                                        "source_vocabulary_id",
                                                        "label"
                                                    FROM (
                                                        SELECT
                                                            *
                                                        FROM (
                                                            SELECT
                                                                "patient_id" AS "person_id",
                                                                "id" AS "criterion_id",
                                                                CAST(
                                                                    'clinical_codes' AS text
) AS "criterion_table",
                                                                CAST(
                                                                    'condition_occurrence' AS text
) AS "criterion_domain",
                                                                "start_date",
                                                                "end_date",
                                                                CAST(
                                                                    "clinical_code_source_value" AS text
) AS "source_value",
                                                                CAST(
                                                                    "clinical_code_vocabulary_id" AS text
) AS "source_vocabulary_id",
                                                                CAST(
                                                                    NULL AS text
) AS "label"
                                                            FROM
                                                                "clinical_codes"
                                                            WHERE ((
                                                                    "clinical_code_concept_id" IN (
                                                                        SELECT
                                                                            "id"
                                                                        FROM
                                                                            "concepts"
                                                                        WHERE ((
                                                                                "vocabulary_id" = 'ICD9CM'
)
                                                                            AND (
                                                                                "concept_code" IN (
                                                                                    '410.01', '410.11', '410.21', '410.31', '410.41', '410.51', '410.61', '410.71', '410.81', '410.91'
)
)
)
)
)
                                                                OR (
                                                                    "clinical_code_concept_id" IN (
                                                                        SELECT
                                                                            "id"
                                                                        FROM
                                                                            "concepts"
                                                                        WHERE ((
                                                                                "vocabulary_id" = 'ICD10CM'
)
                                                                            AND (
                                                                                "concept_code" IN (
                                                                                    'I21.01', 'I21.02', 'I21.09', 'I21.11', 'I21.19', 'I21.21', 'I21.29', 'I21.3', 'I21.4', 'I21.9', 'I21.A1', 'I21.A9', 'I22.0', 'I22.1', 'I22.2', 'I22.8', 'I22.9'
)
)
)
)
)
) -- #&lt;ConceptQL::Operator::icd9cm @arguments=["410.01", "410.11", "410.21", "410.31", "410.41", "410.51", "410.61", "410.71", "410.81", "410.91"]&gt;
) AS "t1"
) AS "t1"
) AS "t1" -- #&lt;ConceptQL::Operators::Union @upstreams=[#&lt;ConceptQL::Operator::icd9cm @arguments=["410.01", "410.11", "410.21", "410.31", "410.41", "410.51", "410.61", "410.71", "410.81", "410.91"]&gt;, #&lt;ConceptQL::Operator::icd10cm @arguments=["I21.01", "I21.02", "I21.09", "I21.11", "I21.19", "I21.21", "I21.29", "I21.3", "I21.4", "I21.9", "I21.A1", "I21.A9", "I22.0", "I22.1", "I22.2", "I22.8", "I22.9"]&gt;]&gt;
) AS "t1"
)
                                        SELECT
                                            *
                                        FROM
                                            "union_22_1_d3d7bbd424bca0fe3b3f966d4ee80692"
) AS "t1"
                                    GROUP BY
                                        "person_id"
) AS "r"
) AS "r"
                            WHERE ((
                                    "l"."person_id" = "r"."person_id"
)
                                AND (
                                    "l"."end_date" &lt; CAST((
                                            CAST(
                                                "r"."start_date" AS timestamp
) + make_interval(
                                                days := - 30
)
) AS date
)
)
)
)
)
) AS "t1"
) AS "t1" -- #&lt;ConceptQL::Operators::Before @upstreams={:left=&gt; #&lt;ConceptQL::Operators::Union @upstreams=[#&lt;ConceptQL::Operator::icd9cm @arguments=["250.00", "250.01", "250.02", "250.03", "250.10", "250.11", "250.12", "250.13", "250.20", "250.21", "250.22", "250.23", "250.30", "250.31", "250.32", "250.33", "250.80", "250.81", "250.82", "250.83", "250.90", "250.91", "250.92", "250.93"]&gt;, #&lt;ConceptQL::Operator::icd10cm @arguments=["E10.10", "E10.11", "E10.61", "E10.61", "E10.62", "E10.62", "E10.62", "E10.62", "E10.63", "E10.63", "E10.64", "E10.64", "E10.65", "E10.69", "E10.8", "E10.9", "E11.00", "E11.01", "E11.10", "E11.11", "E11.61", "E11.61", "E11.62", "E11.62", "E11.62", "E11.62", "E11.63", "E11.63", "E11.64", "E11.64", "E11.65", "E11.69", "E11.8", "E11.9", "E13.00", "E13.01", "E13.10", "E13.11", "E13.61", "E13.61", "E13.62", "E13.62", "E13.62", "E13.62", "E13.63", "E13.63", "E13.64", "E13.64", "E13.65", "E13.69", "E13.8", "E13.9"]&gt;]&gt;, :right=&gt; #&lt;ConceptQL::Operators::Union @upstreams=[#&lt;ConceptQL::Operator::icd9cm @arguments=["410.01", "410.11", "410.21", "410.31", "410.41", "410.51", "410.61", "410.71", "410.81", "410.91"]&gt;, #&lt;ConceptQL::Operator::icd10cm @arguments=["I21.01", "I21.02", "I21.09", "I21.11", "I21.19", "I21.21", "I21.29", "I21.3", "I21.4", "I21.9", "I21.A1", "I21.A9", "I22.0", "I22.1", "I22.2", "I22.8", "I22.9"]&gt;]&gt;}&gt;
) AS "t1"
)
    SELECT
        *
    FROM
        "before_22_3_24a2b6fb9b0f37c3735c654d1ffd3cab") AS "t1"
</code></pre></div></div>]]></content><author><name>Mark Danese</name></author><category term="algorithms" /><summary type="html"><![CDATA[We are pleased to announce that we have released our algorithm library to the public at https://public.jigsaw.io. This is something we have wanted to do for years, but there was always something else that we wanted to include or change. In the spirit of “perfect is the enemy of good”, we decided that the algorithm library was finally “good enough” to make it available to others. It is still a work in progress. We hope researchers who conduct studies using healthcare data find it useful.]]></summary></entry></feed>