ETL Is Hell
They say there is no such thing as a free lunch. When it comes to common data models (CDMs), this is most certainly true.
There are several different CDMs, each offering its own specialized database schema. The promise is that once the data is loaded into this schema, there is a suite of tools to leverage for exploration, cohort creation, and/or analyses.
This is a very tantalizing prospect. Free data model! Free tools! What’s the catch?
The catch is “once the data is loaded into this schema”. The catch is the huge chasm between the data in its current structure and the data in the CDM’s data structure. The catch is you need to Extract, Transform, and Load (ETL) your data.
I cannot overstate how time-intensive, resource-intensive, detail-oriented, and downright tedious the ETL process is. Depending on the size and complexity of the data, expect to spend thousands of dollars and/or many months on ETL.
Surely, I’m Wrong
I’m not. I wish I were. I’ve witnessed multi-billion dollar corporations struggle with ETL for years. I’ve personally written several ETL programs. I’ve used several more. ETL is very, very complicated. And it is complicated because, particularly with healthcare data, the details matter.
First of all, there is one universal truth: the larger the incoming data, the longer each step takes to run (or the more resources are required to process the data). Doing an ETL will either burn time or money, and probably both.
Let’s look at some of the fundamental issues with each step:
- Extract (E)
- As I outlined in a previous post, data from vendors is dirty
- Most of the ETL tools I’ve encountered are powerful on the T (Transform) side but lacking in the E (Extract) department
- Simply extracting incoming data from its original format into a more usable format can take many, many hours and require many different specialized programs and libraries to get the job done
- Transform (T)
- Requires substantial knowledge about how healthcare data is coded and used
- Requires intimate knowledge of the source data schema
- Requires intimate knowledge of the target data schema
- Often requires a tremendous amount of custom, one-off code to transform data from the raw schema to the CDM schema (a small sketch of what that code looks like follows this list)
- Takes a great deal of iteration and refinement
- The “wider” the data, the more complex the transformation
- As the complexity or number of raw data fields increases, so does the complexity of the transformation
- The number of patients has less impact on the cost of transformation
- This is practically a fixed cost. Whether there are 157 patients’ worth of data or 157 million, the complexity of the transformation is the same. However, unanticipated data issues are more likely to surface with 157 million patients, which may make the transformation marginally more complex
- Load (L)
- Hopefully the person doing the ETL has control over the output format of the transformation step
- That said, I’ve seen situations where the ETL is implemented by a team separate from those loading the data. In that case, a lot of careful coordination has to take place
- Loading may also depend on the database that is being used – some are faster than others for large data
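To make the Transform bullets above concrete, here is a minimal sketch of the kind of one-off mapping code an ETL accumulates. Everything in it is a stand-in: the vendor column names (MBR_ID, SEX_CD, DOB), the gender code mapping, and the simplified person table are illustrative assumptions, not any vendor’s or CDM’s actual specification, and a real transform spans dozens of tables with far more edge cases.

```python
import pandas as pd

# Illustrative code-to-concept mapping; a real CDM uses a full vocabulary, not a dict.
GENDER_MAP = {"M": 8507, "F": 8532}

def transform_person(source: pd.DataFrame) -> pd.DataFrame:
    """Map a hypothetical vendor member table (MBR_ID, SEX_CD, DOB) onto a simplified person table."""
    dob = pd.to_datetime(source["DOB"], errors="coerce")  # dirty dates become NaT instead of crashing
    person = pd.DataFrame({
        "person_source_value": source["MBR_ID"].astype(str),
        "gender_concept_id": source["SEX_CD"].str.upper().map(GENDER_MAP).fillna(0).astype(int),
        "year_of_birth": dob.dt.year,
        "month_of_birth": dob.dt.month,
        "day_of_birth": dob.dt.day,
    })
    # Even this tiny mapping hides decisions: here we silently drop rows with an
    # unusable birth date, which may or may not be the right call for a given study.
    return person[dob.notna()].reset_index(drop=True)
```

The point is not the handful of mapping lines; it is that every field forces a judgment call (what to do with an unparseable birth date, an unknown sex code, a duplicated member ID), and those judgment calls are exactly what make the code one-off.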
First Rule of Doing ETL In-House: Don’t
There are entire companies that have emerged around these CDMs and specialize in ETLing data on behalf of their clients. If you have some data and thousands of dollars, your money and time are probably best spent using one of these companies to do the ETL for you.
If you’re a multi-billion dollar company with a huge IT staff, a data lake, and a huge budget, don’t do your own in-house ETL. Maybe you think your staff has the chops for ETL. Most likely, they don’t. Maybe you think you’ll save money in the long run building a tool in-house. Most likely, you won’t. At least, we have not seen it happen. Don’t forget that doing an ETL requires substantial knowledge of healthcare data and how it is used in research, and that knowledge is not readily available.
But maybe you’re a lean, mean small business or academic institution with no budget but some skilled people. I’d still argue that you should avoid in-house ETL, if that is possible.
Doing ETL In-House
If you’re a small business with tighter margins and some talented folks, then maybe you’ll do ETL in house. You shouldn’t, but sometimes you have to. So in this case, what do you do?
Find a Commercial Off-the-Shelf (COTS) Solution
Admittedly, I haven’t spent a lot of time in this space. My understanding of ETL leads me to believe that each ETL project is so unique that it is difficult to create a COTS solution that doesn’t require an extremely robust support model. At that point, it makes more sense for the COTS provider to perform the ETL on behalf of a client than to expect the client to learn the tool and perform the ETL themselves.
The COTS products I have looked at are far too anemic to handle the fiddly kinds of transformations required.
Look to the Open Source Community
If you’re like me, your first instinct is to hit GitHub and find a nice, open source ETL tool that handles a lot of the above issues for you.
Unfortunately, I don’t know of a single CDM that includes a well-supported ETL tool as part of its offering. I’ve seen maybe one or two tools that were haphazardly thrown over the fence, poorly maintained, and even more poorly documented. Outside the institution where they were developed, they are virtually useless.
I have some speculations as to why free, open-source ETL tools for healthcare data don’t proliferate around the web.
- Many ETLs are written as one-off programs with no expectation of re-use
- Most ETLs are bound to a certain storage engine, such as a particular RDBMS, and can’t be easily reused without the specialized tooling they leverage
- Each incoming data set is unique, making it hard to write a generalized ETL
- Most data vendors consider their documentation and even their data’s schema to be guarded intellectual property, making it illegal to share that information without permission
- Most healthcare data is guarded by HIPAA, making it impossible to share example data sets
- Data is dirty and it is a lot of work to generalize data ingestion into an ETL tool
- Creating an ETL takes a lot of time and labor
- It is a huge cost and properly open-sourcing a project of that magnitude is even more costly
- There is very little return on that kind of investment
- ETL, unlike a lot of open source software, is rarely a personal, hobbyist project, but instead the undertaking of a corporation that is guarded about their intellectual property
Build Your Own System (Like Us!)
If you’ve been tasked with writing an in-house ETL despite the warnings above, all I can offer you are my condolences and a few observations:
- Don’t expect great documentation from vendors
- Plan to have a point of contact from the vendor to ask questions, because there will be questions
- The source schema and data aren’t coming for weeks, if not months
- Start learning about the target CDM while you wait
- Extract the source data from its original format and cache it in something faster and easier to work with
- As I’ve said before, incoming data is dirty and many incoming formats are slow to read (a sketch of this extract-and-cache step, together with the exploration steps below, follows this list)
- Get good at data exploration
- If you’re lucky, the data documentation you get will show you things like counts and maybe some of the values you can expect in each field
- Know how to grab sample values, averages, mins, maxes, and other handy stats about each column
- Quickly identify and eliminate blank columns
- The fewer the columns you have to think about, the more time you can focus on the ones that matter
- Don’t be afraid to reshape the source data
- Oftentimes my ETL is made much easier if I build a few of my own source tables from existing source tables
- Doing a mild ETL on the source data can make the main ETL process a lot easier (see the reshape sketch after this list)
- Start with the easy stuff
- I normally focus on getting basic patient demographics done first, just to get the easy win
- Work your way through one source table at a time
- It takes me at least two months to learn enough about a source data set to be able to complete my ETLs
- Maybe I’m slow and stupid, but there is a lot of work that goes into
- Extracting and preparing the source data
- Exploring the source data
- Learning the data documentation
- Getting clarifications about issues with the data
- Understanding what the source data actually contains
- Writing out the actual ETL
- Actually waiting hours, sometimes days for ETL processes to complete
- Verifying the results of the ETL
- And don’t forget that The Mythical Man-Month plays into this – you can’t throw a team of 8 at an ETL and expect it to take 1 week
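Here is a small sketch of the extract-and-cache and exploration advice above, assuming pandas plus pyarrow are available. The file name, delimiter, encoding, and columns are all hypothetical; real vendor extracts are usually messier (fixed-width layouts, odd encodings, files split across dozens of parts).

```python
import pandas as pd

# Extract once from the slow vendor format and cache it as Parquet
# (requires pyarrow), so every later exploration pass is fast.
raw = pd.read_csv("member_2024.txt", sep="|", dtype=str, encoding="latin-1")
raw.to_parquet("member_2024.parquet", index=False)

# Exploration: non-null counts, distinct counts, and a few sample values per column.
df = pd.read_parquet("member_2024.parquet")
profile = pd.DataFrame({
    "non_null": df.notna().sum(),
    "distinct": df.nunique(),
    "sample": df.apply(lambda col: col.dropna().unique()[:5].tolist()),
})
print(profile.sort_values("non_null"))

# Quickly identify and eliminate blank columns so they stop eating attention.
blank_cols = [c for c in df.columns if df[c].notna().sum() == 0]
df = df.drop(columns=blank_cols)
print(f"Dropped {len(blank_cols)} empty columns: {blank_cols}")
```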
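And here is a sketch of the “reshape the source data” idea: building one derived source table up front so the main ETL doesn’t repeat the same join and aggregation in every downstream step. The table and column names are again hypothetical.

```python
import pandas as pd

# Hypothetical source tables already cached as Parquet by the extract step.
header = pd.read_parquet("claim_header.parquet")  # one row per claim
lines = pd.read_parquet("claim_line.parquet")     # many rows per claim

# Collapse line-level detail to one row per claim.
line_summary = (
    lines.groupby("CLAIM_ID", as_index=False)
         .agg(
             line_count=("LINE_NO", "size"),
             total_paid=("PAID_AMT", "sum"),
             first_service=("SVC_DATE", "min"),
             last_service=("SVC_DATE", "max"),
         )
)

# Persist the reshaped result as a new "source" table for the main ETL to read.
claims = header.merge(line_summary, on="CLAIM_ID", how="left")
claims.to_parquet("derived_claim.parquet", index=False)
```

Reading the pre-shaped derived table once is both faster and easier to reason about than re-deriving the same claim-level summary inside every transform that needs it.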
So, About Your Free Lunch
I hope this post has revealed some of the hazards that come along with ETLing data. As tempting as CDMs and their suites of tools are, ETL has a very real, and often ignored, cost. Every organization looking to take advantage of a CDM must consider the cost of ETLing their data into that CDM. Bear in mind everything I discuss above is for ETLing a single data set. This process must be repeated for each additional data set, and then redone as the data is updated. And once it is all up and working, don’t forget to pray that the data provider doesn’t change their data structure. Because if that happens, you might be subject to everything described above again.