Data In(di)gestion

After decades of dealing with other people’s data, it is clear that raw data is dirty. Data suppliers almost always provide data with flaws or eccentricities. In the spirit of “misery loves company,” we thought it would be therapeutic to describe just how frustrating and time-consuming data ingestion can be. These observations are based on years spent ingesting data from various vendors in multiple formats, and moving it into different storage systems and databases.

CSV Isn’t A Standard Standard

CSV doesn’t actually have a single standard that everyone follows. The name implies commas as field separators, but files routinely arrive delimited by pipes, tabs, or semicolons. Fields may contain the delimiter itself, which forces quoting, and the quote character can in turn appear inside a field. Escaping these characters (i.e., distinguishing a literal delimiter or quote from a structural one) is complex.
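
To make this concrete, here is a minimal sketch using Python’s standard csv module. The pipe-delimited record and its field values are made up, but they show how quoting and quote-doubling come into play once a field contains the delimiter or a quote:

```python
import csv
import io

# Hypothetical pipe-delimited data in which one field contains both the
# delimiter and an embedded quote, so the field must be quoted and the
# inner quotes doubled.
raw = 'id|note\n42|"Patient said ""fine"" | no follow-up needed"\n'

for row in csv.reader(io.StringIO(raw), delimiter='|', quotechar='"'):
    print(row)
# ['id', 'note']
# ['42', 'Patient said "fine" | no follow-up needed']
```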

CSV doesn’t store a specification of the data types in each column, requiring some external metadata about the columns to properly reconstitute them.

There are dozens of different CSV parsing libraries, and they handle odd edge cases differently. If a quoted field contains a carriage return or newline, only some libraries will treat the record as one continuous row; others split it in two. Some libraries can also guess column types intelligently, but guessing isn’t nearly as reliable as an actual specification.

NULL data is difficult to represent. Is it a blank field? A field containing just “”? Some other sentinel value?
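
One way we typically compensate, sketched below, is to hand pandas an explicit type map and a list of sentinel NULL values instead of letting it guess; the column names and the sentinel “-9” here are hypothetical:

```python
import io
import pandas as pd

raw = "member_id,dx_code,paid_amount\n00123,E11.9,450.00\n00456,-9,\n"

# Left to guess, pandas would read member_id as an integer and drop the
# leading zeros, and it would not know that "-9" is a NULL sentinel.
df = pd.read_csv(
    io.StringIO(raw),
    dtype={"member_id": "string", "dx_code": "string", "paid_amount": "float64"},
    na_values=["-9", ""],        # sentinel values that should become NULL
    keep_default_na=False,
)
print(df.dtypes)
print(df)
```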

Fixed-width Follies

Fixed-width files are another popular format, and I’ve seen all kinds of problems with them.

First off, the description of how to read the fixed-width record has to be stored in some other file, normally a SAS input file, because SAS supports reading fixed-width files along with some advanced logic for describing them. SAS is semi-ubiquitous, but its fixed-width specification is proprietary, meaning one needs either a SAS license to read the input file or access to some open-source library that can read and execute SAS input files to determine the proper format. I ended up having to write one myself, SASin, just to read in a raw data file without needing SAS. But, to be clear, there is really no reason that a fixed-width file specification has to be in SAS. The same information could be provided in a machine-readable text file.
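
To show what a non-proprietary layout might look like, here is a sketch that keeps the column positions in a plain Python list and reads the file with pandas.read_fwf; the layout and column names are invented for illustration:

```python
import io
import pandas as pd

# A hypothetical layout: (column name, start position, width) -- the same
# information a SAS input file carries, but readable from any language.
layout = [("member_id", 0, 9), ("sex", 9, 1), ("birth_year", 10, 4)]

raw = "000012345M1972\n000067890F1988\n"

df = pd.read_fwf(
    io.StringIO(raw),
    colspecs=[(start, start + width) for _, start, width in layout],
    names=[name for name, _, _ in layout],
    dtype=str,
)
print(df)
```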

Even once the fixed-width layout is determined, I’ve sometimes found that it doesn’t match the file it describes. I’ve seen many instances where sentinel values contain three digits but the field is defined as only two.
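
A cheap sanity check, reusing the hypothetical layout from the sketch above, is to compare each record’s length against the total width the layout promises; over-wide sentinel values usually reveal themselves this way (the file name is a placeholder):

```python
# `layout` is the hypothetical (name, start, width) list from the sketch above.
expected = sum(width for _, _, width in layout)

with open("enrollment.dat") as fh:
    for n, line in enumerate(fh, start=1):
        record = line.rstrip("\n")
        if len(record) != expected:
            print(f"line {n}: expected {expected} characters, got {len(record)}")
```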

SAS Shouldn’t Be A Standard

Occasionally, I’ll get data shipped to me as sas7bdat files. This is another proprietary SAS format, and it essentially assumes the recipient has a SAS license in order to ingest the data. This can be worked around with open-source sas7bdat readers that are unsupported by SAS; the R package haven includes one, for example.

There are also SAS transport (.xpt) files, which SAS officially supports for data transport and which avoid some of the issues with sas7bdat files, but I’ve never encountered one in the wild from a provider of claims or EHR data. (The NHANES data is provided in this format, however.)
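
For what it’s worth, pandas bundles an unofficial reader for both formats, so neither strictly requires a SAS license; the file names below are just illustrative:

```python
import pandas as pd

# Unofficial (not SAS-supported) readers that ship with pandas.
claims = pd.read_sas("claims_2019.sas7bdat", format="sas7bdat", encoding="latin-1")
demo = pd.read_sas("DEMO_J.XPT", format="xport")   # e.g., an NHANES transport file
print(claims.shape, demo.shape)
```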

Bad Bytes

It is rare, but not impossible, for a file to contain bad bytes (sequences that aren’t valid in the declared encoding). When a fragile software library tries to read them, it throws an error and the entire ingestion dies, sometimes hours into the run.
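
One defensive pattern, sketched below, is to decode permissively and flag the damage rather than letting a single bad byte abort a multi-hour load; the file name is a placeholder:

```python
# Replace undecodable bytes with U+FFFD instead of raising, so the load
# survives and the suspect rows can be inspected afterward.
with open("claims_raw.csv", encoding="utf-8", errors="replace") as fh:
    for n, line in enumerate(fh, start=1):
        if "\ufffd" in line:
            print(f"suspect bytes on line {n}")
```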

Split Files

Almost all data providers will split large collections of data across multiple files. For instance, provider claims will be split by year, and sometimes further split within the year. Even more fun is when some years contain different columns or are structured differently than other years.

Files are also split by file type (inpatient, outpatient, etc.), and the outpatient data is often split more finely than the inpatient data simply because it is so much larger. It is not uncommon to have over 100 files of raw data to load and parse.

When preparing data for a longitudinal study, we need to harmonize these files and stitch them back together in a sensible way so they can be queried properly. Since most analyses are done on a per-person basis, splitting the data by year or file type makes re-organizing it by patient more difficult.
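
The stitching itself usually looks something like the sketch below: read every yearly split, harmonize the column names, concatenate on the union of the columns, and re-sort by patient. The file pattern and the member_id column are assumptions, not any particular vendor’s layout:

```python
import glob
import pandas as pd

parts = []
for path in sorted(glob.glob("outpatient_*.csv")):             # hypothetical yearly splits
    part = pd.read_csv(path, dtype=str)
    part.columns = [c.strip().lower() for c in part.columns]   # harmonize column names
    parts.append(part)

# concat aligns on the union of columns, leaving NaN where a year lacks a column
outpatient = pd.concat(parts, ignore_index=True, sort=False)
outpatient = outpatient.sort_values("member_id")               # assumed patient identifier
```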

Compression and Encryption

Data can be huge, and compression can certainly be helpful in reducing its overall size. Sometimes, for additional data privacy, the raw data will also be encrypted. This means we might have over 100 files that require decompression and/or decryption. Often the process can be automated, but occasionally the data provider requires us to use their decompression program. In that case a human has to babysit the decryption of all of the files, entering the same password for each file before decrypting it. It can take days to fully decrypt the files when a person has to constantly monitor the process.
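
When the archives are ordinary password-protected ZIP files, the babysitting can be scripted: the sketch below asks for the password once and applies it to every archive. The directory names are placeholders, it only helps when the provider’s own tool is not mandatory, and the standard-library zipfile module handles only legacy ZIP encryption, not AES:

```python
import getpass
import zipfile
from pathlib import Path

password = getpass.getpass("Archive password: ").encode()

for archive in sorted(Path("raw").glob("*.zip")):        # placeholder input directory
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(path="decrypted", pwd=password)    # legacy ZIP encryption only
        print(f"decrypted {archive.name}")
```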

There Is Hope

In the era of open-source software and big data, the ingestion of raw data should not be tied to any proprietary software package (like SAS or Stata). In a later post, we will explore data formats that are well-suited for transferring large amounts of data.