We’ve already established that data is dirty and now we’d like to explain why data documentation is another challenging aspect of working with raw data.
Data Documentation is Anemic
For today’s discussion, we are focusing on basic documentation. There is a lot of different documentation that can come with data, but we are focused on the bare-bones documentation that should come with every data set. At a minimum, for data documentation to be useful, the following should be provided:
For each file:
- The scope of the file (e.g., is it limited to a year or to a specific type of utilization like prescription drugs)
- Row count for the data for each file
For each column in a file:
- An explanation about what the column represents
- Data type (integer, string, etc)
- Position in the data (especially if data is fixed width)
- Expected length of the column (especially if data is fixed width)
- Possible values that appear in the column (especially if something like sex, race, state, etc)
- along with what those values mean
- e.g. 01 = New Mexico, 02 = California, etc.
- along with what those values mean
- Sentinel values for the column
- e.g. 99 = unknown
- An explanation for why that column might be completely blank
- An explanation for why that column might have any blanks
- An explanation of how data is linked within and across tables (e.g., person identifiers, claim identifiers, record identifiers, etc.)
Rarely, if ever, do we get documentation that contains all this information. Instead, we are forced to spend hours doing data exploration on the incoming data. We end up reaching out to the data vendor to ask for clarification about what we find in the data. This is a waste of our time and theirs.
Why We Need This Information
If we were provided this information, particularly if it is in a machine-readable table, we could do many wonderful things such as:
- Gather all the positions and widths of columns to automate reading in fixed-width data files
- Ensure that every file does indeed contain the correct columns
- Gather all the possible values for a column and:
- Ensure only those values appear in the column
- Give guidance to data users about what each value in a column means
- Create a set of RDBMS enumerations for to enforce data integrity
- Gather all the descriptions of columns and
- Display that in a user interface to make easier to work with the data
- Turn them into comments and apply them to each column inside an RDBMS
- Gather all the widths of columns and make sure we have allocated the correct space in our data structure
- Change sentinel values into proper NULLs or other kinds of values within our data structure
- Possibly derive all foreign keys in the data and ensure the referential integrity of that data
This is just small list of the things we could do if data documentation was given to me in a machine-readable format.
How We Want Documentation
Notice that we used the term “machine-readable”. We want a simple spreadsheet, or, better yet, a set of tables with all this information. This would be the best way to send over this critical, useful information.
Instead, the documentation is almost always provided as a PDF, and sometimes as a Microsoft Word document. It always seems to be maintained by hand. Even the nicest, neatest documentation we’ve seen has been maintained by a human. This implies the data vendor has no underlying database of information about their own data.
Admittedly, these documents are sometimes very handsomely maintained and look good to a human reader, but it is locking all that critically important information away. It is rendering the documentation unusable for so many great things other than just reading.
Side Note: SAS Input Files Should Not Be Documentation
I’ve sometimes seen vendors use SAS Input files as documentation about their data. This is not acceptable. SAS Input files require the SAS execution environment, a proprietary and expensive system, to use. In fact, I’ve written my own library that attempts to process these files without SAS. I’ve also seen plenty of examples where these files are wrong and incorrectly define column widths, positions, etc., meaning the data vendor didn’t bother to check that the SAS Input file was correct before sending it out as documentation.
Data Vendors Are Wasting Our Time
We spend a lot of time parsing documentation so that we can glean some nuggets of information about the data to facilitate its use. I’ve even written an as-yet-to-be-released tool that helps me parse documentation into usable bits of information. But having to write our own tools to parse data documentation means that data vendors are wasting our time.
Data Vendors Are Wasting Their Time
You know what’s a huge waste of time? Painstakingly writing, formatting, and maintaining a document when it can automatically generated from the data.
If data vendors maintained a database of information about their columns, they could generate those pretty documents that they should include with their data. They could use the information in that database to do the same things I’m looking to do: verify the integrity of their data.
Instead, they have someone toiling away in Word, cutting and pasting whole pages including tables – something that rarely works out nicely – and replacing text with other text. This is error prone, tedious, and a waste of someone’s time. And there is no way to verify that the data described in the document is the data that is actually provided.
Humans Shouldn’t Write the Bulk of Documentation
As someone who has spent their life automating systems through writing computer programs, I’m not, in general, a fan of people doing repetitive tasks. Why teach a person to do a thing when I can write a program to do that thing and know I’ll get consistent, predictable results every time.
Data vendors don’t seem to share my enthusiasm for automation, and that is made clear in their data documentation. However, there are considerable advantages for them and their clients, if they maintain their documentation in a machine-readable fashion.
Help Us Help You
Another thing that amazes us is that, often, a data vendor’s documentation is not made freely available on their website. They only provide it as part of a purchased data set.
No one is served by keeping data documentation secret. If more data documentation were made freely available, then we all would be able to share and improve the documentation. Crowd-sourcing is very powerful. And hiding the documentation doesn’t protect any meaningful intellectual property. We all know what is in these data sources already because we have worked with the data before or we have worked with very similar data. Keeping database schemas and their documentation secret serves no one and limits innovation around things like ETL tools and common data models.
PS: Don’t Forget That More Information Is Better
Above the discussion was focused on the “bare bones” documentation for the data itself. However, we also need an overview of the data to make sure we know what data we are working with. This includes, but is not limited to, the following:
- The data source(s) that were included
- The date it was created (or the version of the data)
- The years that are covered by the data
- The selection criteria/process that were used to cut the data
- Any other information that will let us know what we have and what we do not have (e.g., insurance coverage types, scope of prescription data, fields that indicate something about data completeness or quality, etc.)