Inside Data Models

Data models can be useful for epidemiologists, outcomes researchers, and health economists working with real-world data. However, little has been written by way of a basic introduction. This is the first in a series of posts covering three topics: standardization, harmonization, and organization.


By way of full disclosure, the content on this blog reflects our experience working with data models and cohort-building software over the last 10 years. We use these tools to conduct research in pharmacoepidemiology, descriptive epidemiology, and health services research (including health economics). Although we work across many areas of medicine, we do a fair bit of work in oncology using cancer registry data from the SEER program. Therefore, to be useful to us, a data model must support all these research areas.

In 2018, no data model supported all of our use cases. The main limitations were an inability to store oncology-specific information accurately and inadequate support for cost data. In response, we created the Generalized Data Model (GDM) to meet these requirements. Since then, some data models have developed plans to better support oncology and/or economic use cases, though none yet satisfies all our requirements.

This post covers the OMOP, PCORnet, and Sentinel common data models as well as GDM. The i2b2 data model, among others, is not included. Unless stated otherwise, PCORnet and Sentinel can be treated as largely interchangeable in their underlying design philosophy, although the models themselves differ.


In this blog post, “standardization” refers to the medical coding systems (“vocabularies”) used for research. “Harmonization” is a related idea, the resolution of differences among standards, and will be covered in a separate post. Data models themselves are structural standards for how to store data, which will also be covered in a separate post.

In medical data, standardization is simply a way of referring to each unique piece of clinical information in a unique way. As a simple, non-medical example, the US Postal Service has standardized state names by using a 2-character abbreviation (e.g., CA, PA, NM). While people could use the full state name or other variations, adhering to this standard makes it easier for the Postal Service to do its job.

In the United States, medical data uses a few common vocabularies to store medical constructs (also called “concepts”). These include SNOMED CT, ICD-9-CM, and ICD-10-CM codes for diagnoses; ICD-9-CM procedure, ICD-10-PCS, CPT, and HCPCS codes for procedures; and NDC and RxNorm codes for drugs.

What makes them “standards” is that each of these vocabularies has a unique numeric or alphanumeric code for each unique medical construct it describes. Some codes represent “catch-all” constructs like “hypertension, not otherwise specified”; in a perfect world, however, every medical concept would map to exactly one code. The reality is that it is impossible to enumerate every variation of every clinical condition. For this reason, SNOMED CT is unique in allowing its codes to be combined to create more complex clinical expressions (also known as “post-coordination”).

Unfortunately, these codes are indecipherable to humans without the associated description fields, which are almost never supplied with the raw data that researchers use. However, most coding systems are organized into groups of related codes. Within these groups, hierarchical relationships among the codes make them easier for humans to navigate using the description fields.
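To make the idea of a hierarchy concrete, here is a minimal sketch in Python. It assumes a dotted, ICD-style code in which each shorter prefix is a broader grouping; that assumption does not hold for every vocabulary (SNOMED CT, for example, stores its relationships separately from the code itself). The function name icd_ancestors is our own invention for illustration.

```python
def icd_ancestors(code: str) -> list[str]:
    """Return the broader groupings of a dotted ICD-style code, nearest first.

    Assumption: each shorter prefix of the code is a broader grouping.
    """
    category, _, detail = code.partition(".")
    ancestors = []
    # Drop one trailing digit at a time from the part after the dot.
    for length in range(len(detail) - 1, 0, -1):
        ancestors.append(f"{category}.{detail[:length]}")
    # The category itself is the broadest grouping of any dotted code.
    if detail:
        ancestors.append(category)
    return ancestors


parents = icd_ancestors("410.11")  # the dotted code used as an example in this post
```

A code with no dot, such as 412, is already at the category level, so the function returns an empty list for it.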

Coding systems are not as clearly defined as one might expect. For example, while people think of ICD-10-CM codes as being “diagnoses”, they also contain information from other domains like procedures, observations (e.g., family history), and medications. This is important when thinking about the organization of a data model.

Sometimes there is a need to standardize other kinds of data, like laboratory data or clinical assessments. The LOINC vocabulary has a large collection of codes to facilitate the recording of this data. However, it can be challenging to distinguish among various LOINC codes, particularly when there are many ways to measure something. For example, a search of “blood pressure” in LOINC returns over 500 options including “Diastolic blood pressure 12 hour mean”, “Aorta Systolic blood pressure”, and “Systolic blood pressure–during anesthesia”. In some cases, LOINC will use one code for the measurement itself (e.g., “Choriogonadotropin [pregnancy test]”) and other codes for the results (e.g., “Positive” and “Negative”). This latter situation can be thought of as a “question” code with associated “answer” codes.
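The “question” and “answer” pattern can be sketched as a small data structure. This is only an illustration: the identifiers below (QUESTION-HCG, ANSWER-POS, ANSWER-NEG) are placeholders we made up, not real LOINC codes, and the class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QuestionCode:
    code: str                # placeholder identifier, NOT a real LOINC code
    description: str
    answer_codes: frozenset  # the answer codes allowed for this question


# Placeholder answer codes and their descriptions.
ANSWER_DESCRIPTIONS = {
    "ANSWER-POS": "Positive",
    "ANSWER-NEG": "Negative",
}

PREGNANCY_TEST = QuestionCode(
    code="QUESTION-HCG",
    description="Choriogonadotropin [pregnancy test]",
    answer_codes=frozenset(ANSWER_DESCRIPTIONS),
)


def record_result(question: QuestionCode, answer_code: str) -> tuple:
    """Pair a question code with one of its allowed answer codes."""
    if answer_code not in question.answer_codes:
        raise ValueError(f"{answer_code!r} is not an answer to {question.code!r}")
    return (question.code, answer_code)


result = record_result(PREGNANCY_TEST, "ANSWER-POS")
```

Storing the result as a pair of codes keeps the measurement and its outcome standardized separately, which is the point of the question/answer arrangement.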

Finally, there are other kinds of data that, in our limited experience, don’t yet readily lend themselves to standardization. These include genetic data, data from wearable devices, and text-based medical notes.

Back To Data Models

At their most basic level, data models are designed to store clinical codes in a helpful way. Apart from the decision about whether to keep the dots that some systems use (e.g., 410.11 vs 41011), storing the codes is straightforward. One feature that deserves mention is that in the OMOP data model, each code in every vocabulary is assigned a unique integer identifier (“concept id”). This eliminates the possibility that a code used in multiple vocabularies might be ambiguous (e.g., 412 is an ICD-9-CM code as well as a DRG code).
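Both points can be illustrated with a short Python sketch: normalizing away the dots, and keying each code by its vocabulary so that identical strings in different vocabularies get different integer identifiers. The ConceptRegistry class and the vocabulary labels are hypothetical, and the integers it hands out are arbitrary. They are not real OMOP concept ids.

```python
from itertools import count


def strip_dots(code: str) -> str:
    """Normalize a code so that '410.11' and '41011' compare equal."""
    return code.replace(".", "").strip().upper()


class ConceptRegistry:
    """Hands out one integer per (vocabulary, code) pair.

    Illustration only: the ids are arbitrary, not real OMOP concept ids.
    """

    def __init__(self) -> None:
        self._ids = {}
        self._next_id = count(1)

    def concept_id(self, vocabulary: str, code: str) -> int:
        key = (vocabulary, strip_dots(code))
        if key not in self._ids:
            self._ids[key] = next(self._next_id)
        return self._ids[key]


registry = ConceptRegistry()
icd9_412 = registry.concept_id("ICD9CM", "412")
drg_412 = registry.concept_id("DRG", "412")
dotted = registry.concept_id("ICD9CM", "410.11")
undotted = registry.concept_id("ICD9CM", "41011")
```

Here the two 412 codes receive different identifiers because they come from different vocabularies, while 410.11 and 41011 resolve to the same identifier within ICD-9-CM.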

One last comment is that some data models include codes that are created specifically for the data model itself and are not based on any externally defined standard coding system. The most common examples are the codes used to capture the provenance of the data (i.e., information about the context of the code in the original data source). One of the most important of these is related to “visits” (or “encounters”). The OMOP, PCORnet, and Sentinel models use codes to indicate the visit type (e.g., “inpatient” or “outpatient”) associated with each diagnosis, procedure, observation, medication, laboratory measure, or other aspect of medical care.

While it may seem simple to define visits and visit types, there are many ambiguous situations: whether a trip to a pharmacy or for a blood draw counts as a visit, whether an emergency department visit that leads to an inpatient hospitalization is an outpatient or an inpatient visit, and whether to include telephone or email interactions. Because there is no standard vocabulary for visits, visit types vary across the data models. We will describe this in more detail when we get to harmonization. For this reason, GDM deliberately avoids defining “visits” and “visit types” and instead uses its structure to capture provenance information. For example, GDM stores the context for each code in a specific “contexts” table and allows codes to share contexts with one another. We will describe this in more detail when we get to organization.
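As a rough preview, the idea of codes sharing a context can be sketched in Python. The class names, field names, and values below are hypothetical and do not reproduce the actual GDM schema; the sketch only shows the shape of the relationship, in which several codes point at one context record instead of each carrying a visit type.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Context:
    context_id: int
    source_file: str   # hypothetical provenance field
    care_setting: str  # kept as recorded in the source, not a standardized visit type


@dataclass(frozen=True)
class ClinicalCode:
    vocabulary: str
    code: str
    context_id: int    # points at the shared Context record


# One context, described once.
contexts = {
    1: Context(1, source_file="inpatient_claims", care_setting="hospital facility claim"),
}

# Two codes from the same source record share that context.
codes = [
    ClinicalCode("ICD9CM", "410.11", context_id=1),
    ClinicalCode("ICD9CM", "412", context_id=1),
]


def codes_sharing_context(context_id: int) -> list:
    """Return every clinical code recorded under the given context."""
    return [c for c in codes if c.context_id == context_id]


shared = codes_sharing_context(1)
```

Because the provenance lives in the context record, nothing in this sketch forces a decision about whether the record was an “inpatient” or “outpatient” visit.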