<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://outcomesinsights.github.io/feed.xml" rel="self" type="application/atom+xml" /><link href="https://outcomesinsights.github.io/" rel="alternate" type="text/html" /><updated>2026-03-06T23:40:59+00:00</updated><id>https://outcomesinsights.github.io/feed.xml</id><title type="html">Jigsaw by Outcomes Insights, Inc.</title><subtitle>Jigsaw is software for creating analysis-ready datasets from healthcare data.  This blog covers topics related to developing software for the generation of  real-world evidence from real-world data.  </subtitle><entry><title type="html">A Better Way to Build Code Sets</title><link href="https://outcomesinsights.github.io/data/2026/03/06/a-better-way-to-build-code-sets.html" rel="alternate" type="text/html" title="A Better Way to Build Code Sets" /><published>2026-03-06T00:00:00+00:00</published><updated>2026-03-06T00:00:00+00:00</updated><id>https://outcomesinsights.github.io/data/2026/03/06/a-better-way-to-build-code-sets</id><content type="html" xml:base="https://outcomesinsights.github.io/data/2026/03/06/a-better-way-to-build-code-sets.html"><![CDATA[<p>If you work with healthcare claims data, you’ve built code sets. Maybe it was a list of ICD-10-CM codes for diabetes, or heart failure, or chronic kidney disease. And if you have, you know that <em>there has to be a better way to do this.</em>  All of the tools that exist are for billing purposes, not for research.</p>

<p>The traditional approach is some combination of clinical knowledge, keyword searches through code descriptions, and peer-reviewed literature. You open a reference table, search for “diabetes,” scroll through hundreds of results, and try to decide which codes belong and which don’t. It’s tedious, it’s error-prone, and two analysts working on the same condition can produce different code sets.  That divergence matters — it can affect who ends up in the study cohort and the inferences drawn from the results.</p>

<h2 id="the-problem-is-bigger-than-it-looks">The Problem Is Bigger Than It Looks</h2>

<p>ICD-10-CM contains more than 70,000 codes. They’re organized hierarchically, which helps, but the hierarchy is deep and full of clinical nuance. Take type 2 diabetes: the E11 family alone contains codes for ophthalmic, neurological, circulatory, renal, and dermatological complications, each with multiple subcategories. A keyword search for “kidney” won’t find codes described as “renal” or “nephropathy.” And even when you find the right family, you still have to decide which subcategories are relevant to your specific research question.</p>

<p>There are workarounds — reusing code sets from prior studies, borrowing from published literature, or building from institutional templates. They work, but they’re hard to audit, hard to reproduce, and tend to drift over time as analysts make small, undocumented adjustments.</p>

<h2 id="semantic-search-changes-the-game">Semantic Search Changes the Game</h2>

<p>We’ve been building a tool that takes a fundamentally different approach. Instead of searching code <em>descriptions</em> with keywords, it searches code <em>meaning</em> with natural language.</p>

<p>The system encodes every ICD-10-CM code as a high-dimensional vector — a mathematical representation of its clinical meaning, informed by the code’s description, its position in the ICD-10 hierarchy, and LLM-generated clinical context. A query like “type 2 diabetes with kidney complications” returns codes that are <em>semantically close</em> to the intended meaning, not just codes that happen to contain matching words.</p>
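<p>The retrieval step can be illustrated with a toy example. The embeddings below are made-up three-dimensional values chosen purely for illustration; the real system uses high-dimensional vectors produced by an embedding model, but the ranking logic — cosine similarity between a query vector and each code vector — is the same idea.</p>

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Made-up 3-d embeddings for illustration only; real vectors are
# high-dimensional and come from an embedding model.
code_vectors = {
    "E11.21": [0.9, 0.8, 0.1],  # type 2 diabetes with nephropathy
    "E11.31": [0.9, 0.1, 0.8],  # type 2 diabetes with ophthalmic complication
    "N18.30": [0.2, 0.9, 0.1],  # chronic kidney disease, stage 3, unspecified
}
query = [0.8, 0.7, 0.1]  # "type 2 diabetes with kidney complications"

# Rank codes by semantic closeness to the query.
ranked = sorted(code_vectors, key=lambda c: cosine(query, code_vectors[c]),
                reverse=True)
```

<p>In this toy setup, E11.21 (diabetic nephropathy) ranks first even though its description contains neither “kidney” nor the query’s exact wording — precisely the behavior a keyword search cannot provide.</p>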

<p>This is the same class of technology behind modern search engines and recommendation systems, applied to a very specific and important problem: identifying the right ICD-10-CM codes for a research project.</p>

<h2 id="not-just-search--a-feedback-loop">Not Just Search — a Feedback Loop</h2>

<p>Semantic search is a good starting point, but a starting point isn’t a finished code set. The real value comes from what happens next.</p>

<p>The tool groups results into ICD-10-CM code families defined by their three-character category (e.g., E11), scores each family by relevance (a measure of how close it is to the original query), and identifies a natural cutoff between clearly relevant and probably irrelevant families.  From there, the user reviews individual codes and marks them as include, exclude, or “unsure.”  Each include or exclude decision feeds back into the algorithm through a process called Rocchio relevance feedback: the system adjusts its internal representation of the query based on those choices, pushing away from excluded codes and toward included ones, then re-searches with a refined understanding of what the user is looking for.</p>
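<p>The feedback step follows the classic Rocchio update: the new query vector is the old one, plus a weighted centroid of the included codes, minus a weighted centroid of the excluded codes. The weights below (alpha, beta, gamma) are conventional textbook defaults, not necessarily the parameters our tool uses.</p>

```python
def rocchio(query, included, excluded, alpha=1.0, beta=0.75, gamma=0.15):
    """One round of Rocchio relevance feedback: move the query vector
    toward the centroid of included codes and away from the centroid of
    excluded ones.  alpha/beta/gamma are textbook defaults, used here as
    placeholders for whatever weights the tool actually tunes."""
    new_q = [alpha * q for q in query]
    for i in range(len(query)):
        if included:
            new_q[i] += beta * sum(v[i] for v in included) / len(included)
        if excluded:
            new_q[i] -= gamma * sum(v[i] for v in excluded) / len(excluded)
    return new_q

# Hypothetical 2-d example: one included code, one excluded code.
refined = rocchio([1.0, 0.0], included=[[0.0, 1.0]], excluded=[[1.0, -1.0]])
```

<p>Re-running the similarity search with the refined vector is what makes each review round more precise than the last.</p>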

<p>This creates an iterative refinement loop. With each round, the results get more precise. New families surface that the user might not have considered. Irrelevant ones drop away. The process is reproducible and auditable — a researcher can explain to a reviewer exactly how and why each code ended up in the set.</p>

<p>There is also a graphical view that shows how close the codes are to one another in meaning, giving visual feedback on what is happening.</p>

<h2 id="ai-assistance-where-it-helps">AI Assistance Where It Helps</h2>

<p>We’ve layered optional LLM features on top of this core workflow. Before the user starts reviewing codes, an AI pre-check can classify subcategories as likely relevant, likely irrelevant, or uncertain — giving a head start on the manual review. A query advisor can flag unexpected code families and ask clarifying questions about intent. And an explain feature can break down what a specific code represents and why it might or might not belong.</p>

<p>These features are genuinely optional. The semantic search and Rocchio refinement work without any LLM. But when available, they reduce the cognitive load of sorting through hundreds of subcategories.</p>

<h2 id="harmonization-comparing-code-sets">Harmonization: Comparing Code Sets</h2>

<p>We also built a harmonization tool for a problem that sometimes arises in practice: there are two code sets for the same condition, built by different analysts or drawn from different sources, and there is a need to reconcile them. The tool shows the user what’s in both sets, what’s unique to each, and lets the user build a merged set with full visibility into the differences.</p>
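<p>At its core, the comparison is straightforward set arithmetic. A minimal sketch, using hypothetical code lists:</p>

```python
# Hypothetical dementia code sets from two sources (illustrative only).
quan_2005 = {"F01", "F02", "F03", "G30", "G31.1"}
internal  = {"F01", "F02", "F03", "G30", "G31.0", "G31.83"}

shared        = quan_2005 & internal   # in both sets
only_quan     = quan_2005 - internal   # unique to the first set
only_internal = internal - quan_2005   # unique to the second set
merged        = quan_2005 | internal   # candidate harmonized set
```

<p>The tool’s value is less in the set operations themselves than in surfacing the differences for review, so the merged set reflects deliberate decisions rather than a blind union.</p>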

<h2 id="case-study-dementia-in-the-charlson-comorbidity-index">Case Study: Dementia in the Charlson Comorbidity Index</h2>

<p>To make this concrete, consider a code set that thousands of researchers use routinely: the dementia algorithm from the Charlson Comorbidity Index.</p>

<p>The ICD-10 version of the Charlson index comes from Quan et al. (2005), a carefully conducted study that translated the original ICD-9-CM algorithms into ICD-10 using a multi-step consensus process across research groups in three countries. The dementia algorithm they published includes codes from four ICD-10 families: F00 (dementia in Alzheimer’s disease), F01–F03 (vascular, other, and unspecified dementia), G30 (Alzheimer’s disease), and G31.1 (senile degeneration of brain). It’s a reasonable list — and it’s been cited thousands of times.</p>

<p>But for people working with US claims data coded in ICD-10-CM, this list has problems.</p>

<p>First, <strong>F00 doesn’t exist in ICD-10-CM</strong>. The US clinical modification never adopted that code. In ICD-10-CM, Alzheimer’s disease with documented dementia is typically coded with a G30.- code for the underlying Alzheimer’s disease plus an F02.80/F02.81- code for the dementia manifestation. A researcher who takes the Quan codes at face value and searches for F00 in US claims data will find zero patients — not because there are no Alzheimer’s patients, but because the code doesn’t exist in the system they’re searching.</p>

<p>Second, and more consequentially, the algorithm misses entire categories of dementia that are now explicitly coded in ICD-10-CM:</p>

<ul>
  <li><strong>G31.0x — Frontotemporal dementia</strong>, including Pick’s disease and other frontotemporal variants. This is a clinically significant dementia subtype with its own family of codes.</li>
  <li><strong>G31.83 — Neurocognitive disorder with Lewy bodies</strong>, which captures dementia with Lewy bodies as a distinct neurodegenerative disorder, and was introduced in ICD-10-CM after the Quan paper was published.</li>
</ul>

<p>These aren’t obscure edge cases. Lewy body dementia accounts for an estimated <a href="https://www.cambridge.org/core/journals/canadian-journal-of-neurological-sciences/article/prevalence-and-incidence-of-dementia-with-lewy-bodies-a-systematic-review/5A720B4E79E47546545FCC3B7612A771">3–7% of all dementia cases</a>. Frontotemporal dementia is <a href="https://memory.ucsf.edu/dementia/ftd">a leading cause of dementia in people under age 60</a>. Missing them means missing patients — exactly the kind of systematic gap that introduces bias into observational studies.</p>

<h3 id="what-the-tool-finds">What the Tool Finds</h3>

<p>When we run the query “dementia” through our Code Set Builder, semantic search surfaces exactly the families you’d expect: F01, F02, F03, and G30 — the core of the Quan algorithm. But it also identifies G31, scored highly because codes like G31.0x (frontotemporal dementia) and G31.83 (neurocognitive disorder with Lewy bodies) are semantically close to the query. The tool doesn’t just match the word “dementia” in code descriptions — it understands that these conditions <em>are</em> dementias, even when the description reads “frontotemporal disease” or uses other clinical terminology.</p>

<p>From there, the AI pre-check step can help sort out which G31 subcategories belong (G31.0x, G31.83 — yes; G31.2, spinocerebellar degeneration — probably not), and iterative refinement allows fine-tuning based on the specific research question. The result is a code set that’s more comprehensive than Quan, specifically adapted to ICD-10-CM, and documented through every step.</p>

<p>This isn’t a criticism of Quan et al. — their work was rigorous and remains foundational. The point is that code sets built by manual translation twenty years ago inevitably have gaps, especially when the underlying coding system has continued to evolve. A tool that starts from semantic meaning rather than code-to-code translation can identify what manual processes miss.</p>

<h2 id="why-this-matters">Why This Matters</h2>

<p>Code set construction is foundational work in observational research. It determines who’s in your study population and who’s not. Getting it wrong — in either direction — compromises everything downstream. Yet we’ve been treating it as an artisanal process, relying on individual expertise and ad hoc methods.</p>

<p>This tool doesn’t replace clinical judgment. It augments it with semantic understanding, algorithmic feedback, and AI assistance, producing code sets that are more comprehensive, more precise, and more defensible than what most of us can build by hand.</p>

<p>We’re using it internally at Outcomes Insights, and we’re excited about where it’s headed.  In fact, it has inspired us to start similar projects to identify medications (NDC and HCPCS codes) and procedures (CPT and HCPCS codes) based on disease areas.</p>

<h2 id="preview">Preview</h2>

<p><img src="/images/code-set-builder-dementia.jpg" alt="Code Set Builder showing dementia search results with G31 family expanded" /></p>]]></content><author><name>Mark Danese</name></author><category term="data" /><summary type="html"><![CDATA[If you work with healthcare claims data, you’ve built code sets. Maybe it was a list of ICD-10-CM codes for diabetes, or heart failure, or chronic kidney disease. And if you have, you know that there has to be a better way to do this. All of the tools that exist are for billing purposes, not for research.]]></summary></entry><entry><title type="html">Observational Research in a Box</title><link href="https://outcomesinsights.github.io/data/2025/04/07/observational-research-in-a-box.html" rel="alternate" type="text/html" title="Observational Research in a Box" /><published>2025-04-07T00:00:00+00:00</published><updated>2025-04-07T00:00:00+00:00</updated><id>https://outcomesinsights.github.io/data/2025/04/07/observational-research-in-a-box</id><content type="html" xml:base="https://outcomesinsights.github.io/data/2025/04/07/observational-research-in-a-box.html"><![CDATA[<p><img src="/images/research_in_box.png" alt="Research in box graphic" /></p>

<p>Welcome to the new way to access data – the data enclave.  This is a private data platform, designed and controlled by the data provider, on which researchers must do all their work.  In this post we share some initial impressions from our experience with three separate data enclaves, and how we are moving forward in this new world.</p>

<h3 id="high-level-observations">High-level observations</h3>
<p>One project is based on the Medicare Virtual Research Data Center (VRDC) and the other two are based on commercial data enclaves.  The good news is that R and Python are available in all three.  The bad news is that SAS, Stata, and RStudio are not universally available.  Also, one provider offers Databricks, one uses Snowflake, and the last uses Athena - three different underlying data platforms.</p>

<p>The data documentation has been better than the documentation from providers who allow us to have the data on our servers.  However, no provider supplies sufficiently detailed information about each column (variable) in the data.  We still have to do a lot of data discovery on the raw data before we can use it (e.g., identifying missing data, discovering unexpected values, and determining relationships among the tables).</p>

<p>For privacy reasons, all three enclaves have tight restrictions on what can be taken off the system.  This makes it challenging for the research team to review the data, and adds a delay since all output has to be reviewed.</p>

<h3 id="opportunity">Opportunity</h3>
<p>As many know, we have our Jigsaw software which can generate the SQL to create analysis-ready datasets from data in a common data model.  We can generate SQL in most dialects used by data platforms, so we can readily adapt our research process to data enclaves.</p>

<p>But before we can construct cohorts using Jigsaw, we need to organize the data into a data model.  This process is time-consuming even on an on-premises system.  To allow us to support multiple data platforms in a reasonable timeframe, we started building tools to shorten the process from 6-12 weeks to something more like 1-3 weeks.</p>

<p>Side note:  We’d like to say we can do it in hours, but we are being realistic given the heterogeneity of data, the platform architectures, and the time it takes to process large amounts of data, even with powerful hardware.  The good news is that once the code is written for a data source, it is substantially faster to process another instance of the same data.</p>

<h3 id="progress">Progress</h3>
<p>We have just completed our first version of tools to facilitate the process.  At a high level these tools do the following:</p>

<ol>
  <li>Systematically explore every table and column in the raw data and characterize it in a Shiny app.
    <ul>
      <li>This includes missing values, ranges, typical values, unusual values (which are often data errors), and the identifiers (keys) that are used to connect records in different tables.</li>
      <li>We also incorporate variable definitions from the data dictionary and show all relevant information in one interface.</li>
    </ul>
  </li>
  <li>
    <p>Process the output of step 1 to create the mapping details required to move the raw data into the Generalized Data Model.  This includes automating as much of the process as possible, and conducting logic checks to ensure that the mappings make sense.</p>
  </li>
  <li>Use the mappings in step 2 to automatically generate an SQL script to implement the complete transformation process on the target data platform.</li>
</ol>
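<p>As an illustration of step 1, the per-column characterization can be sketched in a few lines. This uses plain Python for portability; the actual tooling is written in R (with a Shiny interface) and runs against the enclave’s database, so treat this purely as a sketch of the idea.</p>

```python
def profile_column(values):
    """Characterize one column: counts, missingness, distinct values,
    and a numeric range when the column is numeric.  Plain-Python sketch
    of the profiling idea, not the production R tooling."""
    present = [v for v in values if v is not None]
    profile = {
        "n": len(values),
        "n_missing": len(values) - len(present),
        "n_distinct": len(set(present)),
    }
    if present and all(isinstance(v, (int, float)) for v in present):
        profile["min"] = min(present)
        profile["max"] = max(present)
    return profile

# Example: a raw claims column with one missing value.
summary = profile_column([1, 2, 2, None])
```

<p>Profiles like this, joined with data-dictionary definitions, are what feed the mapping logic in steps 2 and 3.</p>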

<p>Importantly, these tools are being built in R as much as possible so they can be brought into any data enclave.</p>

<h3 id="why-go-through-all-of-this">Why go through all of this?</h3>
<p>In terms of working on data enclaves, the following are some of the key benefits of a set of flexible tools for data management and cohort building:</p>

<ol>
  <li>The raw data stays on the data platform at all times, as required by data providers.</li>
  <li>The data provider’s schema remains private, as required by some data providers.</li>
  <li>The raw data is fully documented and can be compared to the reorganized data as part of a quality control process.  This is important for research done for regulatory purposes, as well as being a good idea generally.</li>
</ol>

<h3 id="implications">Implications</h3>
<p>The most important implication is that we can use the same process for every project.  We can organize the data into one data model.  We can use one library of algorithms and one protocol builder for all research (Jigsaw).  And because Jigsaw has its own data model for analysis-ready data, we can write efficient analysis code and standardize our analyses.  In other words, we have the same path from start to finish regardless of whether the data is on premises or in a data enclave.</p>

<p>Another benefit is that all this infrastructure can be made open-source and available to anybody to use – competitors, data providers, commercial organizations, government researchers, etc.  We are not ready for that yet, but that is where we are headed.  If you want to know more, or to collaborate with us, don’t hesitate to contact us. </p>]]></content><author><name>Mark Danese</name></author><category term="data" /><summary type="html"><![CDATA[]]></summary></entry><entry><title type="html">The Miracle of Data</title><link href="https://outcomesinsights.github.io/data/2024/08/10/the_miracle-of-data.html" rel="alternate" type="text/html" title="The Miracle of Data" /><published>2024-08-10T00:00:00+00:00</published><updated>2024-08-10T00:00:00+00:00</updated><id>https://outcomesinsights.github.io/data/2024/08/10/the_miracle-of-data</id><content type="html" xml:base="https://outcomesinsights.github.io/data/2024/08/10/the_miracle-of-data.html"><![CDATA[<h3 id="raw-data-is-impossible-in-its-native-format">Raw Data is Impossible in its Native Format</h3>

<p>Research using observational data (“real-world data”) has grown substantially over the last decade.  New data sources have become available, and regulatory agencies have increasingly supported real-world evidence as part of their decision-making.  However, the raw data available from data providers is virtually impossible to analyze in its original form.  As a result, the enormous challenge of reviewing and manipulating the raw data creates a substantial barrier to conducting quality observational research.</p>

<h3 id="the-real-miracle-in-research">The Real Miracle in Research</h3>
<p>Borrowing a thought from <a href="https://www.researchgate.net/figure/Then-a-Miracle-Occurs-Copyrighted-artwork-by-Sydney-Harris-Inc-All-materials-used-with_fig2_302632920">Sydney Harris</a>, the figure below is how most researchers view the process of structuring their data for research.</p>

<p><img src="/images/a_data_miracle_occurs.png" alt="Data miracle graphic" /></p>

<h3 id="breaking-it-down">Breaking it Down</h3>
<p>At a simple level, the manipulation of data can be broken into two tasks.  The first is to organize the raw data by cataloging its contents and the relationships among the tables, examining it for completeness and correctness, and mapping it to a structure suitable to research.  (For regulatory purposes, the organization process itself needs to be documented as well.)  The second task is identifying the relevant cohort to study, extracting the relevant records, and creating analysis-ready datasets.  Once this two-step data pipeline is implemented, researchers can then conduct the analyses of interest.</p>

<h3 id="the-starting-point">The Starting Point</h3>
<p>For the first task, the only option for most researchers is to create an ad hoc data structure specific to the research project at hand.  While adequate for one-off studies, this approach is prone to error, difficult to modify, and impossible to scale.  More flexible approaches to data organization involve either writing a query translation layer or moving data into a common data model structure.  Both approaches map the original data structure into a new, more efficient structure, but the common data model approach can be manipulated more efficiently, reliably, and repeatably by modern database platforms.</p>

<h3 id="the-problem-and-the-solution">The Problem and the Solution</h3>
<p>This highlights the fundamental problem for all researchers:  there are no useful off-the-shelf, open-source toolkits that help researchers process raw healthcare data into a usable form for research. Hence, we are working to build such a toolkit and to make it available for others.</p>]]></content><author><name>Mark Danese</name></author><category term="data" /><summary type="html"><![CDATA[Raw Data is Impossible in its Native Format]]></summary></entry><entry><title type="html">The Challenge of Working on the VRDC</title><link href="https://outcomesinsights.github.io/data/2024/04/16/the-challenge-of-working-on-the-vrdc.html" rel="alternate" type="text/html" title="The Challenge of Working on the VRDC" /><published>2024-04-16T00:00:00+00:00</published><updated>2024-04-16T00:00:00+00:00</updated><id>https://outcomesinsights.github.io/data/2024/04/16/the-challenge-of-working-on-the-vrdc</id><content type="html" xml:base="https://outcomesinsights.github.io/data/2024/04/16/the-challenge-of-working-on-the-vrdc.html"><![CDATA[<h3 id="what-is-the-problem">What is the Problem?</h3>

<p>With Medicare data moving to the Virtual Research Data Center (VRDC) exclusively, researchers are being forced into a new way of conducting research.  Some have expressed concerns that research could take longer, or that quality control could suffer.  But these concerns simply highlight the long-standing struggle to write effective code for conducting observational research.</p>

<p>Part of the problem is that observational research is an insular process.  To ensure patient privacy, data cannot easily be shared.  This makes collaboration difficult.  And because 60-80% of a research project is simply wrangling raw data to create analysis-ready datasets, data management is labor-intensive.</p>

<p>To complicate matters further, most data management practices are still stuck in the technology of the 1990s.  One need look no further than the ubiquity of SAS or Stata for large-scale data management, the minimal use of relational databases and SQL, and the use of double data programming instead of modern approaches to code testing and version control.  Lastly, many researchers conflate cohort creation with data analysis by trying to use the same software and coding practices for both parts of a project.</p>

<p>In short, the limited adoption of appropriate data management practices has led to a dead-end in the era of data enclaves like the VRDC.</p>

<h3 id="where-do-we-go">Where Do We Go?</h3>

<p>With the advent of better tools for working with raw data, observational research needs to take advantage of fit-for-purpose technologies and apply them in appropriate ways.  Standard data manipulation languages (e.g., SQL), relational databases (e.g., PostgreSQL), and data platforms (e.g., Databricks) make it efficient to extract analysis-ready datasets from terabytes of raw data.</p>

<p>Although tools exist to improve the efficiency of observational research, the solution is not as simple as changing software.  The entire process needs to be redefined.  And therein lies the primary issue – most observational researchers are not well-versed in the software development best practices that are required to move the field forward.  Even more recent approaches to rethinking observational research (e.g., OMOP/OHDSI, FDA Sentinel, PCORnet, i2b2, etc.) still fail to address many of these challenges properly.</p>

<h3 id="software-development--really">Software Development – Really?</h3>

<p>By necessity, observational researchers dabble in software development as a byproduct of writing the code for a project.  But code development is limited by the narrow scope of the project at hand.  The iterative process of adapting code for each new project and evaluating the output to ensure correctness is second-nature to researchers.  But the prospect of working on the VRDC exposes the inefficiencies and limitations of this approach.</p>

<p>The root cause is that researchers are increasingly out of their element when using modern software (e.g., Databricks on the VRDC).  Writing reusable and testable code for modern data platforms requires an understanding of software libraries, programming paradigms, and testing frameworks.  However, the fundamentals of these tools are not part of the training for most researchers and statistical programmers.</p>

<p>To move research forward, researchers need to reconsider their approach and adopt a software development mindset.  Viewed through this lens, cohort-building can be reduced to a set of repetitive tasks that are identical across different observational study designs.  In other words, researchers need to think in terms of building a “data pipeline”.</p>

<p>A data pipeline requires clear specifications for the inputs and outputs for each part of the process.  There are many benefits to this approach.  Importantly, this includes creating specifications for organizing the raw data.  This can be accomplished in a variety of ways including using data models, database views, or other approaches.  The key point related to the VRDC is that, once the structure of the raw data is defined, there is no need to touch the data to specify how to build the cohort and to create the analysis-ready datasets.</p>

<p>By analogy, it is like using Google or Apple Maps.  Once we have a database of GPS coordinates for all roads and addresses, and an understanding of basic driving rules, we can write software to create detailed, optimized directions without ever having to get in the car.</p>

<h3 id="how-will-that-help-with-platforms-like-the-vrdc">How Will That Help with Platforms like the VRDC?</h3>

<p>Working on the VRDC means that researchers need to develop their cohort-creation code outside the VRDC (see <a href="https://jigsaw.io/data/2024/01/21/bringing-code-to-the-data.html">this blog post</a>).  Once researchers can generate the code to create their analysis-ready datasets without needing simultaneous access to the raw data, working on an enclave like the VRDC becomes easier, faster, and cheaper.  In fact, this kind of “offline” approach enables researchers to work on data in a consistent fashion anywhere it is stored – a powerful idea.</p>

<h3 id="that-is-not-realistic-or-is-it">That Is Not Realistic. Or Is It?</h3>

<p>At this point it would be reasonable to say, “but surely everyone can’t build their own validated software stack for the few studies they might do in a year; that is neither possible nor efficient.”  I can tell you that, as a very small company, we have built exactly this system so we know it is possible.  I can tell you that, having learned how to work “offline”, we are excited to work on platforms like the VRDC.  And I can tell you that, as I write this, we are working on some VRDC-related demonstration projects to prove that this approach works.  So, yes, it can be done.  But I can also tell you that it involved a lot of hard lessons and about 10 years of work.</p>

<h3 id="thats-nice--but-how-does-that-help-anyone-else">That’s Nice.  But How Does that Help Anyone Else?</h3>

<p>One of our core principles is to share our work with others.  The key components of our research process have been publicly available via GitHub since they were created.  This includes the <a href="https://github.com/outcomesinsights/generalized_data_model">Generalized Data Model</a>, which defines how we organize raw data, and <a href="https://github.com/outcomesinsights/conceptql">ConceptQL</a>, the language we built to create, store, and share algorithms.  We also made our software, <a href="https://public.jigsaw.io">Jigsaw</a>, and our algorithm library publicly readable as of last year (see <a href="https://jigsaw.io/algorithms/2023/07/11/algorithm-library.html">this blog post</a>).  And we are actively searching for a way to make Jigsaw usable by others to solve problems like working on the VRDC (see <a href="https://jigsaw.io/data/2024/01/30/moving-jigsaw-to-the-real-world.html">this blog post</a>).</p>

<h3 id="conclusion">Conclusion</h3>

<p>Observational researchers need to think outside the box that has enclosed observational research for the last 40 years.  The tools are available, and we have laid much of the groundwork.  We will continue to share our progress and help others conduct quality research in an evolving data landscape.  If you have any questions, or if you want to collaborate, feel free to contact us.</p>]]></content><author><name>Mark Danese</name></author><category term="data" /><summary type="html"><![CDATA[What is the Problem?]]></summary></entry><entry><title type="html">Moving Jigsaw to the Real World</title><link href="https://outcomesinsights.github.io/data/2024/01/30/moving-jigsaw-to-the-real-world.html" rel="alternate" type="text/html" title="Moving Jigsaw to the Real World" /><published>2024-01-30T00:00:00+00:00</published><updated>2024-01-30T00:00:00+00:00</updated><id>https://outcomesinsights.github.io/data/2024/01/30/moving-jigsaw-to-the-real-world</id><content type="html" xml:base="https://outcomesinsights.github.io/data/2024/01/30/moving-jigsaw-to-the-real-world.html"><![CDATA[<h1 id="moving-jigsaw-to-the-real-world">Moving Jigsaw to the Real World</h1>

<p>Last year <a href="https://jigsaw.io/algorithms/2023/07/11/algorithm-library.html">we made our algorithm library</a> <a href="https://public.jigsaw.io">publicly available</a>.  In 2024, we are working to make our entire Jigsaw application freely accessible for creating and sharing protocols, creating and sharing algorithms, and generating the code to create analysis-ready datasets from observational data.</p>

<h3 id="doesnt-this-exist-already">Doesn’t This Exist Already?</h3>

<p>Solutions for streamlining observational research fall into one of the following approaches:</p>

<ul>
  <li>Building an internal repository of implementation code</li>
  <li>Implementing and supporting an open-source software system internally</li>
  <li>Licensing access to a commercial platform</li>
</ul>

<p>However, these options also have important limitations:</p>

<ul>
  <li>In-house solutions are generally software-specific and have limited version control</li>
  <li>Open-source platforms require staff for installation, support, and updates</li>
  <li>Commercial platforms can be expensive black boxes with little visibility into the underlying processes</li>
  <li>No solution allows researchers to collaborate across institutions that use different approaches.</li>
</ul>

<p>In short, despite improvements in software capabilities, researchers still reside in their own silos, unable to collaborate efficiently with one another.</p>

<h3 id="what-is-the-alternative">What is the Alternative?</h3>

<p>The ideal solution is a freely accessible, cloud-based software application.</p>

<p>As an analogy, think about GitHub.  For anyone unfamiliar with GitHub, the following is the <a href="https://en.wikipedia.org/wiki/GitHub">one-sentence Wikipedia description</a>:</p>

<blockquote>
  <p>GitHub, Inc. is an AI-powered developer platform that allows developers to create, store, manage and share their code.</p>
</blockquote>

<p>So, imagine if we alter that slightly for Jigsaw to read as follows:</p>

<blockquote>
  <p>Jigsaw is a cloud-based platform that allows observational researchers to create, store, manage and share their research protocols and to generate the code for implementing them against data.</p>
</blockquote>

<h3 id="conclusion">Conclusion</h3>

<p>Despite thousands of hours and millions of dollars being invested in potential solutions, the fundamental problem remains – research methods are still too fragmented.  It is about time we solved this problem.  We think Jigsaw is the solution.</p>]]></content><author><name>Mark Danese</name></author><category term="data" /><summary type="html"><![CDATA[Moving Jigsaw to the Real World]]></summary></entry><entry><title type="html">Bringing Code to the Data</title><link href="https://outcomesinsights.github.io/data/2024/01/21/bringing-code-to-the-data.html" rel="alternate" type="text/html" title="Bringing Code to the Data" /><published>2024-01-21T00:00:00+00:00</published><updated>2024-01-21T00:00:00+00:00</updated><id>https://outcomesinsights.github.io/data/2024/01/21/bringing-code-to-the-data</id><content type="html" xml:base="https://outcomesinsights.github.io/data/2024/01/21/bringing-code-to-the-data.html"><![CDATA[<h2 id="bringing-code-to-the-data">Bringing Code to the Data</h2>

<p>If an organization can store all the data it needs on its own servers, it can choose almost any software package or platform to manipulate the data for research purposes.  But what happens if the organization isn’t allowed to store the data on its own servers?  For example, most people who want to access the full Medicare data must do so within the <a href="https://resdac.org/cms-virtual-research-data-center-vrdc">CMS Virtual Research Data Center (VRDC)</a> using either SAS or <a href="https://www.databricks.com/">Databricks</a>.  Similarly, some countries limit their data to servers only accessible within their own country, and some commercial data providers limit access to their own virtual data centers.  While this can improve data security and reduce patient privacy risks, it creates logistical challenges for researchers.</p>

<h2 id="the-solution">The Solution?</h2>

<p>The solution is to bring project-specific software code to the data behind the firewall.  But how does one write code for a project without touching the actual data?  There are at least two options for solving the problem: synthetic data and code-generation software.</p>

<h3 id="synthetic-data">Synthetic Data</h3>

<p>In theory, researchers can craft their project code against a synthetic, but realistic, copy of the data. This presupposes that someone will create and maintain an accessible, usable, privacy-protecting synthetic dataset for each data source of interest.  Even then, working with large, unwieldy data can be a challenge.  Imagine working with a synthetic version of the full Medicare data – that isn’t a job for anyone without access to suitable computing resources.  All in all, synthetic data represents a possible solution, but it isn’t particularly efficient because it still relies on hand-writing code for manipulating potentially large datasets.</p>

<h3 id="code-generation-software-jigsaw">Code-Generation Software: Jigsaw</h3>

<p>In our opinion, the better option is to use <a href="https://public.jigsaw.io">Jigsaw</a>.  It may not be obvious, but Jigsaw doesn’t need to touch the actual raw data to do its job.  How is that possible?  Jigsaw’s job is to write all the SQL queries for creating an analysis-ready dataset from the raw, organized data.  As long as the data is organized using a known data model, Jigsaw can write a script containing the required SQL queries.  A user can then run the script on a server containing the organized data, and the script can save the analysis-ready data sets on the same server in a location specified by the user.</p>

<p>Because Jigsaw itself can be cloud-based, researchers can collaborate on the specification of the analysis-ready data from anywhere.  The protocol and its algorithms remain publicly available and shareable.  As before, all <a href="https://public.jigsaw.io/algorithms">algorithms</a> are explicitly documented in the protocol summary document that can be generated (and also shared).</p>

<h3 id="or-both">Or Both?</h3>

<p>In some scenarios, it could make sense to use both approaches.  One could use Jigsaw to create an analysis-ready dataset, and then create a synthetic version for developing the analyses themselves.</p>

<h3 id="challenges">Challenges</h3>

<p>In order for this to work, the data on the server needs to be reorganized into a data model.  We strongly prefer the <a href="https://github.com/outcomesinsights/generalized_data_model">Generalized Data Model</a> because it brings together some of the best ideas of the <a href="https://www.i2b2.org/">i2b2</a> and <a href="https://ohdsi.org">OMOP/OHDSI</a> data models.  By this, we are referring to combining the i2b2 idea of storing all clinical information in one central “fact” table with the OMOP/OHDSI vocabularies.</p>

<p>Once a transformation to a data model is created, it is relatively easy to share the code for others to transform their raw data into a data model.  As of early 2024, we have code for the current <a href="https://healthcaredelivery.cancer.gov/seermedicare/">SEER-Medicare</a> linked data.  Next, we will create the transformation code for the Medicare data on the VRDC.  After that, we are open to suggestions.</p>

<h2 id="conclusion">Conclusion</h2>

<p>Working on other data platforms doesn’t have to be challenging.  We think Jigsaw can be a powerful tool for working with observational data wherever it is without compromising data security or patient privacy.</p>]]></content><author><name>Mark Danese</name></author><category term="data" /><summary type="html"><![CDATA[Bringing Code to the Data]]></summary></entry><entry><title type="html">Spark of Genius</title><link href="https://outcomesinsights.github.io/data/2023/10/16/spark-of-genius.html" rel="alternate" type="text/html" title="Spark of Genius" /><published>2023-10-16T00:00:00+00:00</published><updated>2023-10-16T00:00:00+00:00</updated><id>https://outcomesinsights.github.io/data/2023/10/16/spark-of-genius</id><content type="html" xml:base="https://outcomesinsights.github.io/data/2023/10/16/spark-of-genius.html"><![CDATA[<p>File this post under “late to the game”, but I just completed a project where I used <a href="https://spark.apache.org/">Apache Spark</a> for the first time and I’m blown away.  Here’s my experience.</p>

<h2 id="no-cluster-needed">No Cluster Needed</h2>

<p>Perhaps it was my bias from working with <a href="https://impala.apache.org/">Apache Impala</a> a few years back, but I just assumed that Spark was going to need <a href="https://hadoop.apache.org/">Hadoop</a> set up on a cluster of servers.  I didn’t want to spend my time getting all that set up just to play around with Spark, so I never bothered with it before.</p>

<p>Turns out, Spark has a rather robust single-machine setup.  Even better, there’s an R library that took care of all the setup for me.</p>

<h2 id="sparklyr-makes-spark-simple"><code class="language-plaintext highlighter-rouge">sparklyr</code> Makes Spark Simple</h2>

<p>The R package <a href="https://cran.r-project.org/package=sparklyr"><code class="language-plaintext highlighter-rouge">sparklyr</code></a> made my foray into Spark dead simple.  The package happily installed Spark for me and provided functions to easily start and stop a Spark instance from within my R scripts.</p>

<p>Pro tip: by default <code class="language-plaintext highlighter-rouge">sparklyr</code> limits Spark to a single core when it starts up an instance.  You can <a href="">change to multiple cores</a> pretty easily and it makes a world of difference in terms of performance.</p>

<h2 id="dplyr-and-spark-is-a-powerful-combination"><code class="language-plaintext highlighter-rouge">dplyr</code> and Spark Is a Powerful Combination</h2>

<p><code class="language-plaintext highlighter-rouge">sparklyr</code> gave me access to the tables I loaded into Spark.  <a href="https://dplyr.tidyverse.org/"><code class="language-plaintext highlighter-rouge">dplyr</code></a> gave me the ability to manipulate and query those tables via <a href="https://dbplyr.tidyverse.org/"><code class="language-plaintext highlighter-rouge">dbplyr</code></a>.</p>

<p><code class="language-plaintext highlighter-rouge">dplyr</code> is amazing.  Rather than hand-writing <a href="https://spark.apache.org/sql/">Spark SQL</a>, <code class="language-plaintext highlighter-rouge">dplyr</code> provides a set of functions that let me join tables, add where clauses, and manipulate the columns returned from Spark.</p>

<h2 id="great-performance">Great Performance</h2>

<p>My project was to explore replacing an existing part of our data pipeline.  Using Spark, our processing time went from days to hours.</p>

<h2 id="more-spark-in-the-future">More Spark in the Future</h2>

<p>After this successful venture into Spark territory, I’m pretty sure I’ll be employing Spark in future projects.</p>]]></content><author><name>Ryan Duryea</name></author><category term="data" /><summary type="html"><![CDATA[File this post under “late to the game”, but I just completed a project where I used Apache Spark for the first time and I’m blown away. Here’s my experience.]]></summary></entry><entry><title type="html">Parquet Makes My Day</title><link href="https://outcomesinsights.github.io/data/2023/07/31/parquet-makes-my-day.html" rel="alternate" type="text/html" title="Parquet Makes My Day" /><published>2023-07-31T00:00:00+00:00</published><updated>2023-07-31T00:00:00+00:00</updated><id>https://outcomesinsights.github.io/data/2023/07/31/parquet-makes-my-day</id><content type="html" xml:base="https://outcomesinsights.github.io/data/2023/07/31/parquet-makes-my-day.html"><![CDATA[<p>I was introduced to <a href="https://en.wikipedia.org/wiki/Apache_Parquet">Parquet</a> format back in 2015.  At the time, I was tasked with working with an <a href="https://impala.apache.org/">Impala</a>-based system and it was using Parquet to store its data.  My impression was Parquet was some technology built upon <a href="https://en.wikipedia.org/wiki/Apache_Hadoop#HDFS">HDFS</a> and required some sort of distributed, <a href="https://en.wikipedia.org/wiki/Apache_Hadoop">Hadoop</a>-based system to work with it.  That impression was <em>not</em> accurate.</p>

<p>A few years later, the <a href="https://arrow.apache.org/">Arrow</a> package for <a href="https://www.r-project.org/">R</a> came out and it had support for, much to my surprise, Parquet.  Suddenly Parquet seemed to be freed from HDFS and could plunk huge swaths of data into cute little folders directly in my computer’s filesystem.  What a powerful tool I suddenly had.  It was a great alternative to all the other <a href="/data/2023/05/23/data-is-dirty.html">dirty data formats I dealt with in the past</a>.</p>

<p>Since then, we’ve standardized our data pipeline tools on Parquet.  It’s a great format for data storage and transfer.  It has a true standard for storing data and their types, unlike <a href="https://en.wikipedia.org/wiki/Comma-separated_values">CSV</a>.  It compresses data down very nicely, leaving the files smaller than most other proprietary or CSV formats we’ve seen.  It is fast to read and write, making it a great intermediate storage for our data pipelines.  And, thanks to Arrow, all the languages we use in our tools have first-class support for working with Parquet.  Also, shout out to <a href="https://duckdb.org/">DuckDB</a> for being the <a href="https://sqlite.org/index.html">SQLite</a> of Parquet.</p>

<p>This last winter, one of our data vendors actually offered to send us data in Parquet format.  It was <em>amazing</em>.  <strong>AMAZING</strong>.  We downloaded the files from them and within a minute I was able to query the data, get counts of rows, types of columns, and confirm that we had, indeed, received all the records we expected.  It was unlike <a href="/data/2023/05/23/data-is-dirty.html">any other data ingestion experience I’ve ever had</a>.  No unzipping, no CSV tools, no proprietary formats.  These pre-ETL steps were almost completely unnecessary and we could move right into transformation of the data into <a href="https://github.com/outcomesinsights/generalized_data_model">GDM</a>.</p>

<p>When I first encountered it, I never thought I’d be such a fan of Parquet, but I now sincerely hope that it continues to become <em>the</em> standard for transferring claims data between vendors and researchers.</p>]]></content><author><name>Ryan Duryea</name></author><category term="data" /><summary type="html"><![CDATA[I was introduced to Parquet format back in 2015. At the time, I was tasked with working with an Impala-based system and it was using Parquet to store its data. My impression was Parquet was some technology built upon HDFS and required some sort of distributed, Hadoop-based system to work with it. That impression was not accurate.]]></summary></entry><entry><title type="html">Sharing Is a Good Thing</title><link href="https://outcomesinsights.github.io/algorithms/2023/07/21/sharing-is-a-good-thing.html" rel="alternate" type="text/html" title="Sharing Is a Good Thing" /><published>2023-07-21T00:00:00+00:00</published><updated>2023-07-21T00:00:00+00:00</updated><id>https://outcomesinsights.github.io/algorithms/2023/07/21/sharing-is-a-good-thing</id><content type="html" xml:base="https://outcomesinsights.github.io/algorithms/2023/07/21/sharing-is-a-good-thing.html"><![CDATA[<p>There was a lot to discuss in last week’s post on the public version of Jigsaw algorithm library, so I will try to make this one short.  It is about creating and sharing algorithm (and protocol) summaries with Jigsaw.</p>

<h2 id="background">Background</h2>
<p>From inside any algorithm in the library, people can create a stand-alone algorithm summary.  This is a <strong>stand-alone webpage</strong> that can be downloaded and saved like any other document (click the green “Download” button at the top).  Importantly, the webpage also contains a <strong>CSV file</strong> with all algorithm codes.  The CSV file can be accessed via a link at the bottom of the summary.</p>

<p>To create the summary from inside any algorithm in the library, simply click the “Action” button at the top, and select “Create Algorithm Summary”.</p>

<h2 id="two-algorithm-sharing-options">Two Algorithm Sharing Options</h2>

<h3 id="link-to-the-algorithm-within-the-jigsaw-algorithm-library">Link to the algorithm within the Jigsaw algorithm library</h3>

<p>The first option is a link to the algorithm in the Jigsaw library itself.  For example, below is a link to an algorithm in the library:</p>

<p><a href="https://public.jigsaw.io/algorithms/646c44ce-be1a-4f04-b4e6-ecc5d7b44e13">https://public.jigsaw.io/algorithms/646c44ce-be1a-4f04-b4e6-ecc5d7b44e13</a></p>

<p>From that link, people can view algorithm attributes, a full-screen algorithm diagram, and the ConceptQL statement that stores all the algorithm specifications.  People can even generate an SQL implementation of the algorithm from within the diagram by choosing “Source” at the top of the screen.</p>

<h3 id="link-directly-to-the-stand-alone-algorithm-summary">Link directly to the stand-alone algorithm summary</h3>

<p>The second option is a direct link to the stand-alone algorithm summary page.  It is the same URL as the algorithm in the Jigsaw library except it ends in “/summary”.  See the link below for an example:</p>

<p><a href="https://public.jigsaw.io/algorithms/646c44ce-be1a-4f04-b4e6-ecc5d7b44e13/summary">https://public.jigsaw.io/algorithms/646c44ce-be1a-4f04-b4e6-ecc5d7b44e13/summary</a></p>

<h2 id="bonus-points--protocols-can-also-be-shared">Bonus points – Protocols Can Also Be Shared!</h2>

<p>Pretty much everything above applies to protocols too.  Not only can an entire protocol be shared from within Jigsaw’s protocol builder, but a stand-alone summary document can also be created, shared, and downloaded.  The entire protocol specification, including the enrollment criteria, algorithms and how they were defined and used (e.g., inclusion, exclusion, outcome, etc.) can be created as a stand-alone, downloadable web page.  For example, see the link below:</p>

<p><a href="https://public.jigsaw.io/studies/b522827a-0ea3-4d22-add7-849c072a5717/summary">https://public.jigsaw.io/studies/b522827a-0ea3-4d22-add7-849c072a5717/summary</a></p>

<h2 id="why-is-this-useful">Why is this useful?</h2>

<p>The reason we designed these features was to allow for sharing and for documentation purposes.  Imagine being able to see someone else’s algorithms and protocols – the exact steps they followed to create cohorts and analysis data sets.  Government agencies, academic researchers, and commercial companies could share their methods without having to share the actual code or data.</p>

<p>Also, simply having all the detailed specifications in one place allows us to retain them as an appendix to technical reports, and saves time and energy in writing study publications.</p>]]></content><author><name>Mark Danese</name></author><category term="algorithms" /><summary type="html"><![CDATA[There was a lot to discuss in last week’s post on the public version of Jigsaw algorithm library, so I will try to make this one short. It is about creating and sharing algorithm (and protocol) summaries with Jigsaw.]]></summary></entry><entry><title type="html">Jigsaw Algorithm Library is Publicly Available</title><link href="https://outcomesinsights.github.io/algorithms/2023/07/11/algorithm-library.html" rel="alternate" type="text/html" title="Jigsaw Algorithm Library is Publicly Available" /><published>2023-07-11T00:00:00+00:00</published><updated>2023-07-11T00:00:00+00:00</updated><id>https://outcomesinsights.github.io/algorithms/2023/07/11/algorithm-library</id><content type="html" xml:base="https://outcomesinsights.github.io/algorithms/2023/07/11/algorithm-library.html"><![CDATA[<p>We are pleased to announce that we have released our algorithm library to the public at <a href="https://public.jigsaw.io">https://public.jigsaw.io</a>.  This is something we have wanted to do for years, but there was always something else that we wanted to include or change.  In the spirit of “perfect is the enemy of good”, we decided that the algorithm library was finally “good enough” to make it available to others.  It is still a work in progress.  We hope researchers who conduct studies using healthcare data find it useful.</p>

<h3 id="why-put-this-out-there">Why put this out there?</h3>

<p>We have spent many years struggling with the same issues that all researchers face when trying to extract information from healthcare data.  Algorithms are hard to find, poorly documented and difficult to implement consistently.  It is time to solve the problem.</p>

<h3 id="who-is-funding-this">Who is funding this?</h3>

<p>This project is completely self-funded.  Please keep in mind that we have limited resources, and we may not be able to accommodate requests.  It turns out that it is surprisingly challenging to make something publicly and freely available, so we appreciate your understanding.</p>

<h3 id="collaboration">Collaboration</h3>

<p>If any organization is interested in collaborating with us on algorithms, we welcome it – even from organizations that might be considered “competitors”.  Algorithms should be freely available and compatible with systems other than our own.  We are willing to see if we can make them work with other systems, even proprietary ones.</p>

<h3 id="feedback">Feedback</h3>

<p>We welcome suggestions.  We are happy to consider algorithm modifications if an algorithm can be improved in any way.  We would like our algorithms to be usable by anyone. Along these lines, we are willing to consider adding other algorithms to our library, as long as they can be made public.</p>

<h3 id="browser-requirements">Browser requirements</h3>

<p>The algorithm library is designed and tested against Google Chrome because of its popularity.  Other browsers may work, but we don’t provide cross-browser support at this time.</p>

<h3 id="but-wait-theres-more">But wait, there’s more!</h3>

<p>In addition to the high-level details above, we also provide more detailed technical information and limitations below, as well as an example algorithm.</p>

<p><a href="#technical-details">Technical details</a></p>

<p><a href="#limitations">Limitations</a></p>

<p><a href="#example-algorithm">Example algorithm</a></p>

<h2 id="technical-details">Technical Details</h2>

<h3 id="what-is-an-algorithm-library">What is an algorithm library?</h3>

<p>Like most libraries, the Jigsaw algorithm library stores algorithms, facilitates searching for algorithms, and contains tools for creating, editing, and versioning algorithms.  The library enables algorithm sharing in two ways: as a stand-alone HTML file downloaded from the website, or via the algorithm’s unique URL.</p>

<h3 id="how-to-use-the-jigsaw-algorithm-library">How to use the Jigsaw algorithm library</h3>

<p>Anyone interested can simply navigate to the <a href="https://public.jigsaw.io/algorithms">algorithm section of the Jigsaw application</a> to find algorithms.  Users can then search for algorithms by different characteristics.  Clicking on any algorithm’s name opens it, and the user is presented with algorithm attributes as well as a diagram of the algorithm specification.  Users can click on the diagram to get more detailed information about the specifications.</p>

<p>We cover some of the <a href="#what-is-an-algorithm">basics below</a>, and we will provide more granular instructions in the future.</p>

<h3 id="what-is-an-algorithm">What is an algorithm?</h3>

<p>An algorithm has two parts – attributes and specifications.</p>

<p>Algorithm attributes include all of the information that makes algorithms friendly to researchers.  This includes titles, tags, documentation, variable names, and even evidence of validity (e.g., sensitivity and specificity).</p>

<p>Algorithm specifications consist of the code(s) and operator(s) that are needed to identify information for an analysis dataset.  The most basic operation of an algorithm is to “select” records with the code(s) of interest.  This means that the simplest possible algorithm specification is simply to select all records using a single code (see <a href="/gdm/2023/05/08/standardization.html">Standardization</a> for more on codes).  However, some algorithms can have thousands of codes. Algorithms can also require operations related to the context in which the code was used (provenance), have temporal specifications (before, after, within, etc.), require relationships to other codes (co-reported), use filters (first, last, etc.), or specify other operations.</p>

<p>Sometimes algorithm specifications require many operations in sequence.  For these use cases, we create special operators to make them easier to implement.  The best example is an operator that looks for a code that occurs at least 1 time in the inpatient setting or at least 2 times in the outpatient setting within specified time intervals (sometimes referred to as a “one inpatient or two outpatient” operation).  These operators make creating algorithm specifications much simpler.</p>
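<p>As an illustration, the core logic of that operator can be sketched in a few lines of Python (a toy version for a single person’s records, not Jigsaw’s implementation; the record layout is invented for this example):</p>

```python
from datetime import date

def one_ip_or_two_op(records, window_days=365):
    """Toy '1 inpatient or 2 outpatient' check for one person's records.

    records: list of (setting, service_date) tuples.
    Returns True given at least one inpatient record, or at least two
    outpatient records dated within `window_days` of each other.
    """
    if any(setting == "inpatient" for setting, _ in records):
        return True
    op_dates = sorted(d for setting, d in records if setting == "outpatient")
    return any((later - earlier).days <= window_days
               for earlier, later in zip(op_dates, op_dates[1:]))

# A single outpatient claim is not enough; two within a year qualify.
claims = [("outpatient", date(2020, 1, 5)), ("outpatient", date(2020, 6, 1))]
print(one_ip_or_two_op(claims))  # True
```
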

<h3 id="what-does-an-algorithm-specification-produce">What does an algorithm specification produce?</h3>

<p>An algorithm specification is a representation of a table that is created and, potentially, modified.  When implemented against data, every algorithm starts with a selection operator to create the first table based on the codes in the specification.  This table can then be passed to subsequent operators for additional modification.  The output of an algorithm is a table that includes all of the relevant records for each person as modified by the specified operations.  As an example, a specification could first select all records with at least one diabetes code based on a list of diabetes codes, and then filter these records to retain only the first code for each person.</p>
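<p>For example, the “select, then keep each person’s first record” flow described above can be sketched in plain Python (toy rows and field names invented for illustration; real output tables carry more columns):</p>

```python
# Toy records: (person_id, code, start_date).
rows = [
    (1, "E11.9", "2020-03-01"),
    (1, "E11.9", "2019-07-15"),
    (2, "I10",   "2020-01-10"),
    (2, "E11.9", "2020-05-02"),
]

diabetes_codes = {"E11.9"}

# Step 1: the selection operator keeps records matching the code set.
selected = [r for r in rows if r[1] in diabetes_codes]

# Step 2: a "first" filter keeps each person's earliest matching record.
first_per_person = {}
for pid, code, start in sorted(selected, key=lambda r: r[2]):
    first_per_person.setdefault(pid, (pid, code, start))

print(sorted(first_per_person.values()))
# [(1, 'E11.9', '2019-07-15'), (2, 'E11.9', '2020-05-02')]
```
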

<p>The core of each table includes a person id, details about the exact table and record that was returned, the start and end dates, the vocabulary (e.g., ICD-10-CM), and the specific code.  Additional columns are dynamically added for selected types of records (e.g., prescription records will add additional columns specific to prescriptions, like dosing information).</p>

<h3 id="how-algorithms-operate">How algorithms operate</h3>

<p>We have defined all algorithm operations that are commonly used for research using ConceptQL (see <a href="/conceptql/2023/04/04/conceptql-crash-course.html">ConceptQL blog post</a>).  Within the algorithm library, there is a hyperlink from each operator to our <a href="https://github.com/outcomesinsights/conceptql_spec#conceptql-specification">ConceptQL specification in Github</a> that provides a more detailed description of how it functions.</p>

<p>Implicitly, all algorithm operations are conducted by person since almost all research is conducted on cohorts of people.  All algorithms are read left to right, and only one table is ever produced at the end of any operation.  This is true even if multiple tables are used as inputs to an operation.  For algorithms with two sides to them, the left side takes the input of the previous operation and modifies it using information from the right side.  Tables on the right side are never passed through and included in the output.</p>
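<p>As a toy illustration of those left/right semantics (invented record shapes, not ConceptQL itself), a two-sided “before” operator might look like this:</p>

```python
from datetime import date, timedelta

def before(left, right, at_least_days=0):
    """Keep left-side records dated at least `at_least_days` before some
    right-side record for the same person.  Right-side records shape the
    result but are never passed through to the output."""
    return [(pid, d) for pid, d in left
            if any(r_pid == pid and d <= r_d - timedelta(days=at_least_days)
                   for r_pid, r_d in right)]

diabetes = [(1, date(2020, 1, 1)), (2, date(2020, 6, 1))]
mi       = [(1, date(2020, 3, 1))]

# Person 1's diabetes record survives (60 days before their MI);
# person 2 has no MI record, so nothing is kept for them.
print(before(diabetes, mi, at_least_days=30))  # [(1, datetime.date(2020, 1, 1))]
```
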

<p>If the above doesn’t make any sense, don’t worry.  We will provide more details about how algorithms operate in a future post.</p>

<h3 id="visual-representation">Visual representation</h3>

<p>We have found that a visual representation of an algorithm specification is extremely helpful for communicating it to others.  As a result, an algorithm is specified using a visual algorithm builder that generates ConceptQL statements and encapsulates the algorithm specification.  <a href="#algorithm-diagram">See below</a> for an example.</p>

<h3 id="algorithm-scope">Algorithm scope</h3>

<p>Do we have every possible operation that might ever be needed?  Probably not.  Hence, it is possible to extend Jigsaw with new operators as the need arises.  However, we expect the need for new operators to be relatively rare.  Many times, the desired output requires some creativity using existing operations.</p>

<p>Alternatively, it may be that the desired operation is really a <strong>protocol specification</strong>, and not an <strong>algorithm</strong> specification.  Defining the difference between an algorithm specification and a protocol specification precisely is beyond the scope of this blog post.  However, in general, an algorithm specifies the variable of interest, and the protocol defines the context in which the algorithm operates.  The separation between an algorithm and a protocol ensures that an algorithm is as reusable as possible.</p>

<p>To make this a bit more concrete, an algorithm might define a myocardial infarction.  Or it might define diabetes diagnosed at least 30 days before a myocardial infarction.  However, a myocardial infarction within 30 days of the index date, or as an inclusion or exclusion criterion, or as a censoring variable – those are ways of using a myocardial infarction algorithm in the context of a protocol.  The same idea applies with clean periods or lookback periods – those are protocol-specific ideas that are not part of an algorithm’s scope.</p>

<p>As a side note, users may see some algorithms in our library that violate this principle.  This is because our thinking about the line between algorithms and protocols has evolved over the years.  As a result, we have added features to our protocol builder to help keep the distinction as clear as possible.  But since we don’t readily remove algorithms we have used, these algorithms remain in the algorithm library.</p>

<h3 id="it-isnt-clear-how-algorithms-are-used-to-create-analysis-data-sets">It isn’t clear how algorithms are used to create analysis data sets</h3>

<p>Protocols, not algorithms, specify the analysis datasets that need to be created.  Protocols specify the index date, inclusion and exclusion criteria, baseline variables, lookback periods, clean periods, date restrictions, and outcome variables.  Protocols use algorithms to specify each variable within the context of a protocol.  It is much like assembling a jigsaw puzzle (protocol) using pieces (algorithms).  Hence, our software is named “Jigsaw”.</p>

<p>As a simple example, the specification for myocardial infarction can be identical regardless of whether it is needed as a baseline variable, index event, or outcome.  However, different records will be selected depending on the context in which it is used.</p>

<p>To illustrate how algorithms fit into protocols, we have made our entire Jigsaw application available, including our protocol builder, albeit in read-only form.  There is a <a href="https://public.jigsaw.io/studies/b522827a-0ea3-4d22-add7-849c072a5717">demonstration protocol</a> available within the application.</p>

<h2 id="limitations">Limitations</h2>

<h3 id="disclaimer">Disclaimer</h3>

<p>While we built Jigsaw and the algorithms carefully, we make no guarantees of correctness or suitability for a specific study.  Jigsaw is a tool to make research easier and more transparent; however, it still requires that users fully understand all aspects of the research they are conducting, their data source(s), and the limitations of real-world data.</p>

<h3 id="is-every-algorithm-ready-for-research">Is every algorithm ready for research?</h3>

<p>The algorithm library contains over 1,000 algorithms and spans many years of development and growth.  Therefore, not every algorithm is ready to be used in its current state.  Some were created as examples, some were created prior to ICD-10-CM data becoming available, and some were created or updated very recently.  Most algorithms are classified as “Draft” even though they have been extensively used.  Some “Finalized” algorithms should probably be updated.  As mentioned above, this is a work in progress and we only update algorithms as we reuse them.  We suggest looking at the date of the most recent update to get a sense of how current an algorithm is.  Users should do their own due diligence before using any algorithm.</p>

<h3 id="is-an-algorithm-specific-to-a-data-model-or-software-system">Is an algorithm specific to a data model or software system?</h3>

<p>The algorithm library is agnostic to both the data structure and the software used for implementation.  The algorithm simply captures the specifications for an algorithm.  For example, an algorithm can encapsulate the idea that we are looking for a “diabetes diagnosis at least 30 days before a myocardial infarction”.</p>

<p>As such, it is not written in SQL, SAS, R, Python, or any language that can be used directly on data.  Instead, it is specified in JSON using the <a href="https://github.com/outcomesinsights/conceptql_spec#conceptql-specification">ConceptQL language</a>.  Jigsaw uses the specification to dynamically generate SQL and implement the algorithm against data.  In theory, ConceptQL could generate something other than SQL, but practically, SQL is so widely available there has been no reason to consider other options.  This approach allows us to generate SQL that is specific to the idiosyncrasies of different database implementations of SQL.  For example, we can generate SQL specific to PostgreSQL, Impala, SQLite or most other database systems that follow SQL standards.</p>

<p>Furthermore, we can generate SQL that works directly against different data models, like the <a href="https://github.com/outcomesinsights/generalized_data_model">Generalized Data Model (GDM)</a>, <a href="https://ohdsi.github.io/CommonDataModel/index.html">OMOP</a>, <a href="https://pcornet.org/data/">PCORnet</a>, and <a href="https://www.sentinelinitiative.org/methods-data-tools/sentinel-common-data-model">Sentinel</a>.  (Note that support for OMOP, PCORnet, and Sentinel is not currently implemented because we have not had a use case for them.)</p>

<p>To see what the SQL code looks like for an algorithm, users can dynamically generate the necessary SQL from inside the algorithm diagram.  The generated SQL targets the Generalized Data Model and produces PostgreSQL queries.  <a href="#example-algorithm">See below</a> for a complete example of an algorithm that includes the diagram, attributes, ConceptQL JSON statement, and the resulting SQL.</p>

<h3 id="why-not-make-the-entire-application-available">Why not make the entire application available?</h3>

<p>While Jigsaw is copyrighted and remains our intellectual property, our hope is that we can make its full functionality available to other researchers.  To do that, we must make ETL easier first, which is <a href="/data/2023/06/28/etl-is-hell.html">not a trivial issue</a>.  We have made progress on that front as well, and are willing to collaborate with organizations interested in adopting the Generalized Data Model and/or Jigsaw.</p>

<h2 id="example-algorithm">Example Algorithm</h2>

<p>Below is an example algorithm that identifies diabetes diagnoses occurring at least 30 days prior to a myocardial infarction.  Both ICD-9-CM and ICD-10-CM codes are used to identify diabetes and myocardial infarction.  Note that this is an <strong>example</strong> and may not represent the correct codes or a clinically meaningful event.</p>
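<p>The temporal requirement at the heart of this example can be illustrated with a small, self-contained sketch.  This is a toy illustration of the “at least 30 days before” check, not how Jigsaw implements it.</p>

```python
from datetime import date
from operator import ge  # ge(a, b) reads as "a is at least b"

def at_least_days_before(diagnosis_date, event_date, days=30):
    """True when the diagnosis precedes the event by at least `days` days."""
    return ge((event_date - diagnosis_date).days, days)

# Toy dates, not real patient data
qualifies = at_least_days_before(date(2020, 1, 1), date(2020, 3, 1))   # 60 days apart
too_close = at_least_days_before(date(2020, 1, 1), date(2020, 1, 15))  # 14 days apart
```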

<h3 id="algorithm-diagram">Algorithm diagram</h3>

<p><img src="/images/diabetes_30d_before_mi.png" alt="Diagram of the example algorithm: diabetes diagnosis codes required to occur at least 30 days before myocardial infarction codes." /></p>

<h3 id="algorithm-attributes">Algorithm attributes</h3>

<p><img src="/images/diabetes_mi_metadata.png" alt="Attributes for the example algorithm." /></p>

<h3 id="conceptql">ConceptQL</h3>

<p>Below is the ConceptQL statement that corresponds to the diagram above.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[
  "before",
  {
    "left": [
      "union",
      [
        "icd9cm",
        "250.00",
        "250.01",
        "250.02",
        "250.03",
        "250.10",
        "250.11",
        "250.12",
        "250.13",
        "250.20",
        "250.21",
        "250.22",
        "250.23",
        "250.30",
        "250.31",
        "250.32",
        "250.33",
        "250.80",
        "250.81",
        "250.82",
        "250.83",
        "250.90",
        "250.91",
        "250.92",
        "250.93"
      ],
      [
        "icd10cm",
        "E10.10",
        "E10.11",
        "E10.61",
        "E10.61",
        "E10.62",
        "E10.62",
        "E10.62",
        "E10.62",
        "E10.63",
        "E10.63",
        "E10.64",
        "E10.64",
        "E10.65",
        "E10.69",
        "E10.8",
        "E10.9",
        "E11.00",
        "E11.01",
        "E11.10",
        "E11.11",
        "E11.61",
        "E11.61",
        "E11.62",
        "E11.62",
        "E11.62",
        "E11.62",
        "E11.63",
        "E11.63",
        "E11.64",
        "E11.64",
        "E11.65",
        "E11.69",
        "E11.8",
        "E11.9",
        "E13.00",
        "E13.01",
        "E13.10",
        "E13.11",
        "E13.61",
        "E13.61",
        "E13.62",
        "E13.62",
        "E13.62",
        "E13.62",
        "E13.63",
        "E13.63",
        "E13.64",
        "E13.64",
        "E13.65",
        "E13.69",
        "E13.8",
        "E13.9"
      ]
    ],
    "right": [
      "union",
      [
        "icd9cm",
        "410.01",
        "410.11",
        "410.21",
        "410.31",
        "410.41",
        "410.51",
        "410.61",
        "410.71",
        "410.81",
        "410.91"
      ],
      [
        "icd10cm",
        "I21.01",
        "I21.02",
        "I21.09",
        "I21.11",
        "I21.19",
        "I21.21",
        "I21.29",
        "I21.3",
        "I21.4",
        "I21.9",
        "I21.A1",
        "I21.A9",
        "I22.0",
        "I22.1",
        "I22.2",
        "I22.8",
        "I22.9"
      ]
    ],
    "at_least": "30d"
  }
]
</code></pre></div></div>
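<p>Because the statement is plain JSON, it is easy to inspect programmatically.  The sketch below is a hypothetical walker (not part of ConceptQL) that collects the codes for each vocabulary operator from a statement with the same shape as the one above, using a shortened code list for brevity.</p>

```python
import json

# Hypothetical helper: walks a ConceptQL-style nested structure and groups
# codes by vocabulary operator.  The statement is abbreviated for illustration.
statement = json.loads('''
["before", {"left":  ["union", ["icd9cm", "250.00", "250.01"],
                               ["icd10cm", "E11.9"]],
            "right": ["union", ["icd9cm", "410.01"]],
            "at_least": "30d"}]
''')

def collect_codes(node, out):
    """Recursively gather codes from vocabulary operators into `out`."""
    if isinstance(node, list) and node and isinstance(node[0], str):
        if node[0] in ("icd9cm", "icd10cm"):
            out.setdefault(node[0], []).extend(node[1:])
        else:
            for child in node[1:]:
                collect_codes(child, out)
    elif isinstance(node, dict):
        for value in node.values():
            collect_codes(value, out)

codes = {}
collect_codes(statement, codes)
```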

<h3 id="sql">SQL</h3>

<p>The code below is generated from the ConceptQL specification.  This particular SQL creates an output table from data stored in a PostgreSQL relational database that follows the Generalized Data Model schema.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>SELECT
    *
FROM ( WITH "before_22_3_24a2b6fb9b0f37c3735c654d1ffd3cab" AS MATERIALIZED (
        SELECT
            *
        FROM (
            SELECT
                "person_id",
                "criterion_id",
                "criterion_table",
                "criterion_domain",
                "start_date",
                "end_date",
                "source_value",
                "source_vocabulary_id",
                "label"
            FROM (
                SELECT
                    *
                FROM (
                    SELECT
                        *
                    FROM (
                        SELECT
                            *
                        FROM (
                            WITH "union_22_2_64ac7676c653ae02d50931398ee95903" AS MATERIALIZED (
                                SELECT
                                    *
                                FROM (
                                    SELECT
                                        "person_id",
                                        "criterion_id",
                                        "criterion_table",
                                        "criterion_domain",
                                        "start_date",
                                        "end_date",
                                        "source_value",
                                        "source_vocabulary_id",
                                        "label"
                                    FROM (
                                        SELECT
                                            "person_id",
                                            "criterion_id",
                                            "criterion_table",
                                            "criterion_domain",
                                            "start_date",
                                            "end_date",
                                            "source_value",
                                            "source_vocabulary_id",
                                            "label"
                                        FROM (
                                            SELECT
                                                *
                                            FROM (
                                                SELECT
                                                    "patient_id" AS "person_id",
                                                    "id" AS "criterion_id",
                                                    CAST(
                                                        'clinical_codes' AS text
) AS "criterion_table",
                                                    CAST(
                                                        'condition_occurrence' AS text
) AS "criterion_domain",
                                                    "start_date",
                                                    "end_date",
                                                    CAST(
                                                        "clinical_code_source_value" AS text
) AS "source_value",
                                                    CAST(
                                                        "clinical_code_vocabulary_id" AS text
) AS "source_vocabulary_id",
                                                    CAST(
                                                        NULL AS text
) AS "label"
                                                FROM
                                                    "clinical_codes"
                                                WHERE ((
                                                        "clinical_code_concept_id" IN (
                                                            SELECT
                                                                "id"
                                                            FROM
                                                                "concepts"
                                                            WHERE ((
                                                                    "vocabulary_id" = 'ICD9CM'
)
                                                                AND (
                                                                    "concept_code" IN (
                                                                        '250.00', '250.01', '250.02', '250.03', '250.10', '250.11', '250.12', '250.13', '250.20', '250.21', '250.22', '250.23', '250.30', '250.31', '250.32', '250.33', '250.80', '250.81', '250.82', '250.83', '250.90', '250.91', '250.92', '250.93'
)
)
)
)
)
                                                    OR (
                                                        "clinical_code_concept_id" IN (
                                                            SELECT
                                                                "id"
                                                            FROM
                                                                "concepts"
                                                            WHERE ((
                                                                    "vocabulary_id" = 'ICD10CM'
)
                                                                AND (
                                                                    "concept_code" IN (
                                                                        'E10.10', 'E10.11', 'E10.61', 'E10.61', 'E10.62', 'E10.62', 'E10.62', 'E10.62', 'E10.63', 'E10.63', 'E10.64', 'E10.64', 'E10.65', 'E10.69', 'E10.8', 'E10.9', 'E11.00', 'E11.01', 'E11.10', 'E11.11', 'E11.61', 'E11.61', 'E11.62', 'E11.62', 'E11.62', 'E11.62', 'E11.63', 'E11.63', 'E11.64', 'E11.64', 'E11.65', 'E11.69', 'E11.8', 'E11.9', 'E13.00', 'E13.01', 'E13.10', 'E13.11', 'E13.61', 'E13.61', 'E13.62', 'E13.62', 'E13.62', 'E13.62', 'E13.63', 'E13.63', 'E13.64', 'E13.64', 'E13.65', 'E13.69', 'E13.8', 'E13.9'
)
)
)
)
)
) -- #&lt;ConceptQL::Operator::icd9cm @arguments=["250.00", "250.01", "250.02", "250.03", "250.10", "250.11", "250.12", "250.13", "250.20", "250.21", "250.22", "250.23", "250.30", "250.31", "250.32", "250.33", "250.80", "250.81", "250.82", "250.83", "250.90", "250.91", "250.92", "250.93"]&gt;
) AS "t1"
) AS "t1"
) AS "t1" -- #&lt;ConceptQL::Operators::Union @upstreams=[#&lt;ConceptQL::Operator::icd9cm @arguments=["250.00", "250.01", "250.02", "250.03", "250.10", "250.11", "250.12", "250.13", "250.20", "250.21", "250.22", "250.23", "250.30", "250.31", "250.32", "250.33", "250.80", "250.81", "250.82", "250.83", "250.90", "250.91", "250.92", "250.93"]&gt;, #&lt;ConceptQL::Operator::icd10cm @arguments=["E10.10", "E10.11", "E10.61", "E10.61", "E10.62", "E10.62", "E10.62", "E10.62", "E10.63", "E10.63", "E10.64", "E10.64", "E10.65", "E10.69", "E10.8", "E10.9", "E11.00", "E11.01", "E11.10", "E11.11", "E11.61", "E11.61", "E11.62", "E11.62", "E11.62", "E11.62", "E11.63", "E11.63", "E11.64", "E11.64", "E11.65", "E11.69", "E11.8", "E11.9", "E13.00", "E13.01", "E13.10", "E13.11", "E13.61", "E13.61", "E13.62", "E13.62", "E13.62", "E13.62", "E13.63", "E13.63", "E13.64", "E13.64", "E13.65", "E13.69", "E13.8", "E13.9"]&gt;]&gt;
) AS "t1"
)
                            SELECT
                                *
                            FROM
                                "union_22_2_64ac7676c653ae02d50931398ee95903"
) AS "l"
) AS "l"
                    WHERE (
                        EXISTS (
                            SELECT
                                1
                            FROM (
                                SELECT
                                    *
                                FROM (
                                    SELECT
                                        "person_id",
                                        max(
                                            "start_date"
) AS "start_date"
                                    FROM (
                                        WITH "union_22_1_d3d7bbd424bca0fe3b3f966d4ee80692" AS MATERIALIZED (
                                            SELECT
                                                *
                                            FROM (
                                                SELECT
                                                    "person_id",
                                                    "criterion_id",
                                                    "criterion_table",
                                                    "criterion_domain",
                                                    "start_date",
                                                    "end_date",
                                                    "source_value",
                                                    "source_vocabulary_id",
                                                    "label"
                                                FROM (
                                                    SELECT
                                                        "person_id",
                                                        "criterion_id",
                                                        "criterion_table",
                                                        "criterion_domain",
                                                        "start_date",
                                                        "end_date",
                                                        "source_value",
                                                        "source_vocabulary_id",
                                                        "label"
                                                    FROM (
                                                        SELECT
                                                            *
                                                        FROM (
                                                            SELECT
                                                                "patient_id" AS "person_id",
                                                                "id" AS "criterion_id",
                                                                CAST(
                                                                    'clinical_codes' AS text
) AS "criterion_table",
                                                                CAST(
                                                                    'condition_occurrence' AS text
) AS "criterion_domain",
                                                                "start_date",
                                                                "end_date",
                                                                CAST(
                                                                    "clinical_code_source_value" AS text
) AS "source_value",
                                                                CAST(
                                                                    "clinical_code_vocabulary_id" AS text
) AS "source_vocabulary_id",
                                                                CAST(
                                                                    NULL AS text
) AS "label"
                                                            FROM
                                                                "clinical_codes"
                                                            WHERE ((
                                                                    "clinical_code_concept_id" IN (
                                                                        SELECT
                                                                            "id"
                                                                        FROM
                                                                            "concepts"
                                                                        WHERE ((
                                                                                "vocabulary_id" = 'ICD9CM'
)
                                                                            AND (
                                                                                "concept_code" IN (
                                                                                    '410.01', '410.11', '410.21', '410.31', '410.41', '410.51', '410.61', '410.71', '410.81', '410.91'
)
)
)
)
)
                                                                OR (
                                                                    "clinical_code_concept_id" IN (
                                                                        SELECT
                                                                            "id"
                                                                        FROM
                                                                            "concepts"
                                                                        WHERE ((
                                                                                "vocabulary_id" = 'ICD10CM'
)
                                                                            AND (
                                                                                "concept_code" IN (
                                                                                    'I21.01', 'I21.02', 'I21.09', 'I21.11', 'I21.19', 'I21.21', 'I21.29', 'I21.3', 'I21.4', 'I21.9', 'I21.A1', 'I21.A9', 'I22.0', 'I22.1', 'I22.2', 'I22.8', 'I22.9'
)
)
)
)
)
) -- #&lt;ConceptQL::Operator::icd9cm @arguments=["410.01", "410.11", "410.21", "410.31", "410.41", "410.51", "410.61", "410.71", "410.81", "410.91"]&gt;
) AS "t1"
) AS "t1"
) AS "t1" -- #&lt;ConceptQL::Operators::Union @upstreams=[#&lt;ConceptQL::Operator::icd9cm @arguments=["410.01", "410.11", "410.21", "410.31", "410.41", "410.51", "410.61", "410.71", "410.81", "410.91"]&gt;, #&lt;ConceptQL::Operator::icd10cm @arguments=["I21.01", "I21.02", "I21.09", "I21.11", "I21.19", "I21.21", "I21.29", "I21.3", "I21.4", "I21.9", "I21.A1", "I21.A9", "I22.0", "I22.1", "I22.2", "I22.8", "I22.9"]&gt;]&gt;
) AS "t1"
)
                                        SELECT
                                            *
                                        FROM
                                            "union_22_1_d3d7bbd424bca0fe3b3f966d4ee80692"
) AS "t1"
                                    GROUP BY
                                        "person_id"
) AS "r"
) AS "r"
                            WHERE ((
                                    "l"."person_id" = "r"."person_id"
)
                                AND (
                                    "l"."end_date" &lt; CAST((
                                            CAST(
                                                "r"."start_date" AS timestamp
) + make_interval(
                                                days := - 30
)
) AS date
)
)
)
)
)
) AS "t1"
) AS "t1" -- #&lt;ConceptQL::Operators::Before @upstreams={:left=&gt; #&lt;ConceptQL::Operators::Union @upstreams=[#&lt;ConceptQL::Operator::icd9cm @arguments=["250.00", "250.01", "250.02", "250.03", "250.10", "250.11", "250.12", "250.13", "250.20", "250.21", "250.22", "250.23", "250.30", "250.31", "250.32", "250.33", "250.80", "250.81", "250.82", "250.83", "250.90", "250.91", "250.92", "250.93"]&gt;, #&lt;ConceptQL::Operator::icd10cm @arguments=["E10.10", "E10.11", "E10.61", "E10.61", "E10.62", "E10.62", "E10.62", "E10.62", "E10.63", "E10.63", "E10.64", "E10.64", "E10.65", "E10.69", "E10.8", "E10.9", "E11.00", "E11.01", "E11.10", "E11.11", "E11.61", "E11.61", "E11.62", "E11.62", "E11.62", "E11.62", "E11.63", "E11.63", "E11.64", "E11.64", "E11.65", "E11.69", "E11.8", "E11.9", "E13.00", "E13.01", "E13.10", "E13.11", "E13.61", "E13.61", "E13.62", "E13.62", "E13.62", "E13.62", "E13.63", "E13.63", "E13.64", "E13.64", "E13.65", "E13.69", "E13.8", "E13.9"]&gt;]&gt;, :right=&gt; #&lt;ConceptQL::Operators::Union @upstreams=[#&lt;ConceptQL::Operator::icd9cm @arguments=["410.01", "410.11", "410.21", "410.31", "410.41", "410.51", "410.61", "410.71", "410.81", "410.91"]&gt;, #&lt;ConceptQL::Operator::icd10cm @arguments=["I21.01", "I21.02", "I21.09", "I21.11", "I21.19", "I21.21", "I21.29", "I21.3", "I21.4", "I21.9", "I21.A1", "I21.A9", "I22.0", "I22.1", "I22.2", "I22.8", "I22.9"]&gt;]&gt;}&gt;
) AS "t1"
)
    SELECT
        *
    FROM
        "before_22_3_24a2b6fb9b0f37c3735c654d1ffd3cab") AS "t1"
</code></pre></div></div>]]></content><author><name>Mark Danese</name></author><category term="algorithms" /><summary type="html"><![CDATA[We are pleased to announce that we have released our algorithm library to the public at https://public.jigsaw.io. This is something we have wanted to do for years, but there was always something else that we wanted to include or change. In the spirit of “perfect is the enemy of good”, we decided that the algorithm library was finally “good enough” to make it available to others. It is still a work in progress. We hope researchers who conduct studies using healthcare data find it useful.]]></summary></entry></feed>