Today, I want to discuss a known but perhaps less spoken about problem, inspired by the recent forum post from @Chris_Knoll (Using the ‘Mapped’ function of concept set expressions).
I am a big fan of the OHDSI Cohort Definition model (i.e., the CIRCE model) and the ATLAS UI for cohort definition, as it provides an elegant framework for defining entry events, inclusion rules, and exit criteria. However, I often find myself frustrated by the limitations of its concept set expression, the structure of the OMOP vocabulary tables (concept_ancestor and concept_relationship), and the sometimes painfully ambiguous distinctions between standard vs. non-standard concepts.
I believe we must improve these foundational elements. We need to reduce the inherent ‘difficulty’ in phenotyping, minimize the need for ‘vocabulary SQL acrobatics’, and escape the recurring ‘version nightmares’.
1. Complex Concept Sets are Unavoidable
Let’s start with an obvious point: we cannot avoid using complex concept set expressions. Relying solely on a conceptId plus its descendants is rarely sufficient (Sorry, @Christian_Reich ). Because of the polyhierarchy in concept_ancestor, simply pulling in all descendants often introduces specificity errors. Let’s call this established and not debate it.
2. The CIRCE Concept Set Expression does not support concept_relationship and thus ignores laterality
If we agree that complex expressions are necessary, the next question is: Do our tools support building them effectively, or are we forced into complex acrobatics?
Currently, the concept set expression model in CIRCE supports only two modes of traversal:
- Hierarchy (Vertical): Using includeDescendants, which queries the pre-computed concept_ancestor table. This is fast and powerful for parent-child relationships.
- Mapping (Limited Horizontal): Using includeMapped. This is an occasionally used acrobatic maneuver strictly limited to the hardcoded ‘Maps to’ relationship in concept_relationship (as discussed in the forum link above from @Chris_Knoll ).
This limitation means that the rich, ‘lateral’ ontological relationships stored in concept_relationship—such as ‘Has causative agent’, ‘Has finding site’, or ‘Has active ingredient’—are essentially locked away from us within the ATLAS/CIRCE framework. We cannot use them to define our concept sets.
3. Example: The Challenge of DILI (Drug-Induced Liver Injury)
Complex phenotypes like DILI perfectly illustrate this difficulty. The clinical idea is straightforward:
Find conditions representing liver injury that are NOT caused by other specific etiologies like viruses, alcohol, or physical obstruction.
Currently, we solve this using cohort definition acrobatics—perhaps 10 inclusion rules requiring exactly 0 occurrences of hepatic viruses within -7 to +7 days of the hepatitis event, etc. We do this because the entry event concept set cannot leverage laterality; it can only navigate parent-child relationships.
What we should be able to ask the vocabulary is: “Show me all descendants of ‘Liver Injury’ that DO NOT have a ‘Caused by’ relationship to a descendant of ‘Virus’ or ‘Alcohol’.”
Because CIRCE concept set expression cannot traverse the ‘Caused by’ relationship, we are forced into a painful, manual workaround:
- Manual Curation: We must painstakingly search the vocabulary for every pre-coordinated concept that implies a non-drug cause (e.g., “Alcoholic hepatitis,” “Acute viral hepatitis B,” “Obstructive biliary disease”).
- Brittle Exclusion List: We then add this long, manually curated list to the “Excluded Concepts” panel. It’s brittle because the concept set expression does not automatically incorporate vocabulary updates into our clinical idea of what may be causing the disease.
This workaround is not just tedious; it’s the primary source of the “difficulty” and “version nightmares” I mentioned. It is error-prone (leading to low recall or specificity errors), and the list becomes obsolete with the next vocabulary update, requiring the entire manual process to be repeated. And do not get me started on the added complexity of non-standard ICD10CM/ICD9CM code mapping.
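To make the contrast concrete, here is a minimal Python sketch of the question we would like to ask the vocabulary directly: descendants of ‘Liver injury’ that have no causative-agent link to a descendant of ‘Virus’ or ‘Alcohol’. All concept names and edges below are hypothetical stand-ins, not real OMOP content:

```python
# Toy graph (hypothetical names, not real OMOP concept_ids) illustrating the
# lateral query that current concept set expressions cannot express.
is_a = {  # child -> parents ('Is a' edges)
    "Alcoholic hepatitis": ["Liver injury"],
    "Acute viral hepatitis B": ["Liver injury"],
    "Toxic liver disease": ["Liver injury"],
    "Hepatitis B virus": ["Virus"],
    "Ethanol": ["Alcohol"],
}
caused_by = {  # condition -> causative agents ('Has causative agent' edges)
    "Alcoholic hepatitis": ["Ethanol"],
    "Acute viral hepatitis B": ["Hepatitis B virus"],
}

def ancestors(concept):
    """All transitive 'Is a' ancestors of a concept (excluding itself)."""
    out, stack = set(), list(is_a.get(concept, []))
    while stack:
        parent = stack.pop()
        if parent not in out:
            out.add(parent)
            stack.extend(is_a.get(parent, []))
    return out

def dili_candidates():
    """Descendants of 'Liver injury' with no cause descending from Virus/Alcohol."""
    descendants = [c for c in is_a if "Liver injury" in ancestors(c)]
    excluded_roots = {"Virus", "Alcohol"}
    keep = []
    for c in descendants:
        causes = caused_by.get(c, [])
        if not any(excluded_roots & (ancestors(a) | {a}) for a in causes):
            keep.append(c)
    return keep

print(dili_candidates())  # -> ['Toxic liver disease']
```

The hierarchical half of this logic (the `ancestors` walk) is exactly what concept_ancestor precomputes; it is the `caused_by` traversal that CIRCE gives us no way to express, forcing the manual exclusion list instead.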
4. Why This Matters for the Future – because the community is actively solving the ‘phenotyping is difficult’ problem.
I am hearing about many innovations being built on top of the existing concept_ancestor structure and the current CIRCE model—for example, using generative AI or vector embedding-based semantic search to build concept set expressions.
Perhaps these approaches will succeed, and we can continue to rely solely on concept_ancestor without using the concept_relationship. But perhaps we are overlooking a solution that already exists within our ontology (i.e., laterality) and hamstringing these new tools by building them on an incomplete foundation.
This is even more critical in domains like cancer phenotyping, where laterality is essential for distinguishing key characteristics:
- Origin/Status: Primary (originating in the lung) vs. Secondary (metastasis from elsewhere).
- Morphology: Malignant (cancerous) vs. Benign.
- Histology: Non-Small Cell Lung Cancer (NSCLC) vs. Small Cell Lung Cancer (SCLC).
- Location: Lower respiratory tract (lung/bronchus) vs. Upper respiratory tract.
To achieve robust, reproducible, and scalable phenotyping, we need tools that can navigate the full richness of our vocabulary, not just the hierarchy.
There are many ways to tackle this problem – if we agree it is a problem. Maybe we should dream big – rethink this completely, adopt labeled property knowledge graphs, and throw away concept_ancestor and concept_relationship (making us AI/LLM ready). Maybe we should not blindly make every concept that lacks an equivalent standard concept non-standard (e.g., ICD10CM codes without a 1:1 equivalence mapping) and map it lossily, but instead create SNOMED extensions and make them standard with proper descendants. Maybe we should make a breaking change to CIRCE to support knowledge graphs instead of SQL.
A little more about the two OMOP vocabulary tables: concept_relationship and concept_ancestor. concept_relationship provides the foundational, direct links between concepts. concept_ancestor is a derived optimization table specifically designed to handle complex hierarchies.
Concept Relationship (https://ohdsi.github.io/CommonDataModel/cdm54.html#concept_relationship)
The concept_relationship table stores the direct, asserted relationships (fundamental truths, axioms) between two concepts, as defined by the source vocabularies (e.g., SNOMED CT, RxNorm, LOINC) or curated by the OHDSI community vocabulary team. It defines how two concepts are related. It bridges different vocabulary domains (e.g., linking a condition to a drug that treats it) and different source terminologies (‘Maps to’, e.g., mapping ICD-10 to SNOMED). It only stores immediate relationships. If A is related to B, and B is related to C, this table stores (A-B) and (B-C), but not (A-C). It captures many relationships e.g., ‘Has active ingredient’, ‘Has finding site’, ‘Has causative agent’ via relationship_id with hierarchical relationships represented using ‘Is a’.
Concept Ancestor (https://ohdsi.github.io/CommonDataModel/cdm54.html#concept_ancestor)
The concept_ancestor table is derived from the hierarchical (‘Is a’) relationships present in concept_relationship. It represents the transitive closure of the vocabulary hierarchies, providing a pre-calculated lookup of all hierarchical lineages. It goes beyond immediate relationships: if A is a B, and B is a C, concept_ancestor explicitly stores (A-B), (B-C), and (A-C)—but only in a hierarchical context (i.e., derived from ‘Is a’). It does not include non-hierarchical relationships like ‘Maps to’ or ‘Has ingredient’. Because it is derived from hierarchical relationships, it is limited to ‘Is a’ relationships between valid, standard concepts. By precalculating, concept_ancestor transforms complex hierarchical traversals into simple, efficient JOIN operations. Without it, queries against concept_relationship would require complex and slow recursive CTEs.
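The derivation described above can be sketched in a few lines of Python. This toy version (hypothetical integer IDs, self-referencing rows with level 0 omitted) computes the transitive closure of direct ‘Is a’ edges and records min/max path lengths, analogous to min_levels_of_separation and max_levels_of_separation in the real table:

```python
# Minimal sketch of how concept_ancestor is derived: take the direct 'Is a'
# edges from concept_relationship and compute their transitive closure.
from collections import defaultdict

# (concept_id_1 'Is a' concept_id_2): child -> parent, toy IDs
is_a_edges = [(1, 2), (2, 3), (4, 2)]

def build_concept_ancestor(edges):
    parents = defaultdict(set)
    for child, parent in edges:
        parents[child].add(parent)
    closure = {}  # (descendant, ancestor) -> (min_levels, max_levels)
    for start in parents:
        stack = [(start, 0)]
        while stack:
            node, depth = stack.pop()
            for p in parents.get(node, ()):
                key = (start, p)
                lo, hi = closure.get(key, (depth + 1, depth + 1))
                closure[key] = (min(lo, depth + 1), max(hi, depth + 1))
                stack.append((p, depth + 1))
    return closure

ca = build_concept_ancestor(is_a_edges)
# (1, 3) is stored explicitly even though only (1,2) and (2,3) were asserted:
print(ca[(1, 3)])  # -> (2, 2)
```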
In summary, concept_relationship is difficult to query with SQL for deep relationships because of the explosion of recursive CTEs. Concept_ancestor is an elegant workaround that transforms complex hierarchical traversals into simple, efficient JOIN operations. However, concept_ancestor is limited to the ‘Is a’/‘Subsumes’ relationship_id, standard (and classification?) concepts, and valid concepts.
In reality, concept_ancestor is a pragmatic choice designed to achieve the OHDSI mission of network studies on multiple DBMS with computational efficiency. It achieves this by using pre-computed content where possible. OHDSI tools (Circe) do not support the dynamic generation of complex SQL to traverse the concept_relationship table. Further, SQL-based queries that are overtly complex are likely to have performance issues and potentially fail over the multiple DBMS that OHDSI network studies would like to support.
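The runtime trade-off above can be demonstrated with SQLite (toy table contents, simplified columns). Without concept_ancestor, finding all descendants of a concept requires a recursive CTE over concept_relationship; with the precomputed closure, it collapses to a single equality lookup:

```python
# Contrast the two query styles on a toy 'Is a' hierarchy in SQLite.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE concept_relationship (concept_id_1 INT, relationship_id TEXT, concept_id_2 INT);
INSERT INTO concept_relationship VALUES (1,'Is a',2),(2,'Is a',3),(4,'Is a',2);

CREATE TABLE concept_ancestor (descendant_concept_id INT, ancestor_concept_id INT);
INSERT INTO concept_ancestor VALUES (1,2),(2,3),(1,3),(4,2),(4,3);
""")

# Recursive traversal over concept_relationship (what we avoid at runtime):
recursive = con.execute("""
WITH RECURSIVE descendants(id) AS (
    SELECT concept_id_1 FROM concept_relationship
    WHERE relationship_id = 'Is a' AND concept_id_2 = 3
    UNION
    SELECT cr.concept_id_1 FROM concept_relationship cr
    JOIN descendants d ON cr.concept_id_2 = d.id AND cr.relationship_id = 'Is a'
)
SELECT id FROM descendants ORDER BY id
""").fetchall()

# The same answer from the precomputed transitive closure:
precomputed = con.execute("""
SELECT descendant_concept_id FROM concept_ancestor
WHERE ancestor_concept_id = 3 ORDER BY descendant_concept_id
""").fetchall()

print(recursive == precomputed == [(1,), (2,), (4,)])  # -> True
```

On real vocabularies the recursive form must re-walk the hierarchy per query and per DBMS, which is exactly the portability and performance problem concept_ancestor was built to sidestep.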
But because it only supports a limited set of relationship_id (‘Is a’/‘Subsumes’) and because OHDSI tools do not support the richness of the concept_relationship, we can’t leverage these in the Cohort Definitions. Obviously, ‘Has causative agent’ is a special relationship_id that does not represent a parent-child/ancestor relationship.
The above is just one explanation as to why phenotyping is difficult. This difficulty is further explained by:
- Standard/non-standard: We attempt to pick standard vs. non-standard based on quality judgments such as whether a code follows principles like non-redundancy, atomicity, compositionality, clarity, and unambiguity, but vocabulary creators have to make pragmatic decisions to balance expressive power and practical usability. Do we know how good our classification of standard/non-standard actually is? If we get a list of ICD codes as the source of truth, we wave it off as inferior—probably because the effort to create a ‘standard’-based concept set expression is a pain, and it is so much easier to start a concept set expression from scratch.
- Mapping choices: If we declare one code standard and another non-standard, then we have to map between them. If a non-standard code has a meaning that is not equivalent to any standard code, we map it lossily to a broader standard concept. This, while intuitive, causes headaches in concept set expression building because selecting a descendant increases the probability of pulling in codes in unexpected ways.
I think this problem may be succinctly described as the SQL Expressivity Gap. This gap is fundamentally a mismatch between the query language (SQL) and the underlying data structure (an ontology, which is inherently a graph). Graph databases (e.g., Labeled Property Graphs or LPGs) and their associated query languages (e.g., Cypher) are explicitly designed to handle the interconnected nature of ontological data.
In theory, they offer significant advantages over SQL in this domain:
- Expressivity and Intuition: Graph languages can express complex biomedical criteria much more intuitively by directly mirroring the ontological structure.
- Conciseness: A query traversing multiple relationship types that might require dozens of JOINs in SQL can often be represented in a few lines of Cypher.
- Maintainability and Transparency: Simpler expressions enhance readability and reduce the overhead associated with maintaining and validating concept sets, particularly across vocabulary updates.
- LLM Compatibility: The pattern-matching syntax of graph languages aligns well with the capabilities of Large Language Models (LLMs), suggesting significant potential for accelerated, AI-assisted phenotype development.
Illustrative Use Case: DILI
Continuing with the example of Drug-Induced Liver Injury (DILI):
“Create a concept set representing liver injury Concept IDs that are NOT caused by other specific etiologies like viruses, alcohol, or physical obstruction.”
In the current SQL/Circe paradigm, we resort to “vocabulary acrobatics”: manually curating extensive, brittle lists of pre-coordinated concepts (e.g., “Alcoholic hepatitis”) for exclusion. This list requires constant maintenance as the vocabulary evolves.
In a graph paradigm, this requirement can be expressed directly against the ontology: “Select all descendants of ‘Liver Injury’ that DO NOT possess a ‘Has causative agent’ relationship to any descendant of ‘Virus’ or ‘Alcohol’.”
A simplified conceptual Cypher query:
// Find all concepts descending from 'Liver Injury'
MATCH (injury:Concept)-[:IS_A*]->(:Concept {name: 'Liver Injury'})
// Ensure that the injury concept does not have a causative agent
// that descends from 'Virus' or 'Alcohol'
WHERE NOT EXISTS {
MATCH (injury)-[:HAS_CAUSATIVE_AGENT]->(cause)
WHERE (cause)-[:IS_A*]->(:Concept {name: 'Virus'})
OR (cause)-[:IS_A*]->(:Concept {name: 'Alcohol'})
}
RETURN injury.concept_id
This approach is declarative, robust to vocabulary updates, and directly reflects the clinical intent. While concept_ancestor facilitates efficient hierarchical queries using SQL by pre-calculating the transitive closure of the ‘Is a’ relationships, it cannot support the lateral, axiomatic filtering required here.
Specifically, the concept_ancestor table cannot execute the following logic:
// === AXIOMATIC (Lateral) Filtering - Exclusion Criteria ===
// Filter the candidate set by excluding specific etiologies. This is where
// we overcome the SQL Expressivity Gap by utilizing asserted axioms.
WHERE NOT EXISTS {
// Step 2a: LATERAL Traversal
// We traverse from the condition (injury) to its asserted cause using the
// 'HAS_CAUSATIVE_AGENT' relationship. This is the critical lateral link
// (stored in concept_relationship) that is generally inaccessible in standard
// ATLAS/Circe concept set definitions.
MATCH (injury)-[:HAS_CAUSATIVE_AGENT]->(cause)
// Step 2b: HIERARCHICAL Traversal on the Cause
// Verify if that identified 'cause' is a descendant of 'Virus' OR 'Alcohol'.
WHERE (cause)-[:IS_A*]->(:Concept {name: 'Virus'})
OR (cause)-[:IS_A*]->(:Concept {name: 'Alcohol'})
}
Integrating Graph Queries within the CIRCE Model
It is possible to replace SQL-based concept set resolution with a graph-based approach within the existing OHDSI framework. The Concept Set Expression is a modular component within the broader Circe cohort definition JSON structure. Currently, this expression is rendered into SQL, which resolves to a temporary table containing the relevant Concept IDs.
It is architecturally feasible to replace this SQL rendering engine with a graph query execution engine. The graph query would fulfill the same contract—returning a set of Concept IDs—which can then be utilized by the subsequent stages of the cohort definition logic. While this implementation is not trivial, it may be a consideration for the roadmap of Atlas 3.x @anthonysena @Chris_Knoll .
Is Graph Empirically Superior?
How do we empirically test that a graph language is superior to the current SQL-based approach? We propose the following hypotheses:
- H1 (Expressivity): Graph languages (e.g., Cypher) can represent complex concept set expressions with greater conciseness and fidelity to clinical intent compared to current Circe/SQL implementations.
- H2 (LLM Efficacy): LLMs can reliably generate and validate graph-based concept set expressions, significantly improving productivity in phenotype development.
- H3 (UI/UX Feasibility): It is feasible to develop a user interface within tools like Atlas that allows researchers to intuitively construct and visualize graph-based queries, replacing the current concept set editor.
- H4 (Infrastructure): A graph database representation of the OMOP vocabulary can effectively supersede the functional roles of concept_relationship and the derived concept_ancestor table while meeting performance requirements.
Implementation Pathways
To test these hypotheses, we need to construct a knowledge graph database. Two primary pathways are available:
- Reuse and transform OMOP-Vocabulary into Graph DB (Pragmatic Short-Term): Convert the existing concept_relationship table into a graph representation (LPG or RDF). This is the most straightforward approach for initial hypothesis testing, as the knowledge graph is easier to construct from existing tables.
- UMLS-Native Approach (Strategic Long-Term): If we establish the superiority of knowledge graphs, we can pursue a more robust solution. This involves loading the vocabulary directly from the UMLS Metathesaurus into a graph database, adding the OMOP concept_id as a property to the corresponding UMLS Concept Unique Identifier (CUI). This approach ensures native alignment with the foundational source ontology, enhancing defensibility, traceability, and interoperability. This strategy could eventually render concept_ancestor and significant parts of concept_relationship obsolete.
I think the benefits of the OHDSI community adopting a graph-based approach extend far beyond phenotyping. It establishes a foundation for advanced AI applications, including Graph Retrieval-Augmented Generation (Graph RAG) and sophisticated semantic search. We can embed concept_id and even vector embeddings as properties in the graph, and make it a community asset in Neo4j. Furthermore, this infrastructure improves the reliability of AI outputs by facilitating standardized Named Entity Recognition (NER) and Normalization (NEN) anchored to OMOP/UMLS concepts within prompts.
This evolution is likely a necessary step toward realizing the full potential of observational research and ensuring the OHDSI ecosystem is prepared for the next generation of AI-driven evidence generation.
Hopefully, I will get access to a Neo4j database containing an LPG representation of the concept_relationship table. I do not have experience with Cypher queries, so although theoretically this sounds interesting, I do not know if it will work yet. Here is an update on my assessment of the four key hypotheses:
H1 (Expressivity)
The superior expressivity of graph query languages is well established. However, it is still worthwhile to validate this specifically within the context of the OHDSI vocabulary. Once the Neo4j instance is available and I have familiarized myself with the Cypher language, I will conduct specific tests and report back.
H2 (LLM Efficacy)
Hypothesis: LLMs can effectively translate natural language intent into executable graph queries, bridging the gap for non-technical users.
I assume that with the increasing maturity of LLMs and advancements in “Text-to-Query” fine-tuning, this hypothesis should be true. LLMs have demonstrated a sophisticated ability to understand both the semantic intent of a query and how to apply that intent to complex database schemas to render accurate queries.
To test this, I prompted Gemini with a couple of complex scenarios, assuming a standard LPG schema (Nodes labeled :Concept, relationships typed by relationship_id). The results were very convincing albeit untested:
Scenario 1: Cross-Domain Relationships
- Natural Language Intent:
“Find all conditions that are known indications for drugs containing the ingredient ‘Metformin’ (Concept ID 1503297).”
- LLM-Generated Cypher:
MATCH (ingredient:Concept {concept_id: 1503297})
// Find drugs that have this ingredient. (Drug)-[:Has_ingredient]->(Ingredient)
MATCH (drug:Concept)-[:Has_ingredient]->(ingredient)
// Find conditions that the drug may treat. (Drug)-[:May_treat]->(Condition)
MATCH (drug)-[:May_treat]->(condition:Concept)
WHERE condition.domain_id = 'Condition'
RETURN DISTINCT condition.concept_id, condition.concept_name
Scenario 2: Complex “Graph Concept Set” (Hierarchy + Attributes)
- Natural Language Intent:
“I want to create a concept set for ‘Breast Cancer’. This should include all descendants of ‘Malignant neoplasm of breast’ (Concept ID 4112853). However, I only want to include concepts where the associated morphology is specifically ‘Infiltrating duct carcinoma’ (Concept ID 4237178).”
- LLM-Generated Cypher:
// 1. Identify the starting hierarchical concept and the target attribute concept
MATCH (start_concept:Concept {concept_id: 4112853})
MATCH (target_morphology:Concept {concept_id: 4237178})
// 2. Find all descendants of the starting concept (traversing backwards)
// Note: Assumes (Descendant)-[:Subsumes]->(Ancestor) based on OHDSI standard
MATCH (start_concept)<-[:Subsumes*0..]-(descendant:Concept)
// 3. FILTER: Keep only descendants that also have the specified morphology relationship
// (Condition)-[:Has_morphology]->(Morphology)
WHERE (descendant)-[:Has_morphology]->(target_morphology)
RETURN DISTINCT descendant.concept_id, descendant.concept_name
These examples strongly suggest that LLMs can effectively handle the complex, multi-hop logic required for advanced phenotyping without requiring the researcher to write the graph query manually. Of course, I gave it the concept_ids in the prompt and have not tested or challenged it to pick the concept_ids itself. Picking concept_ids is something that others are trying to solve now using semantic search, and this should be doable with NER/NEN.
H3 (UI/UX Feasibility)
Hypothesis: It is feasible to develop a user interface that allows researchers to intuitively construct and visualize graph-based queries.
My preliminary research suggests this is feasible. There are several dedicated Visual Query Builders, Visual Graph Exploration Tools, and Foundational Libraries (like Cytoscape.js) for custom development. Foundational libraries could be integrated into OHDSI tools (e.g., Atlas) to build a UI/UX for complex concept set expressions. I am going to assume that these Visual Query Builders are capable of supporting the construction of sufficiently complex Cypher queries.
H4 (Infrastructure)
This will be the last item to prove and likely the most challenging aspect of adoption. Obviously, replacing concept_relationship and concept_ancestor with a graph database backend would be a significant breaking change to the OHDSI tool stack and infrastructure. This requires careful consideration.
Below are the conceptual implementation pathways for creating a graph database.
1. Converting Current OMOP-Vocabulary tables into Graph DB:
Migrating the existing OMOP vocabulary tables into a Labeled Property Graph (LPG) involves restructuring the relational data. This transformation turns the implicit relationships defined by relational joins into explicit, traversable edges within the graph. The migration sources are the core tables: concept, concept_relationship, relationship, concept_synonym, vocabulary, and domain.
1.1. Defining the Node Model (Entities):
The central entity is the Concept. Properties such as concept_id, concept_name, concept_code, standard_concept, concept_class_id, and validity dates are stored directly on the node. We may also want to add secondary labels based on the domain_id (e.g., :Drug, :Condition, :Procedure). We may need metadata nodes sourced from the domain and vocabulary tables to provide context and provenance. Data from the concept_synonym table must be linked to the appropriate concepts, optimally stored as a list or array property (e.g., ‘synonyms’) on the corresponding :Concept node.
1.2. Defining the Edge Model (Connections):
The core of the model consists of Semantic Edges (Concept-to-Concept). These are directed edges created from the node identified by concept_id_1 to the node identified by concept_id_2. The edge type is defined by the relationship_id (e.g., ‘Is a’, ‘Maps to’), which specifies the semantic nature of the connection (e.g., (c1)-[:Is_a]->(c2)). Edge Properties must include the validity information (valid_start_date, valid_end_date, invalid_reason), as this data pertains to the relationship itself. We can also add contextual edges to link the :Concept nodes to their metadata, replacing the domain_id and vocabulary_id foreign keys. The relationship table provides the definitions for these edge types. Attributes like is_hierarchical, defines_ancestry, and information on reverse relationships inform how algorithms should traverse the graph (e.g., identifying which relationships to follow when determining lineage).
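The node and edge model above can be sketched as a small transformation from relational rows to Cypher CREATE statements. All rows below are toy examples (concept names, domains, and dates are illustrative, not verified vocabulary content), and a real load would use batched parameters or a bulk import tool rather than string-built statements:

```python
# Sketch of the relational-to-LPG transformation: concept rows become
# :Concept nodes (with a secondary label from domain_id) and
# concept_relationship rows become typed, directed edges with validity
# properties. Toy rows only; illustrative, not real vocabulary content.
concepts = [  # (concept_id, concept_name, domain_id, standard_concept)
    (4112853, "Malignant neoplasm of breast", "Condition", "S"),
    (4237178, "Infiltrating duct carcinoma", "Observation", "S"),
]
relationships = [  # (concept_id_1, relationship_id, concept_id_2, valid_start_date)
    (4112853, "Has morphology", 4237178, "1970-01-01"),
]

def to_cypher(concepts, relationships):
    stmts = []
    for cid, name, domain, std in concepts:
        label = domain.replace(" ", "")  # secondary label derived from domain_id
        stmts.append(
            f"CREATE (:Concept:{label} {{concept_id: {cid}, "
            f"concept_name: '{name}', standard_concept: '{std}'}})"
        )
    for c1, rel, c2, start in relationships:
        rel_type = rel.replace(" ", "_")  # 'Has morphology' -> Has_morphology
        stmts.append(
            f"MATCH (a:Concept {{concept_id: {c1}}}), (b:Concept {{concept_id: {c2}}}) "
            f"CREATE (a)-[:{rel_type} {{valid_start_date: '{start}'}}]->(b)"
        )
    return stmts

for s in to_cypher(concepts, relationships):
    print(s)
```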
2. Convert UMLS-Native files into Graph DB:
Conceptually, constructing an LPG directly from the UMLS Metathesaurus, and subsequently enriching it with OMOP concept_ids, should be possible.
The UMLS Metathesaurus integrates over 200 biomedical vocabularies (e.g., SNOMED CT, RxNorm, LOINC). It is distributed primarily in the Rich Release Format (RRF) Metathesaurus Data Files, a set of pipe-delimited relational files.
The National Library of Medicine (NLM) provides loaders/parsers for RRF. The official tool, MetamorphoSys, allows for customizing, subsetting, and parsing the RRF files. It also includes scripts to load the UMLS data into standard relational databases (RDBMS), such as MySQL or PostgreSQL. Next, we could use ETL tools (e.g., Neo4j’s) to migrate the data from the RDBMS tables into the target graph structure. This approach leverages the NLM’s official parser and manages the complexity of the RRF format.
We must decide on an appropriate Graph Model. This requires consideration of the different identifier systems: OMOP mappings rely on specific source codes and vocabulary names, while UMLS uses identifiers such as CUI (Concept Unique Identifier), LUI (Lexical Unique Identifier), and AUI (Atom Unique Identifier).
The nodes (entities) may represent CUIs or AUIs, with properties such as the term string, term type (TTY), name, and source vocabulary (SAB). The edges (connections) should link concepts to atoms (e.g., CUI to AUI) and concepts to other concepts (CUI to CUI). These edges will also need appropriate properties—such as the relationship type (REL/RELA) and the source asserting the relationship—to represent their provenance.
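A first step toward the CUI/AUI node model sketched above is parsing the pipe-delimited RRF atom file (MRCONSO.RRF) and grouping atoms under their CUI. The field order below follows the documented MRCONSO layout (CUI|LAT|TS|LUI|STT|SUI|ISPREF|AUI|SAUI|SCUI|SDUI|SAB|TTY|CODE|STR|…), but it should be verified against the release actually loaded; the sample line is fabricated, not real UMLS data:

```python
# Sketch: parse MRCONSO.RRF lines into atom records grouped by CUI.
# Assumed field order per the documented MRCONSO layout; verify against the
# UMLS release notes for the version you load.
from collections import defaultdict

MRCONSO_FIELDS = ["CUI", "LAT", "TS", "LUI", "STT", "SUI", "ISPREF", "AUI",
                  "SAUI", "SCUI", "SDUI", "SAB", "TTY", "CODE", "STR"]

def parse_mrconso(lines):
    cui_nodes = defaultdict(list)  # CUI -> list of atom (AUI) records
    for line in lines:
        fields = line.rstrip("\n").split("|")
        atom = dict(zip(MRCONSO_FIELDS, fields))  # zip truncates trailing fields
        cui_nodes[atom["CUI"]].append(
            {"AUI": atom["AUI"], "SAB": atom["SAB"],
             "TTY": atom["TTY"], "STR": atom["STR"]}
        )
    return cui_nodes

# A fabricated example line (not real UMLS data):
sample = ("C0000001|ENG|P|L0000001|PF|S0000001|Y|A0000001|||12345|"
          "SNOMEDCT_US|PT|12345|Example term|")
nodes = parse_mrconso([sample])
print(nodes["C0000001"][0]["SAB"])  # -> SNOMEDCT_US
```

The OMOP concept_id would then be attached as a property on the matching node, keyed by source vocabulary (SAB) and code, as described in the following step.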
Finally, we have to map and attach the corresponding OMOP concept_id as a property to the relevant nodes. This approach effectively replaces the need for the OMOP concept_ancestor, concept_relationship, and concept_synonym tables, as these relationships are inherently captured by the graph structure, while still retaining the utility of the OMOP concept_id.
Implications:
Converting UMLS-Native files into a Graph DB is significantly more complex, but such a resource offers substantial advantages. It can serve applications beyond the OHDSI ecosystem while simultaneously supporting OHDSI requirements. For example, the integrated graph (containing UMLS CUIs linked to OMOP concept_ids) can be highly effective in AI applications involving Named Entity Recognition (NER) and Named Entity Normalization (NEN). Furthermore, this approach simplifies OMOP vocabulary maintenance, as we can theoretically deprecate the concept_relationship and concept_ancestor tables.
This transformation is a lot of work but our community has experts who are able to perform these transformations. Our OMOP vocabulary team @Anton_Tatur and others have created the concept_ancestor and concept_relationship tables. @Anton_Tatur and I are going to try this out.
Why are we interested in this problem space:
In addition to my hunch that knowledge graphs may help us improve concept set creation and thus phenotyping, we are also working on several efforts to leverage public knowledge in scientific and technical documents to enable AI-driven decision making across drug development, drug launch, and pharmacovigilance. More on that to follow. I will probably discuss that in the Generative AI workgroup soon.
