Digital Pathology Podcast

201: Confidence-Based AI Pathology for Cholangiocarcinoma Diagnosis

Subscriber Episode · Aleksandra Zuraw, DVM, PhD · Episode 201


Paper Discussed in this Episode:

A confidence-based, artificial intelligence pathology model for diagnosis of intrahepatic cholangiocarcinoma. Chang, Jay, Calderaro, et al. Annals of Oncology 2026. DOI: 10.1016/j.annonc.2026.02.018.

Episode Summary: In this journal club deep dive, we tackle one of the most frustrating diagnostic puzzles in liver cancer: differentiating primary intrahepatic cholangiocarcinoma (ICCA) from metastatic liver cancers. We examine a groundbreaking 2026 study introducing AI2CCA, a deep-learning pathology model that evaluates routine digitized slides. The study forces us to ask a critical question: how can we safely deploy AI in the clinic? The answer lies in teaching the machine to measure its own uncertainty, drastically reducing the need for invasive, exclusionary tests and accelerating life-saving treatments.

In This Episode, We Cover:

The Ultimate Clinical Bottleneck: Understanding the high-stakes diagnostic overlap between ICCA and metastatic adenocarcinomas. Because these tumors look functionally identical—sharing irregular glandular structures, mucin secretion, and fibrotic responses—patients often face weeks of invasive endoscopies and body scans to rule out an occult primary site before targeted treatment can begin.

The Foundation Model Bake-Off: Researchers pitted three advanced, self-supervised deep learning architectures against each other using retrospective data from 544 patients across five European centers: CTransPath paired with HistoBistro, UNI paired with CLAM, and CONCH paired with TITAN, which emerged as the winner by mapping gigapixels of tissue to pathology reports using multimodal visual-language training.

The Secret Sauce - Predictive Entropy: An initial AUROC of 0.840 is not safe enough for clinical deployment. We break down how the team used Generalized-ODIN (G-ODIN) to calculate "predictive entropy"—a mathematical measurement of the AI's internal confusion when tissue is highly ambiguous.

The Power of Saying "I Don't Know": By setting a strict confidence threshold and refusing to diagnose ambiguous slides, the AI2CCA model improved its AUROC to 0.958 and dropped its false positive rate to absolute zero. While it only retained 46% of cases for high-confidence predictions, it provides a safe "fast-track" that could essentially halve the clinical backlog for unnecessary gastrointestinal scopes.

The Global Stress Test: To prove the AI didn't just memorize European lab stains, the team prospectively tested 161 new patients across France, India, and South Korea. Despite navigating completely different disease backgrounds—such as heavy cirrhosis and endemic liver flukes—the model achieved near-perfect accuracy (AUROCs of 1.00 and 0.965) with only one single misclassification globally.

Key Takeaway: True clinical AI doesn't need to replace the human diagnostic process; it just needs to know what it doesn't know. By perfectly triaging 46% of routine cases with zero false positives, AI2CCA transforms the human pathologist into the ultimate biological arbiter, freeing up their cognitive bandwidth for the most complex cases while allowing thousands of patients to skip unnecessary invasive tests.

Get the "Digital Pathology 101" FREE E-book and join us!

Welcome back to the Digital Pathology Podcast, trailblazers. I'm your host, and today we are jumping into a really fantastic journal club deep dive. We're looking at a groundbreaking paper from the Annals of Oncology.


Hi everyone. Yeah, I am, uh, I'm really thrilled to be here for this one. This paper by Chang, Jay, Calderaro, and their international team is, well, it's frankly trying to solve one of the most frustrating diagnostic puzzles in liver cancer.


Right. The AI2CCA model. And to really understand why this is a big deal, I want you, the listener, to just imagine a patient lying in a hospital bed. They just got biopsied for a mass in their liver.


Uh-huh. A really scary moment for anyone.


Exactly. So, the digitized slide goes under the microscope, and it is undeniably an adenocarcinoma. But instead of, you know, immediately drawing up a targeted chemo plan or prepping for surgery, the entire clinical team just hits pause.


They hit a total wall. They have to order a colonoscopy. They order an upper GI endoscopy. Uh, they schedule a full-body PET scan,


which takes weeks. And why? Because looking at that slide, the human pathologist literally cannot definitively tell if this tumor started in the liver's bile ducts, you know, an intrahepatic cholangiocarcinoma, or ICCA, or if it's a metastatic colon or gastric cancer that just happened to land in the liver.


Yeah. And that is the ultimate clinical bottleneck in hepatobiliary oncology. Right now, we are essentially forcing a patient with a brand new cancer diagnosis to undergo this massive battery of invasive, expensive scopes just to rule things out.


Exactly. Simply because human morphology has reached its absolute limit. You just can't see the difference.


So looking at these routine digitized slides, it's kind of like trying to tell identical twins apart, right? Except one is a local resident, the ICCA, and the other is just visiting from out of town, the metastasis.


That is a perfect way to put it. They both form these irregular glandular structures. They both provoke this intense fibrotic desmoplastic reaction in the surrounding stroma.


Um, they both secrete mucin,


so functionally identical


Pretty much, yeah. On a standard H&E-stained slide, the primary liver tumor and the metastatic invader look exactly the same,


but I guess my question is, why does this ambiguity actually matter so much for the patient right in that moment? Can't we just treat the liver mass?


Well, no, because you cannot treat a metastatic colorectal cancer with the same systemic therapy you'd use for primary biliary cancer. The mutation profiles, the targeted therapies, the surgical protocols, they're entirely divergent pathways.


Wow. Okay. So, the stakes are incredibly high.


Yeah. The burden of proof totally falls on the pathologist to rule out an occult primary tumor somewhere else in the gut before anyone dares to treat it as a primary ICCA


which causes that massive delay. So, since human experts are, you know, fundamentally limited by that visual overlap, Chang and Calderaro turned to machines. And they didn't just build a standard AI from scratch, right?


Right. They ran a huge architecture bake-off. They took retrospective data from 544 patients across five European centers, and they basically pitted three advanced deep learning foundation models against each other.


So it was CTransPath paired with HistoBistro, UNI paired with CLAM, and CONCH paired with TITAN.


Exactly. And what's fascinating here is the progression in how these models learn. Older models relied heavily on manually annotated pathology patches, but foundation models like UNI and CONCH are trained via self-supervised learning on millions of unannotated whole-slide images.


So they kind of learn the intrinsic visual language of human tissue without our biases telling them what to look for.


Yeah, exactly.


And when I saw they tested CONCH in there, I was super interested, because CONCH leverages visual-language multimodal training. It maps gigapixels of tissue directly to pathology reports,


right? And associates that specific glandular formation mathematically with the clinical text describing it, which gave it a distinct edge here.


And to process the whole slide, they use multiple instance learning, or MIL. Right. Yeah, MIL. So the slide is broken down into thousands of smaller tiles. The CONCH encoder looks at each patch, and then the TITAN aggregator acts like an attention mechanism, pulling all those insights together to make a slide-level diagnosis.
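For trailblazers who want to see the mechanics, here is a minimal, generic sketch of attention-based multiple instance learning in PyTorch. This is not the authors' CONCH/TITAN code; the layer sizes, the gated-attention design, and the random "tile embeddings" are all illustrative assumptions. It only shows the core idea: a frozen patch encoder produces one embedding per tile, an attention module weights the tiles, and a small classifier makes a single slide-level prediction.

```python
# Minimal sketch of attention-based multiple instance learning (MIL) for
# whole-slide classification. NOT the paper's CONCH/TITAN implementation;
# a generic illustration of attention pooling over tile embeddings.
import torch
import torch.nn as nn


class AttentionMIL(nn.Module):
    def __init__(self, embed_dim: int = 512, hidden_dim: int = 128, n_classes: int = 2):
        super().__init__()
        # Gated attention over patch embeddings (Ilse et al., 2018 style).
        self.attn_V = nn.Sequential(nn.Linear(embed_dim, hidden_dim), nn.Tanh())
        self.attn_U = nn.Sequential(nn.Linear(embed_dim, hidden_dim), nn.Sigmoid())
        self.attn_w = nn.Linear(hidden_dim, 1)
        self.classifier = nn.Linear(embed_dim, n_classes)

    def forward(self, patch_embeddings: torch.Tensor) -> torch.Tensor:
        # patch_embeddings: (n_patches, embed_dim), one row per tile embedding
        a = self.attn_w(self.attn_V(patch_embeddings) * self.attn_U(patch_embeddings))
        attn = torch.softmax(a, dim=0)                           # tile weights, (n_patches, 1)
        slide_embedding = (attn * patch_embeddings).sum(dim=0)   # attention-weighted pooling
        return self.classifier(slide_embedding)                  # slide-level logits (e.g. ICCA vs. metastasis)


# Toy usage: 2,000 hypothetical tile embeddings from a frozen foundation model.
tiles = torch.randn(2000, 512)
logits = AttentionMIL()(tiles)
print(torch.softmax(logits, dim=-1))  # slide-level class probabilities
```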


And out of the three, the CONCH and TITAN combination was the winner for this European data set.


It was. It achieved an area under the receiver operating characteristic curve, an AUROC, of 0.840.


Okay, let's unpack this right here, because an AUROC of 0.840 is, well, it's a solid B-grade. It's great for a research poster,


but if I'm the patient, I do not want a model that is just pretty good making calls about my liver cancer. I mean, an 84% is kind of terrifying in a clinical setting.


Oh, absolutely. That gap between a neat research algorithm and a deployable safe clinical tool is where most pathology AI projects just die,


right? So, how did the authors bridge that gap? How did they stop it from just guessing?


Well, they recognized that forcing a prediction on highly ambiguous tissue is a fatal flaw. So to make AI2CCA safe, they engineered a way for the model to actually measure its own uncertainty.


Oh wow.


Yeah. They implemented a confidence assessment system using something called Generalized ODIN, or G-ODIN.


I really want to break down how G-ODIN calculates that uncertainty, because reading the paper, this felt like the secret sauce.


It totally is the secret sauce. So it relies on a metric called predictive entropy. Okay. When a neural network makes a classification, it outputs a probability distribution across the possible classes using a math function called softmax.


Right. The softmax function.


Yeah. So if the AI looks at a slide and it aligns perfectly with its learned representation of ICCA, the output might be 99% ICCA and 1% metastasis.


Meaning the probability is sharply peaked.


Exactly. In mathematical terms that state has extremely low entropy. The model is highly confident.


But when it looks at a slide where the stroma and the glands are just completely ambiguous, the distribution flattens out, right?


Yes. The output might hover around 52% ICCA and 48% metastasis. The internal state of the model is chaotic. That is high predictive entropy. It is literally the quantifiable manifestation of the AI being confused.
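To make that concrete, here is a tiny sketch of predictive entropy computed from softmax probabilities. The 99/1 and 52/48 splits mirror the examples just discussed; the function itself is plain Shannon entropy, not the paper's exact G-ODIN scoring code.

```python
# Minimal sketch: predictive entropy as a confidence score, assuming we already
# have slide-level class probabilities (e.g. from softmax). Illustration only.
import numpy as np


def predictive_entropy(probs: np.ndarray) -> float:
    """Shannon entropy of a probability distribution (in nats)."""
    probs = np.clip(probs, 1e-12, 1.0)  # avoid log(0)
    return float(-(probs * np.log(probs)).sum())


confident = np.array([0.99, 0.01])   # sharply peaked: "99% ICCA"
ambiguous = np.array([0.52, 0.48])   # nearly flat: the model is confused

print(predictive_entropy(confident))  # ~0.056 nats -> low entropy, high confidence
print(predictive_entropy(ambiguous))  # ~0.692 nats -> near the 2-class maximum (ln 2 ≈ 0.693)
```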


Okay, it's like a medical student taking a multiple-choice test. Instead of blindly guessing and getting penalized, the student is allowed to just leave the hardest questions blank. They only answer the ones they are 100% sure about.


That's exactly it. And G-ODIN applies temperature scaling to that softmax function, which amplifies the difference between a genuinely confident prediction and one that's just noise.


So it draws a hard mathematical line,


right? It says any slide with predictive entropy above this threshold just will not receive a diagnosis.
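And here is roughly what that hard line looks like in code: a temperature-scaled softmax plus an entropy threshold that triggers abstention. The temperature of 2.0, the 0.35 threshold, and the example logits are made-up illustrative values, not parameters from the study.

```python
# Minimal sketch of temperature-scaled softmax plus an entropy-based abstention
# rule, in the spirit of the confidence gating described above. Values are assumptions.
import numpy as np


def softmax(logits: np.ndarray, temperature: float = 1.0) -> np.ndarray:
    z = logits / temperature
    z = z - z.max()                     # numerical stability
    e = np.exp(z)
    return e / e.sum()


def entropy(probs: np.ndarray) -> float:
    probs = np.clip(probs, 1e-12, 1.0)
    return float(-(probs * np.log(probs)).sum())


def diagnose_or_abstain(logits, temperature=2.0, entropy_threshold=0.35,
                        labels=("ICCA", "metastasis")):
    probs = softmax(np.asarray(logits, dtype=float), temperature)
    if entropy(probs) > entropy_threshold:
        return "abstain: refer to pathologist"       # too ambiguous, no AI diagnosis
    return f"{labels[int(probs.argmax())]} ({probs.max():.0%} confidence)"


print(diagnose_or_abstain([4.2, -1.1]))   # well-separated logits -> confident call
print(diagnose_or_abstain([0.3, 0.2]))    # near-tied logits -> abstain
```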


And the results of doing that were dramatic, weren't they?


Beyond dramatic. By filtering out those low-confidence, high-entropy guesses, the model's AUROC skyrocketed from 0.840 up to 0.958.


That's a massive jump. But the really huge thing, for you trailblazers listening, is that the metric that actually changes practice is the false positive rate. By using that threshold, the false positive rate dropped to absolute zero.


Zero. Not a single time did the AI confidently misclassify a metastasis as an ICCA, or vice versa. When it said it was sure, it was flawless.


But there's a trade-off, right?


Yeah, there is. The model only retained 46% of the samples for these high-confidence predictions. It essentially flagged the remaining 54% as too ambiguous, requiring standard human review. Which, honestly, means we need to contextualize what retaining 46% actually means. Right now, practically every patient with this biopsy goes into that clinical bottleneck for endoscopies.


Every single one.


Right. So if AI2CCA can definitively and safely clear 46% of those patients immediately, I mean, nearly half the patients could skip the scopes and fast-track right to treatment.


It basically halves the clinical backlog overnight without compromising safety for the other half.
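To connect the headline numbers, here is a small sketch of the selective-prediction bookkeeping: keep only the low-entropy slides, then compute coverage (the retained fraction), AUROC on that subset, and the false positive rate. The data below is synthetic and the entropy cut-off is an arbitrary assumption; only the workflow mirrors what the study reports (46% coverage, AUROC 0.958, zero false positives).

```python
# Minimal sketch of selective-prediction metrics on synthetic data: coverage,
# AUROC on retained (high-confidence) slides, and false positive rate.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 500
y_true = rng.integers(0, 2, n)                                            # 1 = ICCA, 0 = metastasis
p_icca = np.clip(y_true * 0.7 + rng.normal(0.15, 0.2, n), 0.01, 0.99)     # toy model scores
entropy = -(p_icca * np.log(p_icca) + (1 - p_icca) * np.log(1 - p_icca))  # binary predictive entropy

threshold = 0.4                              # illustrative entropy cut-off
retained = entropy <= threshold              # "high confidence" slides the model keeps

coverage = retained.mean()
auroc_retained = roc_auc_score(y_true[retained], p_icca[retained])
y_pred = (p_icca[retained] >= 0.5).astype(int)
fpr = ((y_pred == 1) & (y_true[retained] == 0)).sum() / max((y_true[retained] == 0).sum(), 1)

print(f"coverage: {coverage:.0%}, AUROC on retained: {auroc_retained:.3f}, FPR: {fpr:.3f}")
```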


Unbelievable. But I have to play devil's advocate here. Retrospective data is safe and controlled. The slides are archived. The variables are known.


And real world clinical data is incredibly messy. AI models usually degrade fast.


Exactly. So to truly change clinical practice, they needed a prospective test, right?


They did. To prove the model wasn't just over-indexed on European lab stains, they ran a phase 2 prospective validation. They locked the model and tested it on 161 brand new patients arriving at four international centers in France, India, and South Korea.


And testing in Asia is a profound stress test. I mean, the underlying etiology of ICCA in Western populations is often sporadic, but in parts of India and Korea, it's driven by endemic risk factors like hepatitis B or liver flukes, right?


Like Opisthorchis viverrini. Which means the background liver tissue surrounding the tumor looks fundamentally different.


The AI isn't just looking at the cancer. It has to navigate through heavy cirrhosis, severe chronic inflammation, processed with totally different chemical H&E stains.


Yeah, if the foundation model had only memorized the tint of a Parisian lab, it would have collapsed in Seoul. But the prospective results were stunning. In the French cohort, the high-confidence predictions hit a perfect AUROC of 1.00.


Literally perfect.


Perfect. And in the Asian cohort, navigating those totally different disease backgrounds, it achieved an AUROC of 0.965. Across the entire prospective series, the model made exactly one misclassification on its confident predictions.


Wait, really? Just one?


Just one error across international borders, distinct genomic populations, and different scanner hardware.


That proves it actually learned the fundamental biological morphology, not just a local shortcut. Yeah.


And the G-ODIN system caught the model's uncertainty globally.


It really is a masterclass in engineering clinical safety into deep learning. They've built a confidence-based biomarker that works on routine H& slides.


No extra genomic sequencing, no waiting weeks for specialized stains,


just rapid accurate triage right when the biopsy is digitized. It has the immediate potential to eliminate unnecessary gastrointestinal investigations for nearly half of these patients.


So, synthesizing this for the trailblazers listening today: Chang, Calderaro, and their team have essentially created a high-speed lane for diagnosing liver cancer safely.


Absolutely.


But it leaves us with a really provocative thought to end on. We always talk about AI trying to replace the human diagnostic process, but if a model like AI2CCA perfectly triages the easy 46%, the routine cases just disappear from your queue.


Yeah. Your daily workload becomes hyperconcentrated.


Exactly. When the AI is designed to know what it doesn't know, the human isn't the primary screener anymore. The human becomes the ultimate biological arbiter. You dedicate all your cognitive bandwidth to the 54% of cases that are so ambiguous they terrify the machine.


You become the ultimate safety net.


It's a massive shift in how we work. Something to mull over before your next shift. Thank you, trailblazers, for joining this journal club session of the Digital Pathology Podcast, and we will catch you on the next deep dive.