Digital Pathology Podcast

193: Entropy as a Lie Detector for Radiology

Subscriber Episode | Aleksandra Zuraw, DVM, PhD | Episode 193

This episode is only available to subscribers.

Digital Pathology Podcast +

AI-powered summaries of the newest digital pathology and AI in healthcare papers

Paper Discussed in this Episode:

Wienholt, P., Caselitz, S., Siepmann, R. et al. Hallucination filtering in radiology vision-language models using discrete semantic entropy. Eur Radiol (2026). https://doi.org/10.1007/s00330-026-12384-z

Episode Summary: In this deep dive, we strip away the marketing hype surrounding medical AI and confront the "black box" problem of Vision Language Models (VLMs) like GPT-4o. We examine a groundbreaking 2026 study published in European Radiology that tackles a terrifying clinical issue: these AI models are incredibly confident, articulate, and often completely wrong. We explore a clever new mathematical wrapper designed to catch the AI in a lie, forcing us to ask: how do we stop the AI from hallucinating with dangerous authority, and can we actually teach it to say "I don't know"?

In This Episode, We Cover:

The Confident Liar Problem (The Baseline): Why generalist VLMs are fundamentally different from traditional, narrow medical AI. They are probabilistic engines designed to predict the next word, resulting in a dangerous baseline accuracy of just 51.7% on real-world clinical data—essentially a coin flip.

The Mathematical Lie Detector (Discrete Semantic Entropy): How turning up the AI's "temperature" to 1.0 and asking the exact same question 15 times forces the model to brainstorm, revealing its hidden uncertainties.

Semantic Clustering (Cutting through the Noise): If the AI says "pneumonia" and then "lung infection," human clinicians know it means the same thing. We discuss how the DSE algorithm groups these answers by their underlying clinical meaning to calculate whether the AI is confidently consistent (low entropy) or randomly guessing (high entropy).

The Coverage Cost vs. Accuracy Trade-Off: The dramatic results of applying a strict DSE filter. GPT-4o's accuracy jumped from roughly 51% to over 76%, but with a massive catch—it remained completely silent on over half the cases, answering only 47.3% of the clinical questions.

The Danger Zone (Where AI Fails): Breaking down the performance across modalities. While the AI shone at identifying organs and surprisingly excelled at angiography, it completely fell flat on abnormality detection. On complex 3D CT scans, the filter had to reject over 90% of the questions because the model was fundamentally confused.

The Trap of the "Confident Hallucination": Why DSE measures consistency, not truth. We explore the nightmare scenario where an AI stubbornly hallucinates the exact same lie 15 times in a row, slipping past the safety filter and creating a massive risk for "automation bias" among clinicians.

Clinical Feasibility: The surprising practicality of running 15 parallel queries in a real hospital workflow. Because they run simultaneously via an API, the safety check takes only 6 seconds and costs roughly $0.72 per question.

Key Takeaway: Building safer AI might paradoxically risk creating riskier doctors. While Discrete Semantic Entropy successfully filters out the AI's digital noise and confusion—transforming a failing model into a somewhat reliable, albeit very quiet, assistant—it leaves us with a critical human factors challenge. If the system flawlessly cherry-picks the easy cases and stays silent on the hard ones, we must ensure our own diagnostic muscles don't atrophy from over-trusting the machine.

Get the "Digital Pathology 101" FREE E-book and join us!

Hello and welcome back, trailblazers. You are tuning in to the Digital Pathology Podcast.


Yeah, thanks for having me. I am definitely excited for this one.


It is great to have you here. This is the place where we try to basically strip away the marketing hype and look squarely under the hood of the medical AI revolution,


right? Because there is a lot of hype right now.


So much hype. And today we are tackling a subject that I know keeps a lot of radiologists awake at night.


It certainly keeps me awake. I mean, we're really talking about the black box problem, specifically looking at it in the context of vision language models,


right, exactly. Because we have seen this massive explosion of AI tools recently, you know, GPT-4o, Gemini, Claude. Models that can see, they can read, and seemingly reason through complex data,


yeah, and the pitch from the tech companies is always exactly the same. They say things like, this will solve the radiologist shortage, or, this is your new super resident,


but anyone who has actually played around with these models in a clinical setting knows there's a pretty massive catch.


A very dangerous catch, actually.


These models are incredibly confident. They are articulate and they're often just completely wrong. They hallucinate.


Yeah, they hallucinate with such authority. And you know, in a creative writing app or if you're writing a marketing email, a hallucination is just a quirky feature.


But in a hospital setting, it's a malpractice lawsuit waiting to happen.


Or obviously worse, it's a patient safety disaster.


Exactly. So the core question we are exploring for you today is: how do we stop the AI from lying to us?


Or maybe more importantly, can we actually teach it to admit when it just has no idea what it's looking at?


That's the dream. And to figure that out, we are doing a deep dive into a really significant paper that was published in European Radiology in 2026.


Yes. The paper is titled Hallucination Filtering in Radiology Vision-Language Models Using Discrete Semantic Entropy. It's authored by Patrick Wienholt and his colleagues.


It is a bit of a mouthful of a title. I know.


It is a huge mouthful. Yeah. But don't let the title scare you off.


Right. The concept behind discrete semantic entropy, or DSE, is actually incredibly intuitive once you break it down.


It really is. It's essentially a mathematical lie detector test for AI.


I love that framing. A lie detector for the black box. But before we get to the actual solution, let's establish the baseline here. We are talking about vision language models, or VLMs. How is this fundamentally different from the medical AI we've been hearing about for the last, I don't know, 10 years?


That is a really crucial distinction to make. So traditional medical AI was often very narrow. You would have a model trained specifically to find say lung nodules on a CT scan


and that's all it did.


That is literally all it did. It was a one-trick pony. But VLMs are completely different. They are generalists.


So you can throw anything at them.


Exactly. You can show them an X-ray, an MRI, a photo of a skin lesion, and then you can ask them anything in plain English, like what is the diagnosis, or is the heart enlarged, or where is the fracture,


which honestly sounds like the absolute dream scenario for a busy clinical workflow.


It does, but this flexibility comes with that confident liar problem we just mentioned.


Right? Because unlike a human resident who might look at a scan and say, "I'm honestly not sure. Let me check with the attending."


Yeah, these models don't do that. They are probabilistic engines. They are just designed to predict the next most likely word. They don't actually have a concept of human truth.


They only have a concept of likelihood.


Exactly. So, they will describe a fracture that isn't actually there with the exact same tone and linguistic certainty as a fracture that is completely obvious.


And Wienholt's team actually put some hard numbers to this, didn't they? They didn't just rely on anecdotal evidence.


No, they did a very rigorous baseline test. They used two models, GPT-4o and GPT-4.1, and they ran them through a combined data set of 706 image and question pairs.


And this data set is important, right? Because it wasn't just clean, perfect textbook images,


right? It wasn't just the VQA-Med database, which is very curated. They also included the VQA-RAD dataset,


which is the real-world clinical data.


Yes. It's messy. It's ambiguous. It's 2D images from routine everyday exams.


And real world data is usually where AI tends to fall on its face. So, what actually happened when they just asked the AI to answer these clinical questions?


It was pretty sobering, honestly. For GPT-4o, the baseline accuracy was 51.7%.


51.7%. I mean, that is effectively a coin flip.


It literally is.


Yeah.


GPT-4.1 was marginally better at 54.8%.


Yeah.


But if you just think about that for a second in a real clinical context.


Yeah. If you had a colleague who was wrong every other time they opened their mouth,


but they spoke with absolute unshakable authority.


How long would they last in your department? They wouldn't make it to lunch.


Not a chance. And that is terrifying. It means you essentially cannot trust a single word the model says without completely verifying it yourself,


which completely defeats the purpose of having the AI there to save you time in the first place.


Exactly. Right. And this is the exact problem Wienholt and his team set out to solve,


because they realized they couldn't easily just fix the model to make it smarter. Right.


No, you can't. That requires retraining the entire neural network, which costs tens of millions of dollars and massive compute.


But they thought, maybe we could build a filter to catch the hallucinations before they ever reach the doctor's screen.


Enter discrete semantic entropy.


Okay, so for the trailblazers listening who haven't touched a physics textbook in 20 years, the word entropy usually means disorder or chaos. How are they using it in this context?


In this context, think of it as a measure of uncertainty. The researchers used a really clever technique here. Instead of asking the AI the question just once, they asked the exact same question 15 separate times.


Wait, okay. But if I ask a calculator 2 plus 2 fifteen times, it just says 4 fifteen times. Does the AI actually change its mind?


Normally, no, it wouldn't. But they tweaked a specific parameter in the AI called temperature.


Ah, temperature.


Yeah. In AI, temperature controls creativity or randomness. If the temperature is set to zero, the model is completely robotic. It will always pick the single most mathematically likely word.


But they didn't leave it at zero.


No, they cranked the temperature all the way up to 1.0, which forces the model to explore different linguistic possibilities.


You are essentially telling it, hey, don't just give me your first guess. Give me your second, third, and fourth guess, too.


You're forcing the AI to brainstorm, essentially.


So they generate 15 different responses to the exact same image.
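
To make the sampling step concrete, here is a minimal Python sketch of the idea. The query_vlm helper is a hypothetical stand-in for whatever vision-language API you would actually call; the paper's client code is not reproduced here, and the canned answers just let the sketch run on its own.

```python
import random

def query_vlm(image: str, question: str, temperature: float = 1.0) -> str:
    """Hypothetical VLM call: returns one free-text answer per invocation.
    The random choice below simulates a model sampling at temperature 1.0."""
    return random.choice(["pneumonia", "lung infection", "consolidation"])

N_SAMPLES = 15  # the study samples each image-question pair 15 times

answers = [
    query_vlm("chest_xray.png", "What is the diagnosis?", temperature=1.0)
    for _ in range(N_SAMPLES)
]
print(answers)  # 15 differently phrased guesses at the same question
```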


And then comes the semantic part of discrete semantic entropy,


right? They analyze those 15 different answers to see if they actually mean the same thing in a medical context.


So, this is the semantic clustering part.


Yes. Because the AI might look at a scan and say pneumonia one time.


Mhm. And then say lung infection the next time, and maybe consolidation the third time,


and to a standard computer program, those are just completely different character strings,


but clinically, we know those map to the exact same underlying concept. So their algorithm groups those similar meanings together.


So if the AI gives 15 answers that use different phrasing, but all ultimately mean pneumonia, that represents a consistent story?


Correct. That is what we call low entropy. It means the model, even when you force it to be creative, just keeps arriving at the exact same conclusion,


which suggests it has high confidence and is probably trustworthy on that specific image.


Exactly. But on the other hand, if the AI says pneumonia, and then the next guess is tumor, and then pleural effusion, and then normal,


then you just have a complete mess.


You have a total mess. That is high entropy. The model is all over the place. It's totally guessing,


which strongly suggests the model is hallucinating because it doesn't actually have a firm grasp on what it is looking at


precisely.
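
As a formula, written in standard notation rather than anything quoted from the paper: if the 15 sampled answers fall into semantic clusters c = 1, ..., K, with n_c answers landing in cluster c, the discrete semantic entropy is

```latex
p_c = \frac{n_c}{15}, \qquad
H_{\mathrm{DSE}} = -\sum_{c=1}^{K} p_c \log p_c
```

H_DSE is 0 when all 15 answers land in one cluster (total consistency) and grows as the answers scatter across clusters (guessing).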


I really like that mechanism because it's not just checking for specific keywords. It's checking for the actual consistency of the medical meaning.


And the real beauty of this approach is that it treats the AI entirely as a black box.


Right? You don't need access to the underlying neural weights.


No, you don't need the proprietary code of GPT-4o at all. You just look at the final text output. It's a very practical external wrapper that anyone can use.
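
Here is a runnable Python sketch of that wrapper logic. One big caveat: the paper groups answers by clinical meaning with a proper semantic-equivalence step (something like an entailment check or an LLM judge); the hand-written synonym map below is only a toy stand-in so the arithmetic is visible.

```python
import math
from collections import Counter

# Toy stand-in for semantic clustering; a real system would use an
# entailment model or LLM judge to decide which answers mean the same thing.
SEMANTIC_CLUSTER = {
    "pneumonia": "pneumonia",
    "lung infection": "pneumonia",
    "consolidation": "pneumonia",
    "tumor": "tumor",
    "pleural effusion": "pleural effusion",
    "normal": "normal",
}

def discrete_semantic_entropy(answers: list[str]) -> float:
    """Entropy over semantic clusters: 0.0 means perfectly consistent."""
    clusters = Counter(SEMANTIC_CLUSTER.get(a.lower(), a.lower()) for a in answers)
    total = len(answers)
    return -sum((n / total) * math.log(n / total) for n in clusters.values())

consistent = ["pneumonia", "lung infection", "consolidation"] * 5  # 15 answers
scattered = ["pneumonia", "tumor", "pleural effusion", "normal", "pneumonia"] * 3

print(discrete_semantic_entropy(consistent))  # 0.0   -> low entropy, keep
print(discrete_semantic_entropy(scattered))   # ~1.33 -> high entropy, reject
```

Note the natural log here is our choice; whether the paper's threshold assumes natural or base-2 logarithms isn't stated in this episode, so treat the absolute numbers as illustrative.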


Okay, so let's get to the final verdict on this. They have this mathematical lie detector running on 15 parallel queries. Does it actually work? Does it improve that dismal 51% accuracy?


It does. And honestly, the jump is quite dramatic. So, they set a strict filter threshold, meaning if the entropy score was above 0.3, the system would just refuse to answer entirely.


It would just output I don't know


exactly or I'm uncertain,


and in medicine, silence is always better than a confident lie.


100%.


Yeah.
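
In code, that abstention rule is just a threshold check on the entropy. A minimal, self-contained sketch: the 0.3 cutoff is the paper's strict setting, but the log base is again our assumption, and the cluster labels are assumed to come from a semantic-clustering step like the one sketched above.

```python
import math
from collections import Counter

def filtered_answer(clustered: list[str], threshold: float = 0.3) -> str:
    """Selective prediction over cluster labels, one per sampled response."""
    counts = Counter(clustered)
    total = len(clustered)
    entropy = -sum((n / total) * math.log(n / total) for n in counts.values())
    if entropy > threshold:
        return "I don't know"               # too inconsistent: abstain
    return counts.most_common(1)[0][0]      # consistent: report majority cluster

print(filtered_answer(["pneumonia"] * 15))                 # -> "pneumonia"
print(filtered_answer(["pneumonia"] * 8 + ["tumor"] * 7))  # -> "I don't know"
```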


And with this strict filter in place, GPT-4o's accuracy jumped from 51.7% all the way up to 76.3%.


Wow, that is a massive improvement. You're talking about roughly a 25 percentage point gain just by filtering.


You essentially turned a failing student into a totally respectable B student.


That's incredible. And did the other model improve, too?


Yeah, GPT-4.1 also saw a significant bump. It went up to 63.8%. So the core hypothesis absolutely holds water. High entropy strongly correlates with wrong answers.


So, if you filter out the high entropy, noisy responses, you are just naturally left with a much higher quality set of accurate answers.


That's the mechanism. Yeah.


But there is no free lunch, especially in medical technology. You mentioned the system refuses to answer. So practically speaking, if I am a radiologist with a stack of a hundred images to get through and I run them through this filtered system, how many answers am I actually getting back?


That is the big trade-off here, and it is a pretty steep one. In the literature, we call it the coverage cost.


Okay. To achieve that 76% accuracy, the model had to choose to stay completely silent on a huge portion of the clinical cases.


How huge are we talking?


Specifically, for GPT-4o at that strict 0.3 threshold, it only answered 47.3% of the questions.


Wow. So less than half.


Yes. Out of the 706 questions it only provided an answer for 334.


And for the other 372, it effectively just said, I am way too uncertain to speak on this. That really changes the whole value proposition of these tools, doesn't it? Because you aren't buying an autonomous diagnostic tool that does your job for you. You are buying a tool that just cherry-picks the easy cases and leaves all the hard ambiguous ones for the human to figure out,


which to be perfectly fair is exactly what we want in a human in the loop workflow right now.


That's a good point.


We refer to this whole philosophy as selective prediction. And the general consensus in the medical AI community is that we would much rather have an AI that does 50% of the work with high reliability


than an AI that tries to do 100% of the work but lies to you half the time.


I agree completely with that. But I do want to dig into what specific types of questions it was actually answering, because radiology is a massive field. It covers everything from simply asking, is this a CT scan, to, is this a glioblastoma, right? Was the performance uniform across all those different clinical tasks?


Not at all. And this is where the paper gets really nuanced and interesting. The breakdown by category tells a very specific story.


Let's unpack that then. Where did the model actually shine?


It shone on the absolute basics. So, modality identification,


like is this an X-ray, a CT or an MRI?


Yes. And organ identification, like is this an image of a brain, a lung, or a liver? The models are naturally very consistent on recognizing these broad visual patterns.


So the entropy was very low on those


Very low entropy, and therefore the rejection rate was also very low. It answered those confidently and correctly.


That makes sense. Those are visually distinct tasks. But were there any surprises?


Actually, there was one surprising standout category, and that was angiography.


Really? Blood vessels.


Yes. Interestingly enough, GPT-4o went from 50% baseline accuracy on angiography all the way to 75% after the DSE filter was applied.


That is surprisingly good. Why do you think it performed so well there?


It seems it's the high-contrast visual nature of angiography. You know, bright white vessels on a very dark background. It's a visual pattern that the model can be surprisingly consistent about interpreting.


That is definitely promising for vascular work, but I suspect the news isn't nearly as good for the thing we actually care about the most, which is pathology.


No, it's really not. This is what we could call the danger zone of the study. When it came to abnormality detection, actually finding and naming the disease, the baseline performance was just abysmal.


What were the numbers looking like?


We are talking about a baseline accuracy before filtering of around 13%.


13%. I mean, honestly, that is worse than random guessing in some multiple-choice contexts.


It is frighteningly low. It clearly shows that these general purpose vision models just do not fundamentally understand clinical pathology yet.


But did the filter help at all?


The DSE filter did help. Relatively speaking, it raised the accuracy from 13% to about 36.4%.


I mean, that's a relative improvement, sure, but 36% accuracy is still absolutely not something you would ever deploy in a live clinic.


And here's the real kicker with that stat. To get to that 36% accuracy, the model had to reject over 90% of the abnormality questions.


So practically speaking, if I ask the AI, is there a tumor here? Nine times out of 10, it will say absolutely nothing.


Right?


And the one time it finally does speak up, it is still only right a third of the time.


In the specific context of abnormality detection,


yes, that is exactly what the data shows.


Wow.


And it gets even more specific. For CT scans, the filter was incredibly aggressive.


Aggressive.


For GPT-4o, interpreting CT scans, the filter ended up retaining only 8.3% of the questions.


It rejected over 90% of CT questions.


Yes. And in a very weird counterintuitive twist, the accuracy for those remaining CT questions actually dropped slightly after filtering in some specific instances.


Wait, that doesn't make any sense. Why would the accuracy go down if we are specifically filtering out the uncertain, noisy answers?


It strongly suggests that for highly complex 3D data like a CT scan, the model is so fundamentally confused by what it's looking at that its consistent answers are just as likely to be completely wrong as its random answers.


Which leads us right into the major logical trap of this whole methodology. The authors of the paper call it the confident hallucination.


Yes. And this is probably the most critical takeaway for our listeners today. We have to remember that DSE measures consistency.


It does not measure truth.


Exactly.


So if the AI is hallucinating, but it hallucinated the exact same lie 15 times in a row,


then the calculated entropy is zero. The system looks at that and thinks, "Wow, the model is rock solid on this interpretation. This must be the safe correct answer."


But it is dead wrong.


It is dead wrong.


Did they actually find specific examples of this happening in the clinical study?


They did find them. Yes. There was a really clear example in Figure 3 of the paper. It was an MRI of the spine. And the question asked to the AI was very simple. Is this a contrast or a non-contrast MRI?


Just a basic binary choice.


Exactly. And the AI looked at it and said non-contrast 15 times in a row. It was totally consistent across all temperature runs.


But let me guess, it was actually a contrast scan.


Yes, it was factually a contrast scan. But because the AI was stubbornly wrong rather than confusedly wrong, the DSE filter let it right through as a certified safe answer.
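
To make that failure mode concrete with the same entropy arithmetic as before (the spine MRI case is from the paper's Figure 3; the code itself is our illustration, not the authors'):

```python
import math
from collections import Counter

def entropy(labels: list[str]) -> float:
    counts = Counter(labels)
    return -sum((n / len(labels)) * math.log(n / len(labels))
                for n in counts.values())

# Ground truth: the scan IS contrast-enhanced. The model insists otherwise,
# identically, on all 15 sampled runs.
confident_hallucination = ["non-contrast"] * 15

print(entropy(confident_hallucination))  # 0.0 -> passes any entropy threshold
```

Fifteen identical wrong answers produce zero entropy, so the filter certifies the hallucination as safe.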


That is truly the nightmare scenario for a clinician.


It really is


because if I am the radiologist working with this tool, I have been systematically trained by the software to trust the silence.


Right. You assume that if the AI finally speaks up, it has passed this rigorous safety check.


Yes. So, I let my guard down.


And that introduces the massive risk of automation bias. If we create a safety system that successfully filters out all the obvious noise and confusion, we will naturally lower our guard for the signals that do get through.


Because if the AI is right 90% of the time that it speaks, we might eventually just stop checking its work.


And that one confident hallucination that slips through the filter is the one that will ultimately hurt a patient.


It really seems like DSE is an excellent tool for catching confusion, but it simply cannot catch incompetence.


That is a perfect way to summarize it. It filters out the uncertainty, but it cannot filter out a firmly held delusion.


We also definitely need to touch on the technical limitations of the study itself because you mentioned they used real world data, but these were still just 2D images, right?


Yes. And that is a massive limitation to keep in mind. Real clinical radiology is volumetric,


right? A radiologist doesn't just look at one flat JPEG of a lung.


No, they scroll through hundreds or thousands of slices in a 3D stack. They're looking at the continuous 3D structure of the tissue.


So asking this AI to diagnose a complex pathology from a single 2D slice is a bit like asking a mechanic to diagnose engine failure by looking at a Polaroid of the hood.


It is absolutely a best case scenario estimate of its capabilities. The authors are very transparent and acknowledge this. In a real 3D clinical workflow, the computational complexity just explodes.


We really don't know if the DSE consistency would hold up.


No, we don't.


And we also don't know if the computational cost would become prohibitive when you're running 15 parallel queries on a scan with 500 slices per patient.


Speaking of cost and computation time, that is usually the massive barrier to entry for any new AI tool in healthcare. Hospital administrators always say, "This sounds great, but we cannot afford to buy a supercomputer."


right?


Is this parallel query method actually feasible for a regular hospital budget?


Surprisingly, yes, it is very feasible.


The authors did a thorough feasibility analysis on this because the 15 queries you send are entirely independent of each other. You can just run them in parallel on the API.


So, you don't have to wait for query one to finish before you start query two.


Exactly. It doesn't take 15 times as long. For the whole process, the latency, the delay, was only about 6 seconds.


6 seconds, that's really nothing.


It's roughly twice the time of a single standard API call. And in a real radiology workflow, a 6-second wait for a safety check second opinion is completely negligible.
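
A quick sketch of why the wall-clock cost stays so low: the 15 samples are independent requests, so they can be fired concurrently. The async helper here is hypothetical (a sleep stands in for network latency); real code would await an actual API client.

```python
import asyncio
import random

async def query_vlm_async(image: str, question: str) -> str:
    """Hypothetical async VLM call; the sleep simulates ~3 s of API latency."""
    await asyncio.sleep(3)
    return random.choice(["pneumonia", "lung infection"])

async def sample_in_parallel(image: str, question: str, n: int = 15) -> list[str]:
    # All n requests run concurrently, so total latency is roughly one
    # call's worth rather than n calls' worth; this is the basis of the
    # roughly 6-second figure quoted above.
    tasks = [query_vlm_async(image, question) for _ in range(n)]
    return await asyncio.gather(*tasks)

answers = asyncio.run(sample_in_parallel("chest_xray.png", "Diagnosis?"))
print(answers)  # 15 answers back in ~3 s, not ~45 s
```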


And what about the actual financial cost of running those extra queries?


Using standard commercial API pricing, they estimated it costs about $0.72 per question.


$0.72. I mean, that is less than a cup of break room coffee, which is nothing for a safety mechanism that boosts your baseline accuracy by 25 percentage points. That is incredibly cost effective.


It really makes it financially viable to integrate this into daily clinical workflows right now, assuming, of course, you accept those heavy coverage limitations we talked about.


Yes, the economics of it are very sound.


So, let's bring this all together for the trailblazers listening. We have a paper here that clearly shows we can take a generic, hallucination-prone AI and turn it into a fairly reliable, albeit very quiet, clinical assistant.


I think the best way to view DSE moving forward is as a practical wrapper. We as clinicians can't fix the black box inside. The massive tech companies hold the keys to those weights,


right? We can't retrain GPT-4o.


But as healthcare professionals and clinical researchers, we can build these protective wrappers. We can build intelligent safety valves like DSE that sit right on top of the commercial models.


It really empowers the clinical implementation side of the industry. We don't have to just sit around and wait for the perfect, flawless model to be invented,


which may never happen.


Right. We can take imperfect available models and engineer ways to make them safer for our patients today.


Exactly. It practically facilitates that human in the loop model we always talk about wanting. It shifts the dynamic entirely


from the AI is autonomously diagnosing the patient to the AI is flagging the easy stuff and explicitly asking the human for help on the hard stuff.


But the ultimate burden of clinical vigilance still remains firmly on us.


Absolutely. The confident hallucination phenomenon proves that beyond a shadow of a doubt. We can never fully outsource the final diagnostic judgment. The AI is a tool. It is not an oracle.


Very well said.


I do want to leave our trailblazers with a final thought to mull over as they head back to the clinic. We are actively moving toward integrating systems that are designed to only speak up when they are highly confident,


which logically is the safer way to build them.


But paradoxically, I wonder does a safer AI eventually make for a riskier doctor?


That's a fascinating way to look at it.


Think about it. If the AI is silent on all the hard, ambiguous cases, we stay sharp because we have to figure them out. But if the system starts handing us perfectly correct diagnoses on a silver platter for all the routine stuff, day after day after day,


do our own diagnostic muscles slowly start to atrophy?


Exactly. Do we eventually just get fatigued and click approve without truly scrutinizing the image?


That is the ultimate human factors question of the AI era, right?


We are successfully filtering out the digital noise, but we have to make absolutely sure we don't accidentally filter out our own critical thinking in the process.


A very sobering thought to end our deep dive on today. If you want to dive into the Sankey diagrams, which are actually really cool to look at, and see the specific failure cases they documented, I highly recommend reading the full paper.


Yeah, it is definitely worth a read for the granular data alone.


Again, that is by Wienholt and colleagues in European Radiology, 2026. Thank you all for listening. Keep questioning, keep validating, and keep blazing those trails. We will catch you on the next deep dive.