Digital Pathology Podcast
226: LLM Performance in Cervical Cytology Interpretation: GPT-5 vs. Gemini 2.5
Digital Pathology Podcast +
AI-powered summaries of the newest digital pathology and AI in healthcare papers.

Paper Discussed in this Episode: Can large language models like ChatGPT and Gemini interpret cervical cytology accurately? Saroja Devi Geetha. Annals of Diagnostic Pathology 2026; Volume 83, 152641.
Episode Summary: In this journal club deep dive, we explore what happens when advanced artificial intelligence is thrown into the visually chaotic realm of human biology. We examine a 2026 study evaluating whether two massive multimodal models, GPT-5 and Gemini 2.5 Pro, can accurately read digital cervical Pap smears without any prior fine-tuning. We unpack how these general-purpose models perform on highly specialized visual tasks, revealing that while they aren't ready to fly solo, they exhibit fascinating and distinct diagnostic "personalities" that will undoubtedly reshape the future of the pathology lab.
In This Episode, We Cover:
• The "Textbook" Test Setup: How researchers tested the baseline visual reasoning of GPT-5 and Gemini 2.5 Pro by feeding them 100 curated, gold-standard digital Pap test images from the Hologic Education Site to classify using the Bethesda System.
• The Clinical Reality Check: While the models only achieved a coin-toss exact diagnostic match rate (47% for GPT-5 and 48% for Gemini), their accuracy jumped to 66% when evaluated by clinical management protocols, proving they are beginning to grasp the underlying severity and medical consequences of cellular abnormalities.
• The Over-Anxious Resident (Gemini 2.5 Pro): Gemini acted like a highly sensitive but unrefined trainee, hitting 84% sensitivity and expertly spotting infectious organisms (71%). However, its tendency to confuse dense, overlapping cellular clumps with high-grade squamous intraepithelial lesions (HSIL) led to massive overcalling, dragging its specificity down to 71% and creating a risk of false alarms.
• The Big-Picture Academic (GPT-5): GPT-5 proved to be much more measured, demonstrating better overall specificity (74%) and excelling at identifying subtle structural shifts like low-grade squamous intraepithelial lesions (LSIL) (75%) and glandular changes. Yet, in its focus on the big picture, it completely missed obvious infectious organisms, scoring a dismal 20%.
• The Future of the Lab - Prompt Engineering & The Algorithmic Auditor: Why the next era of cytopathology requires rigorous AI fine-tuning on proprietary datasets and cytology-specific prompt optimization. We discuss a major paradigm shift where human pathologists may transition from actively hunting for disease to acting as "algorithmic auditors" whose primary job is to filter out the hyper-vigilant machine's noise.
Key Takeaway: Current multimodal LLMs are not yet reliable for independent Pap test interpretation due to critical blind spots and tendencies to overcall lesions. However, their out-of-the-box performance establishes a staggering baseline. By understanding their unique mechanical flaws, pathologists can prepare to use these systems as highly effective co-pilots, seamlessly combining the algorithm's computational brute force with the indispensable filter of human medical reasoning.
Welcome back to the digital pathology podcast.
Hey everyone, great to be here.
We are dedicating today to a very special journal club edition of the show.
Oh yeah, this is a fun one.
It really is. And our mission today is to answer a question that is, you know, likely occupying a massive amount of brain space for you, the trailblazer listening to this right now.
The question that's basically on everyone's mind in the field.
Exactly.
Can today's most advanced, state-of-the-art artificial intelligence actually read a Pap smear?
right? Not just text, but actually look at a slide.
Yes. And to find out, we are tearing into a brand new paper. It's titled "Can large language models like ChatGPT and Gemini interpret cervical cytology accurately?"
A very direct title,
Right to the point. It was authored by Saroja Devi Geetha and published in the Annals of Diagnostic Pathology, Volume 83, in August of 2026.
Hot off the presses.
Okay, so let's unpack this. I mean, we've heard so much about AI passing the medical board. Oh, constantly. Or drafting beautifully structured patient discharge summaries.
Exactly. But handing an AI a microscope slide and asking for a diagnosis? That requires an entirely different set of cognitive, or, well, computational muscles.
It forces these models out of the neat structured world of language, you know, and throws them into the highly subjective, visually chaotic realm of human biology.
Visually chaotic is the perfect way to phrase it.
Yeah. Because we're watching artificial intelligence sweep through healthcare, right? Automating administrative tasks, synthesizing research at this breakneck pace.
But cytology has remained like a largely uncharted frontier for large language models.
Primarily because the entire discipline just defies simple categorization.
I always think about it in contrast to other fields of medical imaging.
How so?
Well, usually when we discuss a medical diagnosis, there's an expectation of absolute undeniable precision. It feels like engineering.
Oh, I see what you mean. Like an X-ray.
Yes. You fall off your bike, you break your arm, the X-ray spits out an image with a jagged white line through the radius bone and the doctor just points to it.
Right. The diagnosis is structural. It's binary
and it's glaringly visible. But the moment you step into the world of cytology, and specifically when you're evaluating a cervical Pap smear, that pristine X-ray clarity just vanishes.
Well, completely. It's gone.
You're suddenly looking at this landscape of diagnostic muddy waters.
Yeah. You're staring at a universe of overlapping squamous cells
trying to decipher incredibly subtle variations in nuclear contours
or microscopic shifts in chromatin texture, right? It relies so heavily on the human eye
and decades of clinical intuition, which sets the stakes perfectly for Geetha's paper.
It really does because this study isn't testing whether an AI can write a polite email to a patient.
No,
it's testing whether two of the most powerful multimodal AI systems on the planet can actually perform that highly nuanced visual interpretation. And we are talking about the heavy hitters here.
OpenAI's GPT5 released in August 2025 and Google's Gemini 2.5 Pro released in March 2025.
Right. The researchers are asking a general purpose language model to look at that murky cellular landscape you just described
and output a highly specific clinical judgment. Before we get to the scorecard though and reveal whether these multi-billion dollar models passed or failed, we have to talk about the mechanics of the setup.
Yeah. How do you even test a chatbot on visual pathology,
right? Because it requires a true multimodal approach. We aren't just typing text prompts into a blank window and asking for medical advice based on symptoms.
No, we are feeding the models raw visual data. True multimodal models like GPT5 and Gemini 2.5 Pro don't just process text anymore.
They fuse vision, language, and reasoning.
Exactly. When you upload an image to these models, they break that image down into thousands of visual tokens.
Oh wow.
Yeah. They map the spatial relationships, the pixel densities, the color variations,
and then they cross reference that visual map against their massive internal training data,
which includes, you know, everything from medical textbooks to Wikipedia articles.
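That patch-into-token idea can be sketched in a few lines. This is a generic ViT-style illustration of visual tokenization; the actual tokenizers inside GPT-5 and Gemini are proprietary, so treat this purely as the concept:

```python
import numpy as np

# Minimal sketch of patch-based "visual tokenization" (ViT-style).
# The real pipelines inside GPT-5 / Gemini 2.5 Pro are unpublished;
# this only illustrates the general idea discussed above.

def image_to_patches(image, patch=16):
    """Split an HxWx3 image into flattened non-overlapping patches,
    each of which becomes one 'visual token' for the model."""
    h, w, c = image.shape
    rows, cols = h // patch, w // patch
    tokens = (image[:rows * patch, :cols * patch]
              .reshape(rows, patch, cols, patch, c)  # split rows/cols into patches
              .swapaxes(1, 2)                        # group the two patch-grid axes
              .reshape(rows * cols, patch * patch * c))
    return tokens

slide = np.zeros((512, 512, 3))   # stand-in for a digital Pap test image
tokens = image_to_patches(slide)
print(tokens.shape)               # (1024, 768): a 32x32 grid of 16x16x3 tokens
```

Each of those 1,024 vectors would then be embedded and cross-referenced against the model's training data, which is the "visual map" step described above.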
That's wild. Taking a general purpose model trained on the open internet and throwing it into a cytology lab is a massive conceptual leap.
It really is
because just to frame this for you as a trailblazer working in this space, we already have purpose-built tools for this,
right? We have FDA-approved systems.
Exactly. Like the Hologic Genius Digital Diagnostics System.
Yeah. And those were designed from the ground up, trained exclusively on thousands of Pap tests, just to assist in screening.
They are highly specialized.
So taking an AI that can, I don't know, write Python code or generate a recipe and asking it to diagnose cervical lesions, it's incredibly novel.
And the methodology Geetha used to bridge that gap is actually pretty elegant in its simplicity.
Walk us through it.
Well, they pulled 100 digital cervical Pap test images directly from the Hologic education site.
Okay, 100 images,
right? And the models were prompted to examine the images and provide a diagnosis strictly using the third edition of the Bethesda System for Reporting Cervical Cytology.
It's a very specific reporting standard.
Exactly. They even included additional specialized images specifically for cases involving infectious organisms
like Trichomonas or Candida.
Yeah, exactly.
Okay, I have to push back on the parameters of that methodology for a second.
Go for it.
Specifically, the volume. I mean, if we are evaluating the bleeding edge of artificial intelligence, models trained on literally trillions of parameters across massive server farms, is a sample size of 100 cases really enough to prove anything?
I get what you're saying.
It just feels like testing a supercomputer with a third-grade math quiz.
Yeah,
it's a microscopic drop in the bucket compared to the volume of slides a human cytopathologist sees in a single month.
It's a totally fair critique and honestly, it's one that frequently surfaces when evaluating clinical AI, right?
The power of this particular setup, however, lies in data quality over data quantity.
Okay. How so?
Well, the images pulled from the Hologic education site are not random, messy, real-world smears pulled from a busy clinic on a Friday afternoon.
Oh, nice.
The Hologic-provided diagnosis for each of these hundred cases is considered the absolute gold standard. These images are curated.
So, they are the most pristine representative examples of their respective pathologies.
Exactly. You are giving the AI the clearest, most textbook look at the cellular abnormality possible.
So, it's essentially the ultimate idealized final exam.
Yeah. We are serving them perfectly pitched softballs,
because if they strike out on these beautifully curated textbook examples, they have absolutely zero chance of surviving the messy artifact reality of a live hospital lab.
That is the exact underlying logic. They are asking these generalist models to perform a specialized task right out of the box
without any prior fine-tuning on proprietary pathology data sets.
Right? It is a pure benchmark test of their baseline multimodal reasoning capabilities.
Okay, so we have the gold standard, we have the AI's best guess, and we have the rigorous setup.
The stage is set.
The obvious next step is uncovering how the AI actually scored.
And the topline verdict will likely provide a deep sigh of relief for any cytologists listening to this.
Let's hear it.
Neither GPT-5 nor Gemini 2.5 Pro is suitable for independent Pap test interpretation. Not yet, anyway.
Okay, let's break down the exact numbers, because the devil is entirely in the details here.
Oh, absolutely.
When looking at exact diagnostic matches, meaning the AI looked at the image, analyzed the visual tokens, and nailed the precise Bethesda category without any deviation.
Right.
The performance was decidedly mediocre.
Mediocre?
GPT-5 achieved an exact match rate of 47%.
And Gemini 2.5 Pro hit 48%.
Yeah,
if we're being brutally honest, that is basically a coin toss.
It is. But the researchers didn't stop at that surface level exact match.
What did they do next?
They introduced a second, far more pragmatic layer of analysis. Yeah.
They evaluated the model's concordance when the diagnoses were grouped by clinical management categories.
Okay. Wait, I want to make sure I'm wrapping my head around the mechanics of that jump.
Sure.
If the AI misses the exact terminology, but we group by clinical management, we are essentially saying: did the AI suggest a diagnosis that would result in the exact same clinical action or follow-up protocol for the patient?
Yes, precisely.
So, if it uses a slightly different diagnostic term but still accurately recognizes that the patient needs a colposcopy, it gets a passing grade for that case.
That is the exact mechanism of the secondary analysis.
Okay.
And when evaluated through that lens of clinical utility, the concordance rate for both models jumped significantly
up to what
averaging around 66%.
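To make the two scoring layers concrete, here's a minimal sketch. The category labels are real Bethesda abbreviations, but the management mapping and the example cases below are illustrative stand-ins, not the study's actual protocol table or data:

```python
# Sketch of the two scoring layers described above: exact Bethesda match
# versus concordance after grouping by clinical management. The mapping and
# the example cases are made up for illustration, not taken from the paper.

# Map each Bethesda category to a coarse clinical-management bucket.
MANAGEMENT = {
    "NILM": "routine",    # negative for intraepithelial lesion or malignancy
    "ASC-US": "triage",   # e.g. HPV triage / repeat testing
    "LSIL": "colposcopy",
    "ASC-H": "colposcopy",
    "HSIL": "colposcopy",
}

def exact_match_rate(truth, predicted):
    """Fraction of cases where the model named the exact Bethesda category."""
    return sum(t == p for t, p in zip(truth, predicted)) / len(truth)

def management_concordance(truth, predicted):
    """Fraction of cases where the diagnosis maps to the same follow-up."""
    return sum(MANAGEMENT[t] == MANAGEMENT[p]
               for t, p in zip(truth, predicted)) / len(truth)

truth     = ["HSIL", "LSIL", "NILM", "ASC-H", "NILM"]
predicted = ["ASC-H", "LSIL", "ASC-US", "HSIL", "NILM"]

print(exact_match_rate(truth, predicted))        # 0.4 - misses the exact terms
print(management_concordance(truth, predicted))  # 0.8 - same follow-up anyway
```

The gap between the two numbers is exactly the jump the hosts describe: the model can pick a "wrong" word that still lands the patient in the right follow-up pathway.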
So, let's pause and be real for a second.
Yeah. If you are a patient waiting anxiously on a cancer screening and the lab director tells you that the algorithm evaluating your cells gets the clinical management right 66% of the time, that does not inspire a shred of confidence.
No, it really doesn't.
I mean, in the realm of high stakes pathology, a 66% success rate is still a failing grade. It means one out of every three patients might receive the wrong follow-up protocol.
Oh, if we only look at this as a finalized product ready for deployment tomorrow, 66% is absolutely unacceptable.
Right?
But if we connect this to the trajectory of artificial intelligence,
okay,
that jump from a 47% exact match to a 66% clinical management match is a massive glowing signal.
Really? How so?
It demonstrates that these models are no longer just regurgitating memorized text or performing rudimentary pixel matching. They are actually beginning to grasp the clinical implications of what they are seeing. Ah, yeah.
They are stepping over the threshold into actual medical reasoning.
So, they are capturing the underlying vibe of the disease, so to speak.
Yeah, that's a good way to put it.
Like the model might not be able to perfectly classify the cellular anomaly with the exact Bethesda vocabulary word,
right?
But its underlying neural network recognizes that the image is abnormal enough to warrant a specific escalated medical protocol.
Exactly. And that is a massive leap from where image recognition was even just two years ago.
That makes sense. It shows an understanding of severity and consequence rather than just geometric pattern recognition.
And that 66% average actually obscures a much deeper, far more fascinating story.
Oh
yeah. When you open up the hood of this study and look at the specific errors each model made on those 100 cases, the two systems did not fail in the same way.
They displayed vastly different diagnostic tendencies. It's almost as if GPT-5 and Gemini have entirely different diagnostic personalities.
Personalities. I love that. Let's dig into that.
The divergence in their failure modes is perhaps the most critical takeaway from Geetha's entire paper.
Okay, let's start by putting Gemini 2.5 Pro under the microscope.
Let's do it.
Gemini showed a dramatically higher overall sensitivity. It hit 84% sensitivity compared to GPT5's 74%.
Right. It was highly sensitive.
It was also incredibly effective at detecting infectious organisms, things like Trichomonas or viral changes from herpes simplex. It scored 71% there,
which is impressive.
While GPT5 completely bombed that category with a dismal 20%.
Yeah, GPT5 missed those.
So, Gemini catches the bugs.
It catches the bugs. Yes. But that impressive sensitivity came at a steep clinical cost.
What's the trade-off?
Gemini demonstrated a massive tendency to overcall cases.
Oh, overcalling.
Specifically, it aggressively flagged cases as HSIL, high-grade squamous intraepithelial lesion. It achieved an 82% concordance in the HSIL category, which sounds fantastic until you read the paper's explicit note that this high score was largely a byproduct of the model simply guessing HSIL at an incredibly high frequency.
Oh wow. So it was just spamming the high-grade alarm.
Exactly. And because of this constant overcalling, Gemini's overall specificity was dragged down to just 71%.
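The sensitivity/specificity trade-off behind that overcalling behavior is easy to see with toy confusion-matrix numbers. These counts are invented for illustration; they are not the study's actual tallies:

```python
# Toy confusion-matrix numbers (illustrative only, not from the paper)
# showing how an "overcalling" model trades specificity for sensitivity.

def sensitivity(tp, fn):
    """True positive rate: abnormal cases the model actually flags."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """True negative rate: normal cases the model correctly leaves alone."""
    return tn / (tn + fp)

# A measured model: flags fewer cases, so it misses some abnormals.
print(sensitivity(tp=37, fn=13), specificity(tn=37, fp=13))  # 0.74 0.74

# An overcalling model: flags almost everything, so it catches more
# abnormals (higher sensitivity) at the cost of many false alarms
# (lower specificity).
print(sensitivity(tp=42, fn=8), specificity(tn=29, fp=21))   # 0.84 0.58
```

The two metrics always pull against each other as a model lowers its threshold for flagging a case, which is the mechanical story behind Gemini's 84% sensitivity and dragged-down specificity.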
Let's think about the underlying mechanism of why an AI might do that. Yeah,
we know that in a pap smear, cells often clump together, right?
Quite often. Yeah.
You have dense overlapping three-dimensional clusters compressed into a two-dimensional digital image,
right?
So, if Gemini is breaking that image into visual tokens, it's likely looking at those dark overlapping cellular clumps, misinterpreting the density as hyperchromatic nuclei, and aggressively flagging them as high-grade precancerous lesions.
It's a very plausible theory.
It literally cannot decipher the spatial layers of normal overlapping cells. So its algorithm assumes the absolute worst.
The model lacks the specific visual refinement to separate a benign cluster from a malignant nucleus. So, it defaults to a hypersensitive alarm state.
It behaves exactly like an overly anxious first-year medical resident.
Yes, exactly.
You know the archetype. They are absolutely terrified of making a catastrophic mistake on their first day in the lab. So, they flag every single minor cellular irregularity as a high-grade emergency.
They refuse to miss anything,
right? They catch 71% of the infectious organisms, but they are recommending a colposcopy for perfectly healthy patients. They are hypersensitive but entirely lack refinement,
and that lack of refinement creates a bottleneck. Now contrast that anxious behavior with how GPT5 processed the exact same set of images.
Okay, so GPT5 operates like the much more measured academic student.
Yes, much more measured.
It demonstrated slightly better specificity, coming in at 74%. And it was vastly superior at identifying LSIL, low-grade squamous intraepithelial lesions, hitting a 75% accuracy rate there.
A huge improvement over Gemini in that category.
It also performed much better at distinguishing the incredibly tricky glandular lesions that Gemini struggled with.
What?
But its blind spot is glaring. Like we said, it completely misses the obvious bugs, scoring only 20% on infectious organisms.
It's fascinating.
It spots the subtle nuance of a glandular issue but misses the massive red flags of an infection.
Mechanically, this suggests GPT5's architecture might be heavily optimizing for broad global structural patterns across the entire slide.
Ah, looking at the big picture,
right? It is looking at the overall cellular arrangement to accurately classify those subtle low-grade or glandular changes.
Makes sense.
But in doing so, it effectively smooths out or ignores focal tiny anomalies like a small cluster of infectious organisms hidden in the background. It's seeing the forest so clearly that it completely fails to notice the invasive beetles crawling on the bark.
That is a perfect analogy. And this divergence raises a fundamental philosophical question that cuts to the very heart of medical screening.
What's that?
Well, cervical cytology is already a discipline plagued by significant human interobserver variability, you know.
Oh, for sure.
You can have two brilliant board-certified experts look at the exact same slide and occasionally disagree on the diagnosis.
It happens all the time.
So, if we are introducing artificial intelligence into this highly subjective mix, what is inherently more dangerous for the patient?
That's the million-dollar question.
Do we deploy a model like Gemini that constantly overcalls high-grade lesions, inevitably causing severe patient anxiety and triggering a cascade of expensive, invasive extra tests?
Or do we deploy a model like GPT-5 that is more measured and specific, but runs the terrifying risk of missing a high-grade lesion or an active infection entirely?
Wow. That is the ultimate tightrope walk in pathology.
It really is.
Missing a high-grade lesion is the absolute nightmare scenario. It represents a missed window for early life-saving intervention before cancer fully develops. But the flip side is equally destructive to the healthcare apparatus. If a model overcalls everything as HSIL, it floods the entire system with false positives
and it creates a huge backlog for biopsies.
It introduces immense psychological distress to the patient, and it completely negates the purpose of having an efficient, streamlined screening tool in the first place.
Right? You can't have an autopilot system that defaults to panic mode
and you can't have one that is confidently blind to infections either.
Which leads directly to the core conclusion of Geetha's research. These models, in their current out-of-the-box state, cannot be utilized for independent interpretation.
They just aren't ready to fly solo.
No. The limitations in identifying high-grade lesions with precision, coupled with their varied struggles with glandular abnormalities and infections,
carry clinical implications that are simply too critical to leave to an unsupervised algorithm.
So, if the AI is either too anxious or too focused on the big picture to do the job alone, where does this leave you?
It's a great question.
For the lab directors, the cytopathologists, and the cytotechnologists listening right now,
how do you actually prepare your lab for the inevitable integration of this technology?
Well, the paper outlines a very clear three pronged path forward to transform these raw models into highly effective adjunctive tools.
Okay, what's the plan?
To make them viable for the daily cytology workflow, the field must invest in rigorous fine-tuning on proprietary medical data sets.
Makes sense.
Extensive cytology-specific AI training and, crucially, prompt optimization.
Okay, that concept right there, prompt optimization in the context of a microscope slide is totally fascinating.
It's wild to think about.
Usually, when we talk about prompt engineering, we're talking about telling an AI to write a marketing email in a specific tone of voice, right? Or asking it to summarize a PDF.
Exactly.
We don't usually think about prompt engineering a cellular diagnosis.
No, we don't.
So, as a trailblazer in digital pathology, does this mean your future job description involves learning the exact linguistic phrasing required to coax a neural network into seeing a lesion?
It sounds kind of like science fiction, but it is rapidly becoming science fact.
Are we literally going to be typing, act as a world-class cytopathologist and strictly evaluate the nuclear-to-cytoplasmic ratio in the upper left quadrant of this image?
I mean, maybe. We already have robust data from other subfields of medical AI showing that how you frame the clinical context to a language model drastically alters how it weights its visual analysis.
Wait, really?
Oh yeah. Giving the model patient history or restricting its diagnostic parameters through a carefully engineered text prompt significantly reduces hallucinations.
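A cytology-flavored prompt template might look like the sketch below. The wording, the helper name `build_cytology_prompt`, and the constraint list are all our own illustration, not the prompts used in the study:

```python
# Hypothetical sketch of cytology-specific prompt engineering: fix the role,
# the reporting standard, the allowed answer set, and the clinical context
# before the image is attached. Illustration only; not the study's prompt.

def build_cytology_prompt(patient_age, history):
    """Assemble a constrained diagnostic prompt for a multimodal model."""
    allowed = ["NILM", "ASC-US", "ASC-H", "LSIL", "HSIL", "AGC"]
    return "\n".join([
        "Act as a board-certified cytopathologist.",
        "Evaluate the attached digital Pap test image using the",
        "Bethesda System (3rd edition) for reporting cervical cytology.",
        f"Patient age: {patient_age}. Relevant history: {history}.",
        "Comment explicitly on nuclear-to-cytoplasmic ratio, chromatin",
        "texture, and any infectious organisms before concluding.",
        f"Answer with exactly one category from: {', '.join(allowed)}.",
    ])

print(build_cytology_prompt(34, "prior LSIL, HPV-positive"))
```

Restricting the answer set and forcing the model to reason about specific morphologic features first is the kind of guardrail that, in other medical AI work, tends to reduce free-form hallucination.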
So the skill isn't just knowing the pathology, it's knowing how to speak the machine's language so it looks at the pathology correctly.
Precisely. The underlying shift here is about workflow collaboration. The most vital takeaway for any health care professional is that human critical thinking remains the absolute paramount skill.
So the machines aren't taking over the lab tomorrow.
These multimodal tools are not replacing the human eye anytime soon. The future of the lab belongs to those who learn how to seamlessly collaborate with the AI,
utilizing its computational brute force to cover human blind spots,
right? While applying human medical reasoning to filter out the algorithm's noise,
it fundamentally changes the daily workflow. I mean, you might deploy a model like Gemini as a hypervigilant, tireless second set of eyes.
Sure,
it can process a thousand slides a minute. And because you know it has an anxious personality, you let it flag all 50 mildly suspicious cases.
And then what?
Then you step in. You look at the clusters Gemini flagged as HSIL. You apply your years of real-world clinical context and you say, no, AI, that is just a reactive change,
or no that's just an artifact from the slide preparation
Exactly. You become the ultimate filter for the overcalls.
You are transitioning the human role from searching for a needle in a massive haystack to simply verifying whether the needles the AI collected are actually sharp.
That is such a cool way to look at it.
But to do that effectively, you must understand the specific failure modes of the AI you are employing.
Right, you have to know its personality.
Exactly. If you know Gemini overcalls, you calibrate your trust to account for false positives. If your lab uses GPT-5, you know you must maintain extreme manual vigilance for infectious organisms.
You manage the AI the same way you would manage the strengths and weaknesses of a human resident.
Well said.
And that level of management is exactly why deep discussions into studies like this one from Saroja Devi Geetha are so vital for the field. They really pull back the curtain.
They expose these massive opaque neural networks and reveal the exact mechanical flaws in their reasoning.
And we have to remember, we are also looking at the absolute floor of this technology's capability.
The floor, not the ceiling.
Definitely the floor. Remember, these models were tested right out of the box. Imagine the diagnostic power when a multimodal architecture like GPT5 is actually fine-tuned on millions of annotated cytology images
rather than relying on the general visual training it scraped from the internet.
Exactly.
The ceiling for this technology is staggering, even if the current baseline is a bit wobbly. So, let's bring all of these threads together.
Let's do it.
Multimodal AI powerhouses like GPT-5 and Gemini 2.5 Pro are officially knocking on the door of the cytology lab.
Loudly knocking.
However, based on this rigorous benchmark utilizing the Hologic gold standard, they are strictly co-pilots, not autopilots.
Very important distinction.
They offer a moderate 66% concordance when evaluating clinical management outcomes, but they come saddled with distinct, mechanically driven diagnostic quirks.
Yep.
Gemini operates as your highly sensitive, constantly overcalling alarm bell, while GPT-5 offers structural nuance but remains dangerously blind to key infections.
A perfect summary. And as we prepare for that future, I think there is a broader paradigm shift to consider.
Let's hear it.
We spend a lot of time discussing the debate between a model that overcalls versus a model that misses disease, right? But consider this. If we eventually fine-tune an AI that handles the rapid baseline screening of thousands of slides, okay?
And its natural algorithmic tendency is to hyperfixate on anything remotely suspicious, the entire role of the human expert flips.
Flips how?
Instead of spending your career hunting for the elusive presence of disease, your primary daily function might become proving the machine wrong.
Oh wow.
Are we transitioning the role of the cytopathologist from a diagnostic expert who finds cancer into an algorithmic auditor whose sole job is to protect healthy patients from the hypersensitivity of the machine?
That is a phenomenal paradigm shift to consider. Moving from the person who sounds the alarm to the person who turns the false alarms off.
Exactly.
That is something you can mull over during your next long session at the microscope. Thank you for joining us for this Journal Club edition of the Digital Pathology Podcast.
Thanks for listening everyone.
Stay curious. Keep pushing the boundaries of your field. And remember, even in a world dominated by advanced algorithms and massive data sets, the most vital diagnostic tool in the lab is still the human mind deciphering the muddy waters. See you next time.