Digital Pathology Podcast

217: AI vs. Pathologist: Validating Ki-67 Assessment in Pulmonary Neuroendocrine Neoplasms


Paper Discussed in this Episode:

Ki-67 Proliferation Index in Pulmonary Neuroendocrine Neoplasms: Interobserver Agreement Among Pathologists and Comparison of Two Artificial Intelligence-Based Image Analysis Systems. Teoman G, Turkmen Usta Z, Sagnak Yilmaz Z, Ersoz S. Biomedicines (MDPI), 2026.

Episode Summary:

In this journal club deep dive, we step into the lab to examine a direct comparison between expert human pathologists and artificial intelligence. We explore a 2026 study that evaluates how two different AI image analysis systems score the critical Ki-67 biomarker in Pulmonary Neuroendocrine Neoplasms (PNENs) alongside four experienced human experts. Unlike stories where AI and humans clash, this study explores a different exciting reality: Can AI perfectly match the human gold standard to automate and standardize a highly tedious, labor-intensive medical process?

In This Episode, We Cover:

The Diagnostic Challenge of Lung NENs: Understanding Pulmonary Neuroendocrine Neoplasms, a biologically diverse group of lung tumors ranging from slow-growing typical carcinoids to highly aggressive large cell neuroendocrine carcinomas. We discuss why precise classification is critical for predicting patient outcomes and guiding treatment.

The Spotlight Biomarker (The Speedometer): Ki-67: The definitive marker of active cellular proliferation, essentially acting as the tumor's "speedometer". While not formally incorporated into the WHO grading criteria for lung NENs, it is a vital clinical tool used to distinguish low-grade from high-grade tumors and identify biologically aggressive lesions.

The Showdown - Humans vs. AI: Four experienced pathologists go head-to-head with two digital heavyweights—the Roche uPath Ki-67 and the Virasoft Virasight Ki-67 algorithms. They analyzed 63 cases across different tumor subtypes, meticulously evaluating approximately 2,000 cells per predefined tumor hotspot.

Round 1 - Impressive Human Concordance: The human experts achieved near-perfect interobserver agreement (an Intraclass Correlation Coefficient of 0.998) when utilizing pre-selected hotspot regions, proving that standardized manual counting by experts is highly reliable.

Round 2 - AI Meets the Gold Standard: Both AI systems demonstrated strong, statistically significant correlations with the human experts' assessments. The AI reliably stratified the lung tumors into low, intermediate, and high-risk clinical categories without systematic bias, proving the algorithms can match human accuracy.

The Future of the Lab: Why AI shouldn't replace pathologists, but rather serve as a reproducible, objective assistant in the pathology lab. We discuss how automated AI analysis can reduce observer fatigue, enable rapid assessment of large tumor areas, and standardize testing across institutions, despite current roadblocks like algorithm complexity and a lack of wide accessibility.

Key Takeaway:

Artificial intelligence doesn't have to disagree with humans to prove its profound clinical worth. By successfully matching the excellent accuracy of top pathologists, these AI systems proved they can reliably handle the exhausting, subjective task of tumor cell counting. This paves the way for faster, highly standardized tumor evaluation, which could ultimately lead to more consistent and reliable prognostic assessments for lung cancer patients.

Get the "Digital Pathology 101" FREE E-book and join us!

Welcome back, trailblazers, to another deep dive into the source material right here on the Digital Pathology Podcast.


Hey everyone, great to be here.


So today is April 1st, 2026 and we've got a really fascinating matchup to look at. We are essentially looking at a humans versus machines showdown.


Yeah, but with you know actual life or death stakes in diagnostic medicine. It's not just a game.


Exactly. Yeah.


Our mission for today's deep dive is to explore a brand new paper that just dropped last month, March 2026, in the journal Biomedicines,


right which is published by MDPI.


Yep. And this research comes to us from a team at Karadeniz Technical University in Turkey. The authors are Gizem Teoman, Zeynep Turkmen Usta, Zeynep Sagnak Yilmaz, and Safak Ersoz.


And the paper is titled, let me make sure I get this exactly right, Ki-67 Proliferation Index in Pulmonary Neuroendocrine Neoplasms,


which is a mouthful but the hook here is just incredible. They pitted four human expert pathologists against two distinct AI systems to count tumor cells


to see who wins basically and more importantly to figure out if AI can safely standardize how we grade these specific notoriously difficult lung tumors


because right now that is a huge bottleneck isn't it?


Oh absolutely. I mean we often project this aura of like absolute mathematical certainty onto diagnostic medicine. We want to believe that classifying a tumor is just this strictly binary exercise.


Right. You either have it or you don't.


Exactly. But the reality of a pathology lab is incredibly tactile and quite frankly it's visually exhausting. The foundation of patient care frequently rests on someone just staring through a microscope


doing a painstaking, subjective manual count of cells.


Yes. Without succumbing to sheer visual fatigue. It's really hard work.


And that tension between the limitations of the human eye and the clinical demand for mathematical precision is exactly what we're unpacking today. So before we look at the AI showdown itself, we should probably establish why this specific medical problem needed solving.


Yeah. Let's talk about the subject matter. Pulmonary neuroendocrine neoplasms, or PNENs,


right? For the trailblazers listening, you probably know they're a highly heterogeneous group of lung tumors.


Highly, I mean, they run the absolute gamut. On one end, you have typical carcinoids or TCs, which are usually pretty slow growing and indolent.


But then on the other end, you've got large cell neuroendocrine carcinomas, the LCNECs, which are highly aggressive.


Exactly. Very poor survival outcomes there. But sitting right in the middle is the atypical carcinoid,


the tricky one,


very tricky. It's this notoriously difficult diagnostic gray zone. And accurately classifying where a patient's tumor falls on that spectrum dictates everything.


The therapy, the prognosis, everything,


right? So, you need an objective tiebreaker to figure out exactly how fast that tumor is growing.


Okay, let's unpack this for a second. Think of the Ki-67 protein like a neon sign flashing "under construction" inside a cell.


Oo, I like that.


The more of those neon signs you see, the faster the tumor is dividing and growing. That's our biomarker here, the Ki-67 proliferation index.


Yeah. And what's fascinating here is even though Ki-67 isn't formally incorporated into the current WHO grading system for lung neuroendocrine tumors just yet, it's widely used anyway.


You kind of have to, right?


Exactly. It's the de facto standard in routine practice. Pathologists use it as an essential adjunct marker, especially for those borderline cases where the morphology alone just isn't giving you an answer.


But this is where we hit that human bottleneck you mentioned earlier, because getting that Ki-67 index manually is, well, it's agonizing.


It really is. You're asking a highly trained physician to manually count hundreds sometimes thousands of cells. And first you have the interobserver variability of just picking the hot spot.


The hot spot being the most active dense area of the tumor.


Right? Pathologist A might look at a slide and think the top left corner is the hot spot, but pathologist B might think, uh, actually, the bottom right looks slightly more active.


So, they're not even counting the same cells to begin with.


Exactly. And beyond just finding the hot spot, you have the cognitive load of the count itself. You have to differentiate an actual tumor cell from, like, a reactive lymphocyte or a stromal cell


and then decide what actually counts as a positive stain,


right? Because biology isn't perfectly binary.


You use a DAB chromogen, which creates a brown stain in positive nuclei while negative ones are blue. But you frequently get these pale brown blushes.


So, you're constantly second guessing like, is that weak background noise or is that a true weak positive signal?


Exactly. And if you're sitting in a lab right now listening to this, you know exactly how exhausting it is to constantly calibrate that threshold in your head all day long.
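
For the trailblazers who like to see that dilemma spelled out, here is a minimal, purely illustrative Python sketch of how an intensity threshold turns per-nucleus DAB measurements into a Ki-67 index. This is not the Roche or Virasoft pipeline; the ki67_index function, the optical-density values, and the thresholds are all invented for illustration.

```python
import random

def ki67_index(nuclei_dab_od, threshold=0.15):
    """Return the Ki-67 proliferation index (% positive nuclei).

    nuclei_dab_od : list of mean DAB optical densities, one per tumor nucleus
    threshold     : OD above which a nucleus is called 'positive' (assumed value)
    """
    if not nuclei_dab_od:
        raise ValueError("no tumor nuclei were counted")
    positive = sum(1 for od in nuclei_dab_od if od >= threshold)
    return 100.0 * positive / len(nuclei_dab_od)

# Hypothetical hotspot of ~2,000 nuclei: most clearly negative, a few clearly
# positive, and a handful of ambiguous "pale brown blushes" near the boundary.
random.seed(0)
hotspot = [random.uniform(0.00, 0.10) for _ in range(1900)]   # clearly negative
hotspot += [random.uniform(0.30, 0.80) for _ in range(60)]    # clearly positive
hotspot += [random.uniform(0.12, 0.18) for _ in range(40)]    # ambiguous blushes

# Same nuclei, two slightly different thresholds: the reported index shifts.
print(ki67_index(hotspot, threshold=0.15))
print(ki67_index(hotspot, threshold=0.18))
```

Shifting the assumed threshold by a few hundredths moves the final index, which is exactly the calibration a tired human eye has to hold steady all day.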


It's no wonder pathologists disagree on these counts, especially in those atypical carcinoids where just a few percentage points changes the entire diagnosis.


Exactly.


So, knowing how variable manual counting is, how did Teoman and her team design a fair fight between the humans and the machines?


Well, they started with a highly curated data set. They pulled 63 retrospective cases of PNENs from between 2020 and 2025.


Okay. 63 cases.


Yep. 29 typical carcinoids, 13 atypical carcinoids, and 21 large cell neuroendocrine carcinomas.


I noticed they excluded small cell lung carcinomas entirely from this study.


They did. Yeah.


From a digital pathology perspective, that actually makes total sense, because small cell carcinomas are almost always diagnosed on tiny needle core biopsies. Right.


Right. Or cytology specimens. And for this kind of robust image analysis, you really need the massive real estate that you get from surgical resection.


Yeah. A tiny needle biopsy just doesn't give the algorithm or the human enough tissue architecture to reliably hunt for a true hot spot.


You're just looking at a fragmented sample. So by sticking to surgical resections, they ensured they had enough whole-slide context.


Okay. So they had these 63 cases. What were the actual rules of the game?


So, four highly experienced pathologists were tasked with independently evaluating the Ki-67 index. They had to manually count about 2,000 tumor cells per case.


2,000? Wow.


Yeah, it's a lot. And they were totally blinded. Blinded to each other's results, blinded to the AI, and blinded to the clinical data.


But there was a catch, right?


A brilliant catch. Crucially, they were all looking at the exact same predefined hotspot.


A senior pathologist actually went in and digitally annotated the exact region for every single case before they started.


Yep. All four humans and both AI systems were forced to look at the exact same pre-drawn box.


Okay, here's where it gets really interesting.


Yeah.


I have to ask if you're designing a study to test real world variability. And you know, the biggest problem is humans picking different hotspots.


Yeah.


Why did the researchers predefine the hot spot for them? Doesn't that artificially inflate how well the humans perform?


It does, but they did it to isolate the counting variable. Think about it. If you let everyone pick their own hotspots and the final percentages don't match, you have no idea why they didn't match.


Oh, I see.


Was the AI terrible at recognizing a tumor cell or was it just analyzing a completely different geographic region than the human?


So, you have to separate the task of finding the hot spot from the task of actually counting the cells.


Exactly. By predefining the hot spot, they brilliantly isolated the counting variable. They needed to prove whether the AI could actually classify cells as accurately as humans in a controlled environment,


eliminating the location choice altogether. That makes perfect sense. So, with the arena set, let's look at the results. How did the humans do?


Staggeringly well. Because the task was purely counting inside that pre-drawn box, the four pathologists had near perfect agreement.


The study reports an intraclass correlation coefficient, or ICC, of 0.998.


Right? And for context, an ICC of 1.0 represents absolute mathematical perfection. So 0.998 is just a stunning level of concordance.


It's basically flawless.


It proves that when you remove the ambiguity of finding the hot spot, human experts are incredibly consistent at categorizing these cells.


Although there was a slight dip when you look at the atypical carcinoids, right?


Yeah, good catch. The agreement across the board was great, but for those tricky atypical cases, the ICC dipped to 0.840,


which makes sense because that's that exact diagnostic gray zone we talked about,


right? Those tumors have heterogeneous staining patterns. Even looking at the exact same 2,000 cells, humans still experience a little friction deciding if a pale cell is positive or negative.
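
For anyone who wants to compute that kind of agreement statistic on their own validation set, here is a small sketch using the open-source pingouin library's intraclass correlation. The four-rater scores below are hypothetical placeholders, not the study's data, and assume pandas and pingouin are installed.

```python
# Sketch: an intraclass correlation coefficient (ICC) for four raters scoring
# the same cases, the agreement statistic the study reports. Scores are made up.
import pandas as pd
import pingouin as pg

# Long-format table: one row per (case, rater) pair with the Ki-67 score (%).
df = pd.DataFrame({
    "case": ["c1"] * 4 + ["c2"] * 4 + ["c3"] * 4 + ["c4"] * 4 + ["c5"] * 4 + ["c6"] * 4,
    "rater": ["P1", "P2", "P3", "P4"] * 6,
    "ki67": [2.1, 2.0, 2.3, 2.2,        # typical-carcinoid-like case
             1.0, 0.9, 1.1, 1.2,
             16.5, 15.8, 17.0, 16.1,    # atypical-carcinoid-like case
             22.0, 21.4, 23.1, 22.5,
             64.0, 62.5, 65.1, 63.0,    # LCNEC-like case
             85.0, 84.2, 86.0, 85.5],
})

# pingouin reports several ICC variants; ICC2 (two-way random effects,
# absolute agreement, single rater) is a common choice for interobserver studies.
icc = pg.intraclass_corr(data=df, targets="case", raters="rater", ratings="ki67")
print(icc[["Type", "ICC", "CI95%"]])
```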


Okay, so the humans brought a near-perfect 0.998 to the table. Let's introduce the AI competitors.


They used two distinct commercially available digital pathology systems.


First up was the Roche uPath Ki-67 algorithm, which analyzed slides scanned on a Ventana DP600 scanner.


And against it, they ran the Virasoft Virasight Ki-67 algorithm, using slides digitized on a Leica Aperio AT2 scanner.


Now the paper has some really vivid imagery explaining how these two algorithms actually see the tissue.


Yeah, they operate quite differently when you look under the hood.


From what I read, the Roche system operates almost like a simple binary on/off switch.


Exactly. It segments the nuclei and basically just drops a yellow dot on any nucleus it thinks is positive and a black dot on the negative ones. Hard binary,


right? But the Virasoft algorithm operates more like a dimmer switch. Yeah, it grades the intensity of the stain. So, it uses red dots for a strong positive, orange for moderate, yellow for weak, and blue for negative.


That's fascinating. So, how did they stack up against the humans?


Both AIs correlated incredibly well with the human experts. The Roche algorithm hit a Spearman correlation of 0.961.


Wow.


And Virasoft hit 0.904. Plus, the two AIs even agreed with each other at a rate of 0.926.


So, they're all basically arriving at the exact same clinical truth.


Exactly. Whether a tumor had a proliferation index of like less than 1% or a massive 85%, the algorithms tracked right alongside the human consensus.
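
The rank correlation itself is a one-liner if you ever want to check an algorithm against your own manual scores; this sketch uses SciPy, and the paired values are hypothetical stand-ins, not the paper's per-case data.

```python
# Sketch: Spearman rank correlation between manual and AI Ki-67 scores.
# All paired values below are hypothetical, purely for illustration.
from scipy.stats import spearmanr

manual_ki67 = [0.8, 2.1, 3.5, 12.0, 16.2, 24.0, 45.0, 63.7, 85.0]   # pathologist consensus (%)
ai_ki67     = [1.0, 2.4, 3.2, 11.5, 17.0, 22.8, 47.5, 61.0, 84.0]   # algorithm output (%)

rho, p_value = spearmanr(manual_ki67, ai_ki67)
print(f"Spearman rho = {rho:.3f}, p = {p_value:.2g}")
```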


Okay, but I have to push back here for a second.


Sure.


If the humans are already scoring a near-perfect 0.998 on their own, why do we even need the AI at all? It seems like an expensive solution to a problem that doesn't exist.


If we connect this to the bigger picture,


yeah,


you have to remember why the humans scored a 0.998 in this study.


Because the hotspot was predefined.


Yes, a senior pathologist had already done the grueling work of finding the hot spot and standardizing that 2,000 cell boundary for them.


And in a busy clinical workflow, nobody has time to do that for every single case.


Exactly. Pathologists simply don't have the time to meticulously hand-count 2,000 individual cells to achieve that perfection every time. They're often forced to sort of eyeball it,


which brings all that interobserver variability rushing right back in.


Right? So the AI isn't necessarily smarter than the human, but it provides that exact meticulous level of precision instantly


without getting eye strain,


right? Without fatigue. It standardizes care across different hospitals every single time.


Okay, that makes sense. But knowing the AI can count accurately is one thing. The real question is, does it actually translate to meaningful clinical diagnosis?


The "so what" of the study


exactly? Does it help the oncologist waiting for the pathology report?


Well, to answer that, the researchers translated those raw percentages into three actionable clinical tiers,


right? I have those here. A low index is 0 to under 10%. An intermediate index is 10 to 25%.


And high is anything over 25%.


And the data stratified these tumors beautifully into those exact buckets.


It really did. The typical carcinoid stayed strictly in the low bucket with a median of just 2.09%.


Barely ticking over.


Right. Then the atypical carcinoids sat perfectly in the intermediate bucket with a median of 16.2%. And those highly aggressive large cell neuroendocrine carcinomas just rocketed into the high bucket.


A median of 63.7% for those. So what does this all mean for our trailblazers listening today?


It means the categorical agreement between manual scoring and AI scoring was excellent. They used Cohen's Kappa to measure it.


And Roche hit a kappa of 0.877 while Virasoft hit 0.827.


Right. And anything over 0.8 is considered almost perfect agreement. So it proves AI isn't just generating accurate math.


It's generating reliable clinical interpretations.


Exactly. It can tell a doctor exactly how aggressive a patient's tumor is just as effectively as an expert human.
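
To make that concrete, here is a short sketch that buckets Ki-67 percentages into the low/intermediate/high tiers described above and measures categorical agreement with Cohen's kappa via scikit-learn. The ki67_tier helper and the paired scores are invented for illustration, not taken from the paper.

```python
# Sketch: turning raw Ki-67 percentages into the clinical tiers discussed above
# (low <10%, intermediate 10-25%, high >25%) and checking categorical agreement
# between a manual and an AI reading with Cohen's kappa. Scores are hypothetical.
from sklearn.metrics import cohen_kappa_score

def ki67_tier(index_percent: float) -> str:
    """Map a Ki-67 proliferation index (%) to a clinical tier."""
    if index_percent < 10:
        return "low"
    elif index_percent <= 25:
        return "intermediate"
    return "high"

manual = [2.1, 4.0, 8.9, 12.5, 16.2, 24.0, 30.0, 63.7, 85.0]   # manual scores (%)
ai     = [1.8, 4.5, 10.2, 13.0, 17.1, 22.5, 28.0, 61.0, 84.0]  # AI scores (%)

manual_tiers = [ki67_tier(x) for x in manual]
ai_tiers     = [ki67_tier(x) for x in ai]

kappa = cohen_kappa_score(manual_tiers, ai_tiers)
print(manual_tiers)
print(ai_tiers)
print(f"Cohen's kappa = {kappa:.3f}")   # >0.8 is conventionally "almost perfect"
```

Note how one borderline case (8.9% vs 10.2%) flips a tier and pulls the kappa below 1.0, which is exactly why categorical agreement is a stricter test than raw correlation.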


That's incredible. So, just to recap our journey today, we went from the biology of PNENs to a perfectly controlled counting showdown.


Yep. Isolating the counting variable


and proved that AI can definitively match the microscopic precision of an expert pathologist without the visual fatigue.


It's a phenomenal validation of the tech. But, you know, this raises an important question.


Oh, yeah. In this study, the AI proved it can flawlessly count the cells inside a predefined hotspot selected by a human. But as these digital pathology systems evolve, the next frontier isn't just counting.


It's hunting.


Exactly. What happens to the diagnostic workflow when the AI is tasked with autonomously scouring the entire slide to discover the most dangerous hot spot all on its own?


Man, that is a provocative final thought. When the machine stops waiting for us to draw the box and just starts drawing the box itself,


it's a whole different ball game.


Well, thank you Trailblazers for joining us on this journal club deep dive into the source material. We hope this exploration gives you something to mull over the next time you're evaluating a digital workflow for your lab


or staring down a tricky Ki-67 stain.


Exactly. Think about how AI might soon be acting as a digital second opinion in your own clinics. Keep pushing the boundaries and we will catch you on the next deep dive into the source material.