Digital Pathology Podcast

159: What If Your AI Tool Is Lying: Hidden Bias in Pathology Algorithms

Aleksandra Zuraw, DVM, PhD Episode 159


What if the AI tools we trust for cancer diagnosis are not always correct? This episode of DigiPath Digest takes on the uncomfortable but critical question: can AI “lie” to us—and how do we verify its performance before adopting it in clinical practice?

Highlights:

  • [00:02:00] Foundation models in action: Deployment of a fine-tuned pathology foundation model for EGFR biomarker detection in lung cancer—reducing the need for rapid molecular tests by 43%.
  • [00:08:41] Bone marrow AI misclassifications: Why automated digital morphology still struggles with consistency across leukemia and lymphoma cases.
  • [00:14:45] Lossy DICOM conversion: How file format changes can subtly—but significantly—affect AI model performance.
  • [00:21:45] Federated tumor segmentation challenge: Coordinating 32 international institutions to benchmark healthcare AI fairly across diverse datasets.
  • [00:27:47] AI in gynecologic cytology: Reviewing AI-driven Pap smear screening—promise, limitations, and why rigorous validation remains essential.
  • [00:32:27] Takeaway: Trust but verify—AI tools must be validated before they can support or replace clinical decisions.

Resources from this Episode

  • Nature Medicine – Fine-tuned pathology foundation model for lung cancer EGFR biomarker detection.
  • Scientific Reports (Germany) – Study on how DICOM conversion impacts AI performance in digital pathology.
  • Federated Tumor Segmentation Challenge – Benchmarking AI across 32 global institutions.
  • Acta Cytologica – Review on AI in gynecologic cytology and Pap smear screening.

Support the show

Become a Digital Pathology Trailblazer, get the "Digital Pathology 101" FREE e-book, and join us!

00:00:02 - 00:01:19

Aleks: Good morning, my digital pathology trailblazers. How are you today? It's 6:00 a.m. in Pennsylvania. Are you ready for DigiPath Digest? I checked, it's the 26th. So, I am opening the chat. If you are there, let me know that you've arrived and where you're tuning in from. I'm in Pennsylvania, 6:00 a.m. today. And I hope it's a thought-provoking title: what if your AI tool is lying to us? What if AI is not as correct as we want it to be? Which is kind of the question that is in



00:00:40 - 00:01:51

the back of everybody's mind when we are considering using AI algorithms for diagnostics. So we're going to dive into this a little bit today. When you join, let me know where you're tuning in from. I'm going to be in Poland again soon, so I don't know if I'm going to pause the live streams for vacation or if we're going to continue. Usually in the summer the attendance is lower than in the cooler months, because everybody is going on vacation. But we'll see.



00:01:15 - 00:02:46

Let's see. Let me share my screen with you, and we can start discussing our paper, our papers actually. We're going to start with something more positive than AI lying to us, and something that was just published, on July 9th. Let me know if you hear me well, if you see everything, and of course where you're tuning in from. So let's start with the real-world deployment of a fine-tuned pathology foundation model for lung



00:02:00 - 00:03:26

cancer biomarker detection. I love this one, because this is something I was waiting for for a long time. And what do I mean? Why was I waiting for this particular paper? I was not waiting for this paper, but I was waiting for this technology, this application of AI, to actually find utility. What we are discussing in this paper is a clinical trial, and I'm going to tell you about it in a second. So this is Nature Medicine, and we have teams from different places, right? We have



00:02:43 - 00:04:30

Gabriele Campanella. And Gabriele Campanella is kind of famous in the digital pathology space, because (I'm just double-checking, not to tell you the wrong thing) she was on the paper about multiple instance learning for prostate cancer detection, and that was kind of a pivotal paper in the digital pathology space. So anyway, here we have her on the new paper. Okay. Now, these are people from Mount Sinai, New York; Munich, Germany; Sweden; New York; Syra Cruz.



00:03:40 - 00:05:10

The thing is that in this paper, obviously, we are training different models for histopathology slides. We have hematoxylin and eosin stained tissues, and the computational methods promise tissue-preserving diagnostic tools for patients with cancer, right? Tissue-preserving is the key word here. And the problem is that the clinical utility in real-world settings remains unproven. This is always the problem still. And I see new



00:04:25 - 00:05:43

people joining. Let me know where you're tuning in from in the chat. Just say hi in the chat, to let me know that you're really here and it's not just numbers on my screen. So, in this paper we want to assess EGFR mutation in lung adenocarcinoma. That demands rapid, accurate and cost-effective tests that preserve tissue for genomic sequencing. And there are two things: PCR and next-generation sequencing. PCR-based tests provide rapid results but reduced accuracy



00:05:03 - 00:06:27

compared with next-generation sequencing, because next-generation sequencing would be the gold standard; it is very accurate. And the thing I was waiting for is to leverage computational biomarkers. In this case it is a modern foundation model that can address these limitations, these tissue-preservation limitations. So they assembled a data set of digital lung adenocarcinoma slides: 8,461 slides. That's a pretty big number. And I see even more people joining. Let me



00:05:46 - 00:07:12

know in the chat where you're tuning in from. They wanted to develop their computational EGFR biomarker, and I saw a case report two or three years ago about this, and I was waiting for the next thing. So what happened here is that their fine-tuned open-source foundation model detected EGFR, and they conducted a prospective silent trial. What is a prospective silent trial? This is when AI, or some digital test, is deployed in parallel



00:06:29 - 00:07:45

with standard workflows. Okay, more people joining even. Thank you so much. AI in parallel with standard workflows: so the standard workflow is still happening, and the people who are treating the patients don't know that there is an additional piece of information; they don't get this additional piece of information. And the goal of this trial is to check: is this good enough, is this as good as standard of care? And what they achieved here with this biomarker, this



00:07:06 - 00:08:49

digital EGFR biomarker detecting EGFR mutation status from histology, was that they were able to reduce the number of rapid molecular tests needed by up to 43%. And I'm like, how do you know that you can reduce it by 43%? So what they did: they ran this algorithm on those images, and there's a score, right, a score for the probability of the EGFR mutation. And when the score was very high, the likelihood of the EGFR mutation being there was very high, so



00:07:58 - 00:09:25

you can skip the PCR; when the score was very low, the likelihood of this mutation being there was very low, so you can skip the PCR. And if the score was in the middle: go do the PCR and confirm. So in 43% of the cases, the scores were high enough or low enough compared to the threshold. And the threshold was set in a way that mimicked the clinical workflow. So that's how they have proven the real-world clinical utility of this computational pathology biomarker.
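The triage logic described here (skip the confirmatory test when the score is confidently high or low, otherwise run PCR) can be sketched in a few lines. This is a minimal illustration; the function name and the threshold values below are made up for the example, not taken from the paper:

```python
def triage_egfr(score, low=0.1, high=0.9):
    """Route a case based on the model's EGFR-mutation probability score.

    The cutoffs are illustrative placeholders, not the study's values.
    """
    if score >= high:
        return "likely EGFR-mutant; skip rapid PCR"
    if score <= low:
        return "likely EGFR-wild-type; skip rapid PCR"
    return "indeterminate; run rapid PCR"

# On a toy list of scores, count how many cases avoid the rapid test:
scores = [0.02, 0.95, 0.5, 0.97, 0.08, 0.6]
skipped = sum(triage_egfr(s) != "indeterminate; run rapid PCR" for s in scores)
print(f"{skipped}/{len(scores)} cases triaged without PCR")  # 4/6 here
```

In the study, the cutoffs were chosen so that about 43% of cases fell outside the indeterminate zone, which is where the reduction in rapid molecular testing comes from.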



00:08:41 - 00:10:13

And one thing, okay, and we have people joining. Welcome, welcome, thank you so much for joining from Indonesia, this is cool. So this is how they have proven it. I wanted to start on a positive note, because we have a few abstracts that are showing us that AI is not always that fantastic, and one of them is the next one that I chose for you: the utility of automated artificial intelligence-assisted digital cytomorphology for bone marrow analysis and diagnostics



00:09:30 - 00:11:00

in diagnostic hemato-oncology. And I look at this one and it says: no abstract available. I'm like, why is there no abstract available? There is no abstract available because, I think, it's a letter to the journal; the authors decided to write a letter, and that was March or May. What they show here is a very important thing to just be aware of, and always have in mind for these tests, until we have enough proof that we can trust them



00:10:15 - 00:11:47

like we trust PCR, right? When you do PCR or any other test, there is always a lab validation that you have to go through, even though these methods are well established. Our AI methods are semi-well-established; we are just starting with this, right? So what they did: they have this artificial intelligence-assisted digital cytomorphology, and they have an abbreviation for it. Here: this AI-driven automated digital morphology, ADM. So we had 328 routine bone



00:11:01 - 00:12:18

marrow smears, and what they did: conventional optical microscopy and automated digital morphology, automated with the algorithm. They did 500 cells per sample and then they did statistics. They did something called visual data mining with patient similarity network color coding, so color coding patients with similar characteristics to identify patient cohorts. And the outcomes they were measuring were cellular classification consistency, clinical classification consistency,



00:11:40 - 00:13:04

critical misclassification and diagnostic conformity. And what they have here are these two tables. On the first one you can see exactly which cells were misclassified. There's the diagnosis, this is how the cell looks, and they look very similar, so I'm kind of not surprised that the automated classification had problems. Expert classification and automated classification. And we have this for different cells, right, and all the



00:12:21 - 00:13:51

diagnoses are listed here: marginal zone lymphoma, mantle cell lymphoma, hairy cell leukemia, chronic lymphocytic leukemia, acute myeloid leukemia, and all the different things, right? So here we have the diagnosis, what was the expert classification, and what was the automated classification. All of these are mistakes, right? And what were the mistakes? The summary is that for this particular diagnosis, the number of incorrect cases was nine out of 43. And



00:13:08 - 00:14:40

what were the misclassified cells? Mostly, here, atypical lymphocytes were classified as blasts, not otherwise classified. We had plasma cells also classified as blasts; myeloblasts and monoblasts classified as lymphocytes; promyelocytes and lymphoblasts as lymphocytes. So there are things that matter, and that's what they showed us in this particular letter to the journal. I don't know why it's a letter to the journal instead of a full publication. But basically they said



00:13:55 - 00:15:36

that you need to pay attention and validate. So this is one of the instances where AI may, oh, and we have guests from Germany. Welcome, welcome, I'm glad everything is working fine. If you have just joined, let me know in the chat. And we move on, in the theme of what if your AI tool is lying to us. This one is not lying per se, but the performance decreases. So what happened in this one? By the way, on purpose, I put on these earrings for our live streams, in case



00:14:45 - 00:15:59

anybody is interested in getting them, because we have them now in the store. So I am putting up a QR code to the store, where you can check the type of earrings that we have, a couple of other things, and mainly courses. You can have a look at that. I'm going to leave the code here; you can scan it whenever, as we move through a few more publications. So: lossy DICOM conversion may affect AI performance. May affect, right? It's not a disaster,



00:15:23 - 00:16:54

but we have to pay attention, right? So what's the paper? Scientific Reports, and we have a group mostly from Germany, including Frankfurt. So, many pathology departments digitize their glass slides, right? This is what we want. Oh, and I see people are interested in checking out the store. Go ahead, as I talk about this particular one. So we digitize, right? And the problem is that to ensure the



00:16:08 - 00:17:20

long-term accessibility, it is desirable to store the images in the DICOM format. There is no digital pathology discussion without mentioning DICOM, right? We want a standard; we want to save, and actually create, these images in DICOM. But currently, scanners initially store the images in vendor-specific formats and only provide DICOM converters, with only a few producing DICOM directly. And this conversion is not lossless. What does that mean? It loses something.
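To make "not lossless" concrete: the abstract quantifies the difference with structural similarity indices, and a simplified single-window SSIM is easy to sketch in NumPy. This is an illustration only; the added noise below is a crude stand-in for JPEG recompression, not what the actual converters do:

```python
import numpy as np

def global_ssim(a, b, data_range=255.0):
    """Global (single-window) structural similarity between two grayscale
    image arrays; a simplified stand-in for the windowed SSIM used in
    image-quality studies."""
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    a = a.astype(np.float64)
    b = b.astype(np.float64)
    mu_a, mu_b = a.mean(), b.mean()
    var_a, var_b = a.var(), b.var()
    cov = ((a - mu_a) * (b - mu_b)).mean()
    return ((2 * mu_a * mu_b + c1) * (2 * cov + c2)) / (
        (mu_a**2 + mu_b**2 + c1) * (var_a + var_b + c2)
    )

rng = np.random.default_rng(0)
tile = rng.integers(0, 256, size=(64, 64))
# Simulate a lossy re-encode as additive noise (a crude stand-in for JPEG):
converted = np.clip(tile + rng.normal(0, 10, size=tile.shape), 0, 255)

print(round(global_ssim(tile, tile), 3))   # identical tiles give 1.0
print(global_ssim(tile, converted) < 1.0)  # a lossy copy scores below 1.0
```

A score of 1.0 means the tiles are identical; the study's converted slides landed between 0.85 and 0.96, so visually similar but measurably different.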



00:16:44 - 00:18:19

It's not lossless, right? It loses some information. And here they were working with MRXS files. These are files from the 3DHISTECH scanner, and they had two ways of making them into DICOM. They converted these files, depicting bladder, ovarian and prostate tissue, into DICOM images using the vendor's converter and an open-source tool, both using baseline JPEG for the recompression. So we have two things: we have the 3DHISTECH



00:17:32 - 00:18:44

converter, and we have an open-source tool. So what happened when you do that? And if you're just joining, let me know in the chat where you're tuning in from and what time it is for you. So, after conversion, when you looked at those images, there were no human-perceptible differences. For diagnostic purposes that would not be a problem, right, because you don't even see the difference. However, these images were not identical, and there were structural similarity



00:18:07 - 00:19:32

indices of 0.85 to 0.96. So they were not identical. The vendor-specific converter in general achieved higher values. So the one specific for MRXS files, for 3DHISTECH files, was better, but it was not 100%, right? And AI models based on convolutional neural networks, and also current foundation models, could distinguish between the original and converted images in most cases, with an accuracy of 99.5%. So an AI model will know which one is compressed and which one is not



00:18:50 - 00:20:20

compressed. And the problem was that already-trained AI models showed performance differences between the image formats in five out of 64 scenarios, mainly when only little data was used during training. So, DICOM images are intended for diagnostic use; if we intend to use DICOM for diagnostics, which we do, then we have to develop or re-evaluate AI algorithms on this format. But that's kind of not surprising



00:19:35 - 00:21:08

that you have to re-evaluate, or train on, different formats. I think the surprise, or the message here, is: it's not identical. So please check. I don't know if the assumption was that it should be identical. But the digital pathology community is committed to DICOM, right? We want DICOM, we want a standard for those images, because we want to ensure interoperability. And some first AI trainings with converted files



00:20:22 - 00:21:41

did not result in systematically decreased performance. So if you train with these, everything should be okay, but keep in mind that this is a different format, right? If you have any questions, anything you want to share, put it in the chat, and let's move on to the next one. By the way, if you're just starting your digital pathology journey, then you might want to read the digital pathology book, Digital Pathology 101. I wrote this book for those who are starting



00:21:02 - 00:22:28

their digital pathology journey. And I'm going to put a QR code here. No, that's not the one. QR code here. This is the digital pathology book, a free PDF that you can download through this QR code. Let me go back to full screen. Is it preventing us from seeing it? I think so, so I'm going to take it down for now, but if you're interested, get the book. And we're going to talk about the next one: towards fair decentralized benchmarking of healthcare AI with the federated tumor



00:21:45 - 00:22:59

segmentation challenge, in Nature Communications. And we have these image analysis challenges, computational competitions, which are the standard for benchmarking medical image analysis algorithms. If you have been in the digital pathology space, you have probably heard about the CAMELYON challenge, where the performance of algorithms for detection of metastasis in lymph nodes was evaluated. They did it two years in a row. I don't know, probably the data set is



00:22:22 - 00:23:42

still there. And I see that people are checking out the store, for earrings and for courses: not just beautiful but also educational. So these competitions are happening, and they typically use small, curated test data sets acquired at a few centers, right? That's not the world; they leave a gap to the reality of diverse, multicentric patient data. And this federated tumor segmentation challenge represents a paradigm for real-world algorithmic performance evaluation. So what does that mean,



00:23:01 - 00:24:18

federated? Federated means this algorithm can go to different places where data is available, to different centers, right? It's just an analogy, but you send the algorithm out to learn and bring it back with the weights, with the knowledge. Whereas the classical way of training and checking is that you bring all the data to the algorithm. But there is only a limited possibility to do that; you can only bring so much data, right? And



00:23:40 - 00:24:56

also, privacy concerns prevent many partners from sharing this data. But what can happen: you can send this model to different places, and that's the federated way of training models, and in this case, testing models. So there was this competition to benchmark federated learning aggregation algorithms and state-of-the-art segmentation algorithms across multiple international sites. They compared them using a multicentric brain tumor data set in realistic federated learning simulations,



00:24:18 - 00:25:35

yielding benefits for adaptive weight aggregation and efficiency gains through client sampling. So what is weight aggregation? It is basically combining model updates. And client sampling? Clients would be the centers, the labs, the hospitals where the data comes from, and you choose them for each round of training. So the model doesn't train on all of them all the time; there is something called client sampling to choose them. This is impressive: they had data distributed



00:24:57 - 00:26:10

internationally across 32 institutions. Like, can you imagine coordinating 32 institutions? I am impressed by this particular aspect of the paper, that they managed to coordinate 32 institutions. And do we have a lot of authors here? Yes. That's why we have so many authors here; this is the author list. They didn't even put where those authors are from, because then it would probably be longer, definitely longer than the abstract. Yeah, so federated



00:25:33 - 00:27:29

learning, right? And you can now compare in a federated way. So congratulations on that. What do we have here? Last but not least, we have AI in gynecologic (that's the correct pronunciation) cytology. So, Pap smear evaluation, this kind of AI. This was published in Acta Cytologica, and we have authors from Pittsburgh, Pennsylvania, from China, from Massachusetts, and I'm sorry for not pronouncing all the cities. Okay. So,



00:26:30 - 00:27:50

this is our last paper. Don't go, wait till the end. I'm going to put the book QR code up again. And I'm going to put one more here if you're interested: this is a specific course that I have on AI, and I wanted to let you know about it because we just updated it with my team. We put in the edited versions of the seven-part AI series live streams. So it's very up to date, and it has the basics to introduce you to AI. And it also has all the updated live streams, all



00:27:11 - 00:28:24

the papers, all the audio versions of the seven-part AI series. If you have been in one of these live streams, let me know in the chat that you actually attended. But now we have an edited version; you can skip all my ramblings, skip all my mistakes, because my editor took care of it. So no losing time, just value: Path AI Makeover. Have a look at it, it's very affordable, if you're interested. You can do that in the meantime, and let's talk about the AI in



00:27:47 - 00:29:09

gynecologic cytology. So, cervical cancer. The background is that cervical cancer is the fourth most common cancer in women globally, with the highest incidence and mortality identified in less developed and medically underserved areas of the world. Which is funny, well, not funny: my mind always goes to continents other than North America when I hear about medically underserved areas. No, I have enough medically underserved areas here in the



00:28:29 - 00:30:00

Appalachian Mountains in Pennsylvania. So it's not, it's across geographies. So: the diminishing cytology workforce, unavailability of expert consultation, the high volume of Paps needing manual screening. So the digital pathology world is exploring innovative solutions, right? AI solutions. Can we somehow improve care without enough people to actually provide this care? So yes, there are AI algorithms, and they are potentially transformative for traditional



00:29:15 - 00:30:39

cytopathologic practice. And I'm going to tell you something after I go through the abstracts: a real-life example from a country where this is actually deployed at scale. You can guess what country that is, and let me know in the chat if you know about an AI-driven Pap smear screening at the national level. So, these AI-based systems are relatively new, with limited published data on their validation and clinical utility in clinical practice. This article aims to increase awareness



00:29:57 - 00:31:11

of the availability of such systems. It talks about history, so it's a review, right? The development of AI-assisted screening platforms. So whoever is considering deploying this in your institution, that's a good, pretty up-to-date one to read, to check what's out there. There are these AI platforms for screening Pap tests; how their performance characteristics compare; and the authors also elaborate on technical challenges associated with conducting clinical trials employing this



00:30:33 - 00:31:41

technology. And they discuss considerations about deploying such systems in routine cytopathology practice. Obviously, you have to be digitally enabled. Oh, we're almost done and I still see people joining. Let me know where you're joining from. I'm going to put up the store QR code, because the course is in the store as well. Well, you don't have to let me know; you can just scan the code and check it



00:31:08 - 00:32:32

out for yourself. And you'll find these earrings there, and courses. The key message is that these systems are being developed and utilized in cytopathology practices to screen Pap tests, and some of them have good performance and some don't. So the main message is: you need a judicious review of these systems using evidence-based studies, and this is imperative to promote widespread adoption. So the message of today's episode is, and even more people joining at the end!



00:31:50 - 00:33:05

Thank you so much for joining. I know it's early, 6 a.m. That's very early. Let me put the book QR code up for you. If you are new to digital pathology, this book is going to help you understand the field better. Let me just do this. No, we're done with the papers. Okay. So, this is the book, Digital Pathology 101. If you want the paper copy, it's available on Amazon, but you can totally get the PDF for free on the Digital Pathology Place website. You



00:32:27 - 00:33:40

scan the QR code and you get access to it; you just have to put your email in, and then I'm going to send you the link to download. Yeah, the message today, and it's not a new message, it's like for any new method, right: trust but verify, or rather, first verify before you trust. And I don't want to say it taps into the fear we have that it's going to be wrong. Yes, it's a fear, but it's a legitimate concern. Let's not call it fear; let's call it a legitimate



00:33:04 - 00:34:27

clinical and scientific concern: that if we are trying to introduce any new tools into science, into clinical practice specifically (because then it directly affects the diagnosis), then we need to be sure that it works, or we need to have a way to prove that it works. So, thank you so much for joining me. Thank you so much for scanning the codes and checking out the store. The store was a labor of love with my website designer. She's been working with me on the



00:33:45 - 00:34:53

Digital Pathology Place endeavor since the very beginning. So you will see a very comparable design, and it's just so beautiful. I just wanted to share that, regardless of whether you want to get something from the store or not: it just looks so beautiful, because she knows what Digital Pathology Place is about, and it's so interesting to work with somebody for so many years. We started in 2021, and that is when we put up the beautiful website. At the beginning I



00:34:19 - 00:35:21

was just trying to do it myself. It didn't look too great, but the information got enough traction for me to engage with her. And fun fact: she's my friend from primary school in Poland. She now lives in Spain, has traveled the world, and is a brilliant designer, but we've known each other since we were like five, and we still work together. So, just bragging about how beautiful the store is. Thank you so much for joining me and