Digital Pathology Podcast

227: Implementing Generative AI and LLM Assistants in Oncology Practice

Episode 227 · Aleksandra Zuraw, DVM, PhD


Paper Discussed in this Episode:
How to bring generative AI to oncology practice. D. Truhn & J. N. Kather. ESMO Real World Data and Digital Oncology 2026.
Episode Summary:
In this journal club deep dive, we step out of the theoretical sci-fi hype of artificial intelligence and look at a practical, real-world roadmap for bringing Generative AI into oncology. We examine a 2026 paper that maps out the trajectory for deploying Large Language Models (LLMs) to combat the overwhelming cognitive load of modern cancer care. Rather than replacing clinicians, this episode explores how AI can synthesize massive amounts of unstructured data—like dense pathology narratives and shifting molecular reports—so doctors can get back to practicing medicine instead of acting as data entry clerks.
In This Episode, We Cover:
The Data Avalanche in Oncology: Why the shifting landscape of decades of patient histories, clinical trial registries, and handwritten notes creates an information load that human cognition simply wasn't evolved to process all at once.
How LLMs Actually "Think": Why predicting the "next word" based on massive training data allows AI to mimic medical reasoning and organize complex clinical concepts—like linking a BRAF mutation directly to a specific inhibitor without looking up a rulebook.
The Three Evolutionary Steps of AI Complexity:
Step 1: Stand-alone Models: The "closed-book exam." These models (like early ChatGPT) are frozen in time with their original training data and have zero access to new clinical trials or FDA updates.
Step 2: Retrieval-Augmented Generation (RAG): The "open-book exam." The AI searches continually updated external databases and guidelines before answering, significantly reducing fabricated answers, or "hallucinations."
Step 3: Agentic AI: The ultimate goal. Fully functioning "research assistants" that can iteratively reason, plan steps, and invoke external software tools (like lab APIs and medical calculators) to complete complex tasks like proposing tumor board summaries.
The Deployment Roadblocks: Why you can't just drop an autonomous agent into a fragmented hospital IT network built in 2005. We unpack strict security silos, audit logs, and the dangerous reality of "domain shift"—where an AI trained perfectly at Johns Hopkins might silently fail at a community clinic simply due to different doctor shorthand or microscopic slide scanner colors.
The Human Element & Automation Bias: The hidden dangers of junior doctors losing their clinical intuition (deskilling) and why system design must force the AI to "show its work" with intentional friction to prevent doctors from blindly clicking accept on a hallucinated treatment plan.
Your Edits Are the Future: A fascinating look at how a clinician's daily administrative annoyances—every strike-through and manual correction of an AI draft—serve as the ultimate, high-value ground-truth data to train the next generation of oncology AI.
Key Takeaway:
The destination we are driving toward is augmentation, not automation. By handling massive information synthesis, uncovering patterns, and explicitly showing its work, AI can act as a tireless assistant that improves routine care, while leaving the final, nuanced clinical judgment exactly where it belongs: with the human physician.

Get the "Digital Pathology 101" FREE E-book and join us!

Welcome back, trailblazers, to another Journal Club edition of the Digital Pathology Podcast. You know, usually when we talk about a medical diagnosis, there's this certain expectation of precision.


Oh, absolutely. It feels almost like engineering, right?


Yeah, exactly. Like you break your arm, the X-ray shows a jagged white line on a black background and the doctor points to it and says, "There it is. It's totally binary. Broken or not broken,


right? It's clean. It's visible. And it's categorized." And you know, human brains just prefer that kind of definitive visual confirmation.


We really do. But I mean, step into the world of oncology and suddenly that clean diagnostic picture just completely shatters.


No, it shatters into a million pieces,


right? You aren't looking at a single image anymore. You are staring down this, uh, this mountain of unstructured, messy information: decades of patient history, handwritten clinic notes, dense pathology narratives,


not to mention the constantly updating molecular reports.


Yes. And the shifting clinical trial registries. It's basically diagnostic muddy waters and it's completely overwhelming


which is frankly an information load that human cognition simply wasn't evolved to process all at once.


Right. Nobody can read all that.


Exactly. I mean a single patient's file can span thousands of pages of text and then you combine that with complex genomic data that literally changes meaning based on newly published research. That cognitive overload is really the focal point of our discussion today.


It is. So our mission today is to unpack a highly practical, roadmap-focused paper titled "How to bring generative AI to oncology practice."


It's a great piece.


It really is. It's by authors D. Truhn and J. N. Kather, and it was recently published in the journal ESMO Real World Data and Digital Oncology. And you know what makes this paper stand out for me is that it completely ignores the usual sci-fi hype.


Yeah. There's no Terminator stuff in here.


Exactly. It maps out the actual constraints, the IT nightmares, and the realistic trajectory for getting AI safely into the clinic.


And the central premise they establish right away is that artificial intelligence in this space is not, I repeat, not about replacing the clinician.


Crucial point,


right? Oncology fundamentally runs on narrative texts and structured tables. So AI's role is strictly to synthesize that massive textual overload so the clinician can actually practice medicine


instead of spending their whole day acting as a data entry clerk.


Exactly. It's an assistant that reads across sources, tracks timelines, and crucially exposes its evidence. Okay, let's unpack this. Why is generative AI, specifically large language models or LLMs, the specific tool for this job? Moving from why oncology needs help to how the AI actually processes this data requires a bit of a, well, a mental shift.


It does because these models aren't actually thinking right or like looking up facts in a traditional database,


right? They aren't Google.


Mathematically speaking, large language models are basically sophisticated sentence-continuation engines. Their primary function is just to predict the next word, or more accurately the next token, in a sequence,


just autocompleting, basically


Essentially, yes. Based on the billions of parameters they developed during their training on vast swaths of human text, they are solving a statistical puzzle at just an unprecedented scale.


But wait, how does predicting the next word equate to doing complex medical reasoning? Because, uh, those sound like two completely different things to me.


It does sound strange, but if we connect this to the bigger picture, human language isn't just random symbols. It actually contains the underlying logic of the real world.


Okay, I see where you're going.


Yeah. So, by solving these statistical puzzles so accurately, the model's internal neural network actually organizes concepts spatially. It learns that certain clinical conditions, genetic mutations, and therapeutic responses, well, they live close together in that mathematical space.


Oh, wow. So, it effectively mimics medical reasoning through really complex pattern recognition.


Precisely.


Let's ground that a bit for the trailblazers listening. What does that mathematical space actually look like when applied to an oncology case?


Well, the paper uses a great example of a molecular tumor board. Imagine an LLM is processing a clinical narrative and it encounters the phrase metastatic colorectal cancer combined with a BRAF V600E mutation.


Okay, pretty standard complex case,


right? And it doesn't look up a rule book. Instead, it calculates the most statistically probable continuation of that specific semantic concept


because it's read so much oncology literature.


Exactly. Because of its training on all that literature, it assigns a much higher probability to a continuation involving a BRAF inhibitor rather than one involving just standard generic chemotherapy.
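To make that "most probable continuation" idea concrete for anyone reading along, here is a toy Python sketch. It is not from the paper; the candidate continuations and raw scores are invented purely to illustrate how a model turns scores into probabilities over possible next phrases.

```python
# Toy illustration only: invented scores, not a real model.
import math

def softmax(scores):
    """Turn raw scores into a probability distribution."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

context = "metastatic colorectal cancer, BRAF V600E mutation; recommend"
continuations = [
    "a BRAF inhibitor combination",   # appears often near this context in the literature
    "standard generic chemotherapy",  # appears less often as a continuation here
    "watchful waiting",               # rarely follows this context
]
raw_scores = [4.2, 1.1, -0.5]  # hypothetical scores a trained model might assign

for text, p in zip(continuations, softmax(raw_scores)):
    print(f"P({text!r} | context) = {p:.2f}")
```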


Man, it's just autocomplete in the clinical logic. That actually reminds me of how an experienced oncologist operates on the floor.


Oh, for sure. It's intuitive pattern matching,


right? A seasoned doctor sees a specific cluster of symptoms. They note a few key lab results and their brain just naturally auto completes the picture to formulate a diagnosis.


They aren't consciously flipping through a textbook in their head.


Exactly. Their years of exposure to thousands of cases just naturally connect the dots. The LLM is doing the exact same thing, just leveraging billions of textual data parameters instead of, you know, human clinical years.


Yeah. The underlying logic of the clinical diagnostic process and the logic of a language model map onto each other surprisingly well. They're both systems trained on massive amounts of prior examples to predict the most likely correct path forward.


But I mean, a massive predictive text engine frozen in time has to have some serious limitations. And the paper breaks down AI complexity into three functional levels to explain how we actually structure these models for real world clinical use.


Yeah. Truhn and Kather categorize LLM assistants into three distinct evolutionary steps. Step one is the stand-alone model


like early ChatGPT.


Exactly. Think of the baseline versions of early consumer AI. You put in a query and it generates an answer entirely from the static knowledge embedded in its parameters at the exact moment its training was completed.


Which means its knowledge is permanently frozen,


completely frozen.


So if I'm an oncologist asking a step one model about a chemotherapy regimen, it might give me a perfectly fluent summary based on data from two years ago.


But it has zero access to the clinical trial results published yesterday,


right? Or a recent FDA label modification. And worse, if it doesn't know the answer, the mathematical drive to predict the next word means it might just confidently fabricate an answer,


which is what we call a hallucination.


Yeah. So, a step one model is essentially a medical student taking a closed book exam. Whatever they crammed into their head before the test is literally all they have.


That's a perfect analogy. And the utility of stand-alone models for a fast-moving field like oncology is exhausted almost immediately. Which brings us to step two, retrieval-augmented generation, or RAG.


RAG. Yeah, we hear that term a lot,


right? In a RAG system, the underlying language model is still fixed, but we give it an external mechanism to pull in new information.


Okay. How does that work in practice?


When you ask it a question, the system first runs a search against a continually updated external database like current clinical guidelines or real-time adverse event profiles.


Oh.


It extracts the most relevant paragraphs from that database, invisibly pastes those paragraphs into your prompt, and then instructs the AI to answer the user's question using only this provided text.
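For anyone who wants to see the shape of that retrieve-then-generate loop, here is a minimal Python sketch. The helpers `search_guidelines()` and `call_llm()` are hypothetical placeholders, not any particular vendor's API, and this is not the authors' implementation.

```python
# Minimal RAG sketch: retrieve passages, paste them into the prompt, then generate.

def search_guidelines(question: str, top_k: int = 3) -> list[str]:
    """Placeholder retriever: would query an up-to-date guideline database."""
    # In a real system this would be a vector or keyword search.
    return ["<relevant guideline paragraph 1>", "<relevant guideline paragraph 2>"]

def call_llm(prompt: str) -> str:
    """Placeholder for the fixed, pre-trained language model."""
    return "<model answer grounded in the provided passages>"

def answer_with_rag(question: str) -> str:
    passages = search_guidelines(question)
    prompt = (
        "Answer the user's question using ONLY the provided text.\n\n"
        + "\n\n".join(passages)
        + f"\n\nQuestion: {question}"
    )
    return call_llm(prompt)

print(answer_with_rag("What is the current first-line regimen for this indication?"))
```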


So, RAG is that same medical student, but now they're taking an open book exam.


Exactly.


They still have their foundational understanding of medicine to help them read and interpret the material, but they're looking up the absolute latest verified dosing recommendations right there in the exam room before they write down their answer.


What's fascinating here is how that open-book concept pushes us toward the ultimate goal. If step two is the open-book exam, step three, which is agentic AI, is the fully funded research assistant running the lab. Oh wow. Okay.


This is where advanced AI enters routine practice. An agentic model doesn't just produce a text response. It has an internal logic loop where it reasons about the data, plans a sequence of steps, invokes external software tools, evaluates the output of those tools, and then revises its strategy.


Here's where it gets really interesting. How exactly does it plan that sequence? Like what is the actual mechanism making it an agent rather than just a really good search engine?


It uses a framework often called reasoning and acting. So when you give an agentic AI a complex goal like prepare a summary for this patient's tumor board, it doesn't just try to write the whole thing at once.


It breaks it down.


Exactly. It breaks the goal down into discrete steps. It might generate an internal thought. First, I need to know the patient's current kidney function.


Okay, that makes sense.


It then generates an action, writing a query to the hospital's lab system API to actually pull the latest creatinine levels. It observes the result, then generates its next thought: kidney function is low. Next action: open the medical calculator tool to adjust the chemotherapy dosage.


Oh, that is wild.


Yeah, it iteratively loops through thinking, acting, and observing until the full task is complete.
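Here is a stripped-down sketch of that think-act-observe loop in Python. Every tool in it (the lab API call, the dose calculator) is a made-up stand-in, and the dose logic is deliberately crude; it only illustrates the pattern of an agent chaining tools, not real clinical logic.

```python
# Illustrative agent-loop sketch: hypothetical tools, not a real clinical system.

def get_latest_creatinine(patient_id: str) -> float:
    """Placeholder for a call to the hospital lab system API."""
    return 2.1  # mg/dL, hypothetical value

def adjust_chemo_dose(standard_dose_mg: float, creatinine: float) -> float:
    """Placeholder medical-calculator tool: crude renal dose reduction."""
    return standard_dose_mg * (0.5 if creatinine > 1.5 else 1.0)

def tumor_board_agent(patient_id: str) -> str:
    # Thought: I need the patient's current kidney function.
    creatinine = get_latest_creatinine(patient_id)   # Action + observation
    # Thought: kidney function is low, so adjust the dosage with the calculator tool.
    dose = adjust_chemo_dose(standard_dose_mg=100.0, creatinine=creatinine)
    # Thought: summarize the reasoning for the human reviewer.
    return (f"Patient {patient_id}: creatinine {creatinine} mg/dL; "
            f"proposed adjusted dose {dose} mg (draft for review, not an order).")

print(tumor_board_agent("demo-001"))
```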


So, it's actively coordinating external software. It's reading the histology report, realizing it needs more information, and then actively invoking a separate specialized pathology algorithm to, say, predict microsatellite instability.


Yes. And then it translates all that into the free text criteria needed to check if the patient is eligible for a specific clinical trial.


That is a massive amount of administrative heavy lifting


and it structures the output according to international standards like OncoKB guidelines. So it generates a transparent rationale for why a patient should or shouldn't receive a specific targeted therapy.


But wait, if these agents are actively writing code to query APIs, running calculators, and pulling lab results, the immediate question I have is, do we just let them run wild in the hospital's electronic health record system? Yeah, that's the scary part


because giving an AI administrative access to poke around a fragile hospital database sounds like a recipe for a catastrophic IT failure.


Oh, absolutely. And this raises an important question about the barriers to deployment, which Truhn and Kather map out in really stark detail. The reality is that hospital IT infrastructure is incredibly fragmented


to put it mildly,


right? Data is locked in silos across different vendors, guarded tightly for privacy and security. Systems share data only through very narrow, fragile interfaces, and access is controlled by strict user roles.


Yeah, you can't just drop an autonomous agent into a network built in 2005.


You really cannot. Any AI assistant has to operate within strict predefined guardrails. And more importantly, anything the system does requires a full audit log and version control.


It has to show its math.


It has to show everything. The system must explicitly record the exact question asked, the specific databases it consulted, the precise software tools it invoked, the exact version of the model running, and even a mathematical estimate of its confidence.
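As a rough illustration of what such an audit record might contain, here is a small Python sketch. The field names and values are assumptions for the sake of the example, not a published schema.

```python
# Illustrative audit-record sketch: hypothetical fields, not a standard.
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class AuditRecord:
    question: str                   # the exact question asked
    databases_consulted: list[str]  # which guideline/lab sources were queried
    tools_invoked: list[str]        # external tools the agent called
    model_version: str              # exact model version that produced the output
    confidence: float               # model's estimated confidence, 0-1
    timestamp: str = field(default_factory=lambda: datetime.now(timezone.utc).isoformat())

record = AuditRecord(
    question="Is this patient eligible for the trial discussed at tumor board?",
    databases_consulted=["clinical_guidelines_2026_01", "lab_results"],
    tools_invoked=["msi_predictor", "dose_calculator"],
    model_version="onco-assistant-1.3.2",
    confidence=0.87,
)
print(record)
```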


Because without that granular transparency, investigating a medical error just becomes impossible. But let's play devil's advocate. Let's say a massive well-funded research hospital manages to build all those IT interfaces and audit logs perfectly. Their agentic AI is running flawlessly. The paper introduces this concept of domain shift.


Yes, domain shift is huge. If it works perfectly at a place like Mayo Clinic or Johns Hopkins, shouldn't it work just as well at a community clinic down the road? I mean, cancer is cancer. The biology of a tumor doesn't change just because you cross state lines.


Well, the biology doesn't change, but the data representing that biology changes drastically.


How so?


Different hospitals have entirely different note-writing cultures. Physicians use different shorthand. Laboratory panels have different reference ranges. Even in pathology, different hospitals use varying chemical staining protocols.


Oh, right.


Which means the exact same tissue sample will look slightly different under a microscope depending on where you are. Furthermore, they buy their imaging scanners from different vendors, creating minute variations in image resolution and color balance.


So, a model trained entirely on Johns Hopkins data learns the subtle specific patterns of how Johns Hopkins doctors write and exactly what shade of pink their specific slide scanners produce.


Precisely. All of those subtle local variations basically confuse a model trained in a different environment.


That makes a lot of sense.


Yeah. When you move it to a community clinic, the extraction accuracy just drops. The AI might silently start missing key diagnoses simply because the new hospital scanner has a slightly different contrast ratio, which throws off the internal mathematics of the neural network.


It's like uh learning to drive a car perfectly in your quiet hometown. You know where every pothole is, the rhythm of the traffic lights, the exact width of the lanes,


right? But then you fly to a foreign country, you rent a car and suddenly the steering wheel is on the right side. The street signs use completely different iconography and the local drivers follow this totally unwritten set of rules.


That is exactly what it is.


You understand the fundamental mechanics of operating a vehicle, but the context has shifted so drastically that you are highly likely to crash.


And that contextual shift is why standard leaderboard benchmarks, where AI models are ranked in a vacuum on static data sets, are essentially useless for judging clinical readiness. You have to prove the model can drive safely in the new city.


Okay. So, how do you do that without risking patient safety?


Through prospective multi-center evaluation using what the authors call silent clinical studies.


Oh, also known as shadow mode.


Exactly. You deploy the agentic AI into the new hospital's infrastructure and you let it run live on incoming data. It reads the files, invokes its tools, and produces tumor board summaries. But that output is strictly quarantined. It is never shown to the clinician.


So, just practicing in the background. No harm, no foul.


Right? Then researchers take the AI's quarantined output and compare it against the actual ground truth, which is the final signed reports generated by the human experts at that specific hospital. You rigorously evaluate if the AI's extraction accuracy holds up against the local shorthand and local imaging equipment, and you only graduate the system to a live clinician-facing interface once its performance stabilizes across those diverse local settings. So that handles the IT infrastructure and the local data variations, but we still have to confront the inherent risks of the AI model itself. Going back to step two, where we gave the AI an open book through retrieval-augmented generation, does pulling in external documents actually cure the hallucination problem?
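Before turning to hallucinations, here is a minimal sketch of that silent, shadow-mode evaluation loop in Python. The comparison function and the go-live threshold are placeholder assumptions; a real study would score structured fields case by case and run across many sites.

```python
# Shadow-mode evaluation sketch: quarantined AI drafts scored against signed reports.

def ai_draft_summary(case: dict) -> str:
    """Placeholder: the agent reads the case and drafts a tumor board summary."""
    return case.get("ai_draft", "")

def matches_ground_truth(draft: str, signed_report: str) -> bool:
    """Placeholder comparison: a real study would score structured fields, not strings."""
    return draft.strip().lower() == signed_report.strip().lower()

def shadow_evaluation(cases: list[dict], go_live_threshold: float = 0.95) -> bool:
    correct = sum(
        matches_ground_truth(ai_draft_summary(c), c["signed_report"]) for c in cases
    )
    accuracy = correct / len(cases)
    print(f"Shadow-mode accuracy at this site: {accuracy:.1%}")
    # Only graduate to a clinician-facing interface once performance stabilizes.
    return accuracy >= go_live_threshold

cases = [
    {"ai_draft": "pT3 N1 colorectal, BRAF V600E", "signed_report": "pT3 N1 colorectal, BRAF V600E"},
    {"ai_draft": "pT2 N0 colorectal", "signed_report": "pT3 N0 colorectal"},
]
print("Ready for live use:", shadow_evaluation(cases))
```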


Well, RAG significantly reduces hallucinations, but it definitely does not eliminate them. The vulnerability just shifts.


Shifts how?


It shifts from making things up out of thin air to misinterpreting the retrieved data. The model might successfully pull the correct clinical guideline, but the neural network might misattribute a sentence from paragraph A to a concept in paragraph B.


Oh, I see.


Yeah. Or it might paraphrase a complex conditional rule in a way that accidentally alters its medical meaning. Or honestly, it might just get confused by dense tables and mix up similar drug names or contraindications.


That's terrifying. If the model can still confidently hallucinate a drug dosage based on a misread table, how do we protect the patient? Because we can't expect a doctor to double check every single mathematical calculation the AI does behind the scenes.


No, we can't. And that protection really comes down to rigorous user interface design. The system architecture must enforce an evidence before conclusion protocol.


What does that look like?


The UI cannot just present the final answer. It must display the exact cited passages, visually highlight the specific clause of the guideline it used, and present the raw mathematical inputs of any calculation before the clinician even reads the final recommendation.


Ah, so show your work first so the doctor can verify the logic at a glance.
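A rough sketch of what that evidence-before-conclusion output could look like in code, including the abstention behavior discussed next. The payload structure and the evidence threshold are illustrative assumptions, not the paper's specification.

```python
# Illustrative "evidence before conclusion" payload: hypothetical structure only.

def propose_plan(retrieved_passages: list[str], contradictory: bool) -> dict:
    """Return evidence first; abstain if the evidence is thin or contradictory."""
    if len(retrieved_passages) < 2 or contradictory:
        return {
            "status": "ABSTAIN",
            "message": "I lack sufficient context to formulate a plan; escalating to a human.",
        }
    return {
        "evidence": retrieved_passages,               # cited passages shown to the clinician first
        "highlighted_clause": retrieved_passages[0],  # the specific guideline clause relied on
        "calculation_inputs": {"creatinine_mg_dl": 2.1, "standard_dose_mg": 100.0},
        "recommendation": "<draft plan, read only after the evidence above>",
        "status": "DRAFT_FOR_REVIEW",
    }

print(propose_plan(["guideline section 4.2 ..."], contradictory=False))  # abstains: too little evidence
print(propose_plan(["guideline section 4.2 ...", "trial update ..."], contradictory=False))
```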


Furthermore, the agent must be programmed with an abstention protocol. If the retrieved evidence is thin or contradictory or just missing, the AI needs to be trained to explicitly state, "I lack sufficient context to formulate a plan,"


rather than attempting to guess the missing pieces.


Exactly. It has to escalate the task to a human,


which introduces a massive regulatory headache. I mean, where does an AI assistant end and a highly regulated medical device begin?


Yeah, that's the million-dollar question. The regulatory boundary really hinges on execution and autonomy.


Okay.


If the agent is purely retrieving passages and drafting a narrative summary for a human doctor to review, edit, and sign, well, it operates largely as an assistive documentation tool. But the moment that agent begins writing discrete executable data back into the electronic health record,


like actively proposing an executable order for a chemotherapy drug.


Exactly. It crosses the line. It becomes a medical device subject to intense oversight. Under the EU AI Act, systems proposing treatments are classified as high risk.


And in the US,


in the US, the FDA applies its software-as-a-medical-device rules because the outputs directly influence therapeutic action. This means a hospital can't just push a software update to the AI over the weekend.


Oh wow.


Yeah. Every model version change requires strict regression testing and a traceable regulatory approval path, just like modifying a drug formulary. That regulatory friction is totally necessary, but it also points directly to the human element of this equation. Even if the UI forces the AI to show its work, humans suffer from alert fatigue.


We do.


If this agentic AI is right 99% of the time, the natural human tendency is to just stop checking the math. How do we prevent doctors from just blindly clicking accept on whatever the screen proposes, allowing that 1% hallucination to slip right into the patient's record?


You are describing automation bias, which the authors identify as a critical long-term risk. And there is also the compounding danger of deskilling.


Deskilling.


Yeah. If junior physicians rely entirely on AI summaries from day one, they may never develop the deep intuitive clinical pattern matching skills we discussed earlier.


Oh wow, that's a really good point.


So to combat this, the system must be designed with intentional friction. High-risk actions cannot be a single click. They require conscious multi-step confirmation.


So we have this massive transition ahead of us. We need to solve the IT fragmentation, run silent shadow studies for domain shift, build friction into the UI to stop automation bias, and navigate the FDA. Truhn and Kather outline a very pragmatic step-by-step path for institutions to actually survive this transition.


Right? They do. They map a three-step adoption trajectory. Step one addresses the immediate crisis, which is shadow IT.


Right?


Because EHRs are currently so cumbersome, clinicians are already desperate for help and are using public nonsecure AI chat tools on their phones to draft clinical letters,


which is a catastrophic privacy risk.


Exactly. So, the first mandatory step for any hospital is providing sanctioned private alternative tools that explicitly protect protected health information. Stop the bleeding of shadow IT.


Makes sense. What's step two?


Step two is embedding dedicated narrow assistants directly inside the electronic health record. These are tools with modest abilities, like basic RAG functionality and medical calculators, that can draft notes with citations, but fundamentally cannot take any action without a human physically hitting approve.


And once that is stabilized and trusted,


step three is the deep integration of the event-driven agents we discussed. This is the fully realized system where, say, a new molecular pathology result arriving from the lab automatically triggers the AI to wake up, analyze the new data against the patient's history, and propose an updated tumor board summary before the doctor even opens the file.


But to get to step three, to build these highly capable, medically accurate agents, we need significantly better AI models. And the paper makes it clear that the ultimate bottleneck preventing those better models isn't computing power.


No, it's not.


It's curated oncology training data. An agent can only learn the nuance of oncology if we feed it massive well-curated multimodal data sets.


The limiting reagent is really high-quality human supervision. The model needs to see the notes, the radiology images, the genomic profiles, and the longitudinal outcomes all linked together and verified by experts.


So what does this all mean for you listening right now? Because there's a detail in this paper that completely reframes the daily grind of practicing medicine.


Yeah, this is my favorite part.


The authors point out that when it comes to fine-tuning these advanced models, the most valuable training labels don't come from a lab. They live in routine clinical edits. Every single time a clinician reviews an AI-generated draft and makes a correction, every strikethrough of a redundant sentence, every time you manually fix a slightly misclassified tumor grade, that action is a high-value supervision signal.


It is the ultimate ground truth data. When you correct the machine, you are explicitly mapping the boundary between mathematical probability and actual clinical reality.


Your daily administrative annoyances, your routine corrections of the medical record. They're not just busy work anymore. If your hospital has the proper data governance and explicit consent in place, your daily edits are literally training the future of oncology AI.
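For the technically curious, here is one way that edit-as-training-signal idea could be captured in code: a toy Python sketch pairing the AI draft with the clinician's signed version and recording the diff. The function and fields are hypothetical, and any real pipeline would sit behind the governance and consent just mentioned.

```python
# Sketch: turn a clinician's corrections into a (draft, corrected) supervision pair.
import difflib

def edit_as_training_example(ai_draft: str, signed_note: str) -> dict:
    """Package the AI draft, the signed correction, and a human-readable diff of the edits."""
    diff = list(difflib.unified_diff(
        ai_draft.splitlines(), signed_note.splitlines(),
        fromfile="ai_draft", tofile="signed_note", lineterm=""
    ))
    return {"input": ai_draft, "target": signed_note, "edits": diff}

example = edit_as_training_example(
    "Tumor grade: G3. Recommend regimen A.",
    "Tumor grade: G2. Recommend regimen A.",  # clinician fixed a misclassified grade
)
print("\n".join(example["edits"]))
```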


Exactly.


You are actively teaching the neural network how to think, how to weigh evidence, and how to operate like a human expert


which perfectly synthesizes the core thesis of Truhn and Kather's roadmap. The destination we are driving toward is augmentation, not automation.


Augmentation, not automation.


Right. The ultimate goal is a tireless learning assistant that handles the massive information synthesis, shows its mathematical work, and improves routine care, while leaving the final, nuanced clinical judgment exactly where it belongs: with the human physician.


That is the perfect summary. Thank you for joining us for this deep dive into the mechanics and the future of generative AI in oncology. But before we sign off, I want to leave you with a final thought to mull over.


Oh, I love these.


Something that builds on that concept of the strikethrough. If our daily edits, our highly personal clinical corrections, become the ultimate training data for the next generation of AI agents, who actually owns that collective clinical intuition?


Wow.


Right. If a model is fine-tuned exclusively on the specific edits of the oncologists at one specific hospital over several years, will that hospital's AI eventually develop its own distinct clinical personality?


That's a fascinating question.


Will an AI trained by doctors in New York make slightly different semantic choices or weigh risk slightly differently than an AI trained by doctors in London, simply because of the specific human instincts that shaped its neural pathways?


We are no longer just passing medical knowledge down through static textbooks or lectures. We are actively imprinting our collective clinical instincts directly into the code.


It is a wild frontier. Keep asking those hard questions. Keep correcting the record and keep demanding that these systems show their work. We will catch you on the next journal club.