Digital Pathology Podcast
199: Reporting Standards for Medical Foundation and Language Models
Paper Discussed in this Episode:
Reporting checklist for foundation and large language models in medical research (REFINE): an international consensus guideline. Mese I, Akinci D’Antonoli T, Bluethgen C, et al. Diagn Interv Radiol 2026.
Episode Summary: In this special journal club edition of the Digital Pathology Podcast, we tackle a massive structural problem in medical imaging and AI: the rapid adoption of foundation models and large language models (LLMs) that are completely outgrowing our traditional evaluation frameworks. We examine the groundbreaking 2026 REFINE consensus guideline that addresses the opaque and stochastic nature of generative AI, forcing researchers to fundamentally change how they report on these tools and move away from black-box unpredictability toward true reproducibility.
In This Episode, We Cover:
• The "Wooden Ruler" Problem: Traditional AI reporting tools, such as CLAIM and TRIPOD-AI, were built under the assumption that algorithms are deterministic, meaning they give the exact same output every time. Generative AI is inherently stochastic and sensitive to subtle variables, making old checklists function like rigid wooden rulers trying to measure a fluid target.
• The REFINE Framework: Created via a rigorous Delphi consensus process by 57 contributors from 17 countries, this robust 44-item, 6-section checklist is a massive global effort. It features a deliberate "N/A" filtering mechanism to practically accommodate highly diverse text, imaging, and multimodal study designs.
• Prompting is the New Coding: We explore why researchers must now treat prompt engineering with the exact same rigor as traditional source code. The guideline requires full transparency on prompting strategies, session memory policies, and precisely how patient clinical context (like BI-RADS or ICD codes) is integrated into the model.
• Corralling the Chaos (Stochasticity & The Human Element): Controlling an LLM requires detailing generation parameters like "temperature," which dictates model creativity. Crucially, studies must also document the prompt operator's characteristics, as a senior attending radiologist will intuitively guide a model very differently than a first-year resident, drastically skewing the output.
• The Contamination Crisis: We discuss the existential threat of dataset contamination, which occurs when an LLM has already memorized public test datasets (like MIMIC-CXR) during its pre-training phase. The guideline demands rigorous checks against the model's knowledge cut-off dates and full transparency regarding the use of synthetic data.
• Clinical Reality Check: A model's performance in a vacuum is meaningless if it cannot seamlessly integrate into a hospital's clinical workflow, such as its PACS. We detail why researchers must now explicitly outline clinical non-use cases, map out data privacy safeguards, and conduct formal failure analyses to categorize errors like hallucinations.
Key Takeaway: The REFINE guideline marks a critical maturation point for medical AI research. By rigorously addressing the unique chaotic elements of generative AI—such as prompt sensitivity, stochastic generation, and dataset contamination—this framework ensures that future medical AI studies provide a trustworthy, reproducible foundation of evidence that frontline clinicians can safely rely on for patient care.
Welcome, trailblazers, to a special journal club edition of the Digital Pathology Podcast. Uh, today we are taking a really deep dive into our sources to look at a massive structural problem in digital pathology and medical imaging.
Yeah. And it is a problem that is only getting bigger by the day.
Exactly. I mean to all of you trailblazers joining us who work at that intersection of healthcare and machine learning, you already know this. You see hospitals and research labs rapidly adopting foundation models and uh large language models.
They are everywhere now,
right? We're seeing these systems integrated into everything from complex diagnostics to patient triage. But and this is the but there's a glaring issue with this rapid adoption.
Our old reporting guidelines just cannot handle them.
No, they can't. They simply cannot handle the opaque, stochastic nature of generative AI. We are basically trying to measure this fluid, shifting target with a rigid wooden ruler
which is a perfect way to describe it.
So our mission for today's deep dive is to thoroughly review a groundbreaking 2026 paper published in Diagnostic and Interventional Radiology.
A really essential read for anyone in the field.
Truly. It's titled "Reporting checklist for foundation and large language models in medical research (REFINE): an international consensus guideline,"
authored by Ismail Mese, Tugba Akinci D'Antonoli, Christian Bluethgen, Burak Kocak, and just a massive international team.
Okay, let's unpack this. Why exactly do the legacy reporting tools fall so short when we apply them to a modern language model?
Well, um it really comes down to the fundamental architecture of the models we're evaluating today versus the models we were evaluating say 5 years ago, right?
Traditional AI reporting guidelines are incredibly robust. I mean, you have frameworks like CLAIM for medical imaging, TRIPOD-AI for prediction models,
and CONSORT-AI for clinical trials.
Exactly. But all of those guidelines were built under the assumption that the algorithm being tested is deterministic.
Meaning it does the same thing every time.
Right? In a deterministic system, if you feed the exact same chest X-ray or the exact same pathology slide data into the model, you are going to get the exact same output every single time.
But foundation models and large language models do not operate that way at all.
No, they don't. They are inherently stochastic. They are random.
There is this built-in probabilistic element to how they generate responses.
Yeah. And furthermore, their behavior is sensitive to variables that the older guidelines never even had to consider
like prompting strategies.
Prompting strategies, subtle parameter adjustments, and the hidden knowledge cut offs of their massive proprietary training data sets. When you combine the sheer scale of these models with their blackbox nature, the traditional checklists simply fail. They fail to capture the variables that actually dictate how the model behaves,
which is why we need REFINE.
Exactly. The REFINE guideline was created to bridge that gap. It demands a much stronger governance framework around intended use, output validation, and system transparency.
And that really explains the scale of the effort behind this paper, because the REFINE development group wasn't just a handful of researchers at one university.
No, it was 57 contributors.
57 contributors spanning 17 different countries. It was a highly coordinated global effort. I understand they utilized a modified Delphi process to build the framework, but how did they actually enforce consensus among such a diverse group of experts? So the steering committee prespecified and publicly archived a highly structured protocol before they even began. They went through two formal voting rounds, and that was followed by a harmonization phase.
And the voting was pretty strict, right?
Very strict. To keep any proposed item on the final checklist, it needed a strict 75% consensus from the panelists.
Wow. 75%.
Yeah. If an item didn't reach that threshold, or if it sparked too much disagreement regarding its wording, it was either heavily revised and put to another vote, or it was just scrapped entirely.
And the output of that rigorous filtering is the final framework. It's a robust 44-item checklist divided into six major sections.
And when researchers actually fill this out, the response options are yes, partial, no, and N/A.
Right? And I was looking at that structure, and honestly, giving researchers an N/A option feels a bit like handing them a get-out-of-jail-free card. Doesn't that open the door for study authors to just skip the harder, more complex evaluation items by claiming they aren't applicable?
I get why it looks that way, but the N/A option is actually a necessary functional requirement, not a loophole. The checklist is designed to apply universally across text, imaging, structured data, and multimodal applications.
Ah, I see.
If a research team is conducting a study strictly on a text-based LLM summarizing pathology reports, they cannot be expected to answer items regarding image synthesis or spatial resolution.
That would make no sense,
right? So the N/A option acts as a deliberate filtering mechanism so the checklist remains practical across highly diverse study designs. It prevents the framework from becoming this bloated, unworkable administrative burden.
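To make that filtering mechanism concrete, here is a minimal Python sketch of how yes/partial/no/N/A responses can be tallied so that N/A items drop out of the completeness denominator instead of counting against the study. The item names and the half-credit weighting for "partial" are illustrative assumptions, not part of the REFINE guideline itself.

```python
from collections import Counter

# Hypothetical subset of checklist responses; the real REFINE items differ.
responses = {
    "model_name_and_version": "yes",
    "knowledge_cutoff_date": "partial",
    "image_synthesis_details": "n/a",   # text-only study, so not applicable
    "prompt_content_verbatim": "yes",
    "temperature_and_sampling": "no",
}

counts = Counter(responses.values())
applicable = sum(n for answer, n in counts.items() if answer != "n/a")
# Assumed scoring: full credit for "yes", half credit for "partial".
completeness = (counts["yes"] + 0.5 * counts["partial"]) / applicable

print(f"Applicable items: {applicable}, completeness: {completeness:.2f}")
```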
And considering 68% of the Delphi panel were experts specifically in radiology-driven AI, the items that remain are hyper-relevant to clinical utility for the trailblazers listening today. It makes sense. It forces comprehensive reporting without demanding the impossible. Let's get into the actual meat of the framework, starting with sections one and two, which cover model specifications and prompt design.
This is where things get really detailed.
Yeah. Looking at section one, the required level of granular detail is a massive shift. The days of a methodology section vaguely stating we used a large language model are officially over.
Long gone.
Researchers must report the exact model name, the vendor, the specific version right down to the dated release identifier, and crucially the exact knowledge cutoff date.
What's fascinating here is how they mandate detailing the exact training and adaptation stage. It is no longer enough to just name the model.
You have to specify its precise state.
Yes. Was this foundation model pre-trained from scratch on a massive general medical corpus? Was it fine-tuned, meaning the actual neural weights were updated using supervised learning on your specific institutional data? Or was it purely an inference time adaptation?
And just to clarify, inference time adaptation means the core model weights remain frozen, but you are changing how it interacts with data at the moment of the query, right? Like using retrieval augmented generation to pull from a local clinical database before it formulates an answer.
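As a rough illustration of that distinction, here is a minimal retrieval-augmented generation sketch in Python. The model weights stay frozen; only the prompt is enriched with documents pulled from a local store at query time. Both `search_local_guidelines` and `call_model` are hypothetical stand-ins rather than any particular vendor's API.

```python
def search_local_guidelines(query: str, k: int = 3) -> list[str]:
    """Hypothetical retriever over an institutional clinical knowledge base."""
    # In practice this would be a vector or keyword search; here we return stubs.
    return [f"[local guideline snippet {i} relevant to: {query}]" for i in range(k)]

def call_model(prompt: str) -> str:
    """Placeholder for whatever frozen-weight inference endpoint a study used."""
    return "<model answer>"

def answer_with_rag(question: str) -> str:
    # Inference-time adaptation: the frozen model sees retrieved context in its prompt.
    context = "\n".join(search_local_guidelines(question))
    prompt = (
        "Answer using only the context below.\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
    return call_model(prompt)

print(answer_with_rag("Follow-up interval for a 7 mm solid pulmonary nodule?"))
```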
Precisely. You also have to report the computational requirements
like the hardware.
Yeah. The specific hardware and GPU nodes required to run it and whether you were making the code, the underlying data, and the model artifacts publicly available on repositories.
I imagine that level of transparency is going to cause some friction, especially for startups or proprietary labs that view their inference-time adaptation stack as their core intellectual property.
It definitely will. But from a scientific standpoint, without that data, reproducibility is dead in the water.
Which leads us perfectly into section two, prompt design. The checklist effectively treats prompt engineering with the exact same rigor as traditional source code. Prompting is basically the new coding.
It really is. It requires researchers to report their exact prompting strategy, whether that is zero-shot, few-shot, or chain-of-thought prompting.
And beyond just naming the strategy, the guideline requires researchers to report exactly how patient clinical context is integrated into that prompt.
This is huge. If you are using patient history, how is that information selected and formatted? Are you feeding the model a block of raw unstructured clinical text or are you standardizing diagnoses using ICD codes?
And in imaging scenarios, are you structuring findings with standardized reporting systems like BI-RADS for breast imaging?
Exactly. Or PI-RADS for prostate imaging. You have to provide the full prompt content verbatim.
That integration of clinical context is critical because if one study feeds a model raw physician notes full of typos and local hospital jargon and another study feeds it perfectly structured BI-RADS categories, you're testing two completely different clinical scenarios.
Even if you were using the exact same underlying model.
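Here is a minimal sketch of that contrast, using a hypothetical `build_prompt` helper: the same case can reach the model as raw free text or as standardized codes, and REFINE asks authors to report exactly which form was used, verbatim.

```python
# Two ways of injecting the same clinical context into a prompt.
raw_note = "57F, hx of L breast lumpectomy '19, new spiculated mass UOQ, pls advise"

structured_context = {
    "age": 57,
    "sex": "F",
    "icd10": ["Z85.3"],   # personal history of malignant neoplasm of breast
    "bi_rads": "4C",      # standardized suspicion category
    "finding": "spiculated mass, left upper outer quadrant",
}

def build_prompt(context, task="Recommend the next diagnostic step."):
    if isinstance(context, str):
        body = f"Clinical note (unstructured):\n{context}"
    else:
        body = "Structured context:\n" + "\n".join(f"- {k}: {v}" for k, v in context.items())
    return f"{body}\n\nTask: {task}"

print(build_prompt(raw_note))
print(build_prompt(structured_context))
```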
Exactly. The checklist also mandates that researchers detail the model's session memory policy. And I found this requirement fascinating.
It's a game changer for reproducibility.
We are so used to thinking of queries as isolated events. But if a pathologist is using a conversational model to work through a complex differential diagnosis, whether that model remembers the specific slide characteristics detailed three messages ago fundamentally alters the trajectory of the output.
It changes everything. If the model operates with a clean slate on every single prompt, its reasoning is isolated. But if it retains conversation history, it can build upon prior context.
But it can also compound early errors. Right.
Yes. Exactly. Without documenting that interaction style and memory policy, replicating a study's diagnostic results is practically impossible.
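A minimal sketch of the two interaction styles, again with a hypothetical `call_model` stand-in: the stateless version sends every prompt alone, while the conversational version resends the accumulated history, so earlier slide details, and earlier errors, carry forward.

```python
def call_model(messages: list[dict]) -> str:
    """Placeholder for the actual chat endpoint used in a study."""
    return f"<reply conditioned on {len(messages)} message(s)>"

# Stateless policy: every query is an isolated event with a clean slate.
def ask_stateless(question: str) -> str:
    return call_model([{"role": "user", "content": question}])

# Conversational policy: the full history is resent with every turn.
class ConversationalSession:
    def __init__(self):
        self.history: list[dict] = []

    def ask(self, question: str) -> str:
        self.history.append({"role": "user", "content": question})
        reply = call_model(self.history)
        self.history.append({"role": "assistant", "content": reply})
        return reply

session = ConversationalSession()
session.ask("H&E slide: nests of atypical cells, high mitotic rate. Differential?")
session.ask("IHC is S100 positive. Does that change your ranking?")  # builds on turn 1
```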
So once a researcher has perfectly documented the model specifications in the prompt design, they run head first into the next massive hurdle, the model's inherent unpredictability.
Yes,
here's where it gets really interesting. This brings us to sections three and four, dealing with stochasticity control and data set integrity. FMs and LLMs are inherently random. How does REFINE force researchers to corral the chaos of a probabilistic model?
Section 3 demands total transparency around generation parameters. Researchers must explicitly report settings like temperature and top-p sampling
and temperature dictates the creativity. Right.
Right. If a team is using a high temperature setting, the model's output will be more diverse and creative. That might be useful for brainstorming research hypotheses, but it is generally dangerous for strict diagnostic tasks
where you want a temperature closer to zero for focused deterministic outputs.
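To make those settings concrete, here is a hedged sketch of the generation parameters a study would report. The field names (temperature, top_p, seed, max_tokens) are common across many LLM APIs, but exact names and defaults vary by vendor, so treat this as an illustrative record rather than a specific API call.

```python
# Generation settings to report for a strict diagnostic task.
diagnostic_config = {
    "temperature": 0.0,     # near-deterministic: prefer the most probable tokens
    "top_p": 1.0,           # nucleus sampling effectively off at temperature 0
    "seed": 1234,           # fixed seed, where the vendor supports one
    "max_tokens": 512,
    "samples_per_prompt": 1,
}

# A brainstorming task would use very different settings, and those must be reported too.
brainstorm_config = {**diagnostic_config, "temperature": 0.9, "samples_per_prompt": 5}
```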
Exactly. But the checklist goes beyond just the machine's parameters. It demands transparency regarding the human operating it.
The prompt operator characteristics. This might be my favorite inclusion in the entire framework. The checklist requires studies to state exactly who is typing the prompts.
Because the prompt operator is a massive confounding variable, consider the difference between a first-year medical resident and a senior attending radiologist querying a model about a complex scan.
They're going to ask completely different questions.
The senior attending will intuitively include subtle clinical context, rule out obvious confounders in their prompt, and guide the model differently than the resident would. Their experience level drastically changes the input,
which heavily skews the output.
Yes. And the checklist also requires researchers to disclose how many prompt attempts were made and the specific methodology for how the final output was selected.
Meaning, if a model generates three different pathology reports for a single tissue sample, did a script automatically select the longest one
or did it select the one with the highest confidence score?
Or did a human expert review all three and manually cherrypick the most accurate one?
That distinction is vital. If an algorithm automatically selects the output, you are evaluating the system's autonomous capability. But if a human expert is cherry-picking the best response, you are no longer evaluating the AI.
You are evaluating a human AI collaborative workflow.
Exactly. And failing to specify that completely misrepresents the model's standalone clinical readiness.
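A small sketch of why that distinction matters: the three selection rules below can return three different "final outputs" from the exact same set of generations, so REFINE asks authors to name the rule they actually used. The candidate fields and scores here are hypothetical.

```python
candidates = [
    {"text": "Report A ...", "length": 220, "model_confidence": 0.71, "expert_rank": 2},
    {"text": "Report B ...", "length": 340, "model_confidence": 0.64, "expert_rank": 1},
    {"text": "Report C ...", "length": 180, "model_confidence": 0.82, "expert_rank": 3},
]

pick_longest = max(candidates, key=lambda c: c["length"])                   # automated rule
pick_most_confident = max(candidates, key=lambda c: c["model_confidence"])  # automated rule
pick_expert_choice = min(candidates, key=lambda c: c["expert_rank"])        # human-in-the-loop

# The first two rules evaluate the system's autonomous capability; the third
# evaluates a human-AI collaborative workflow and must be reported as such.
```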
It really pulls back the curtain on how the sausage is made in these studies. Which brings us to section four, tackling data set integrity. And this addresses what is arguably the most existential threat to foundation model research right now.
The problem of contamination.
Contamination. It invalidates more generative AI research than almost any other factor.
Because these models are pre-trained on vast undocumented swaths of the internet, there is a very high probability that public medical data sets were swept up in that training data.
The paper uses this great analogy of giving a student the final exam as a study guide and then praising them for getting a perfect score.
It's exactly like that.
If you want to test how well a new medical LLM can diagnose chest X-rays based on radiology reports and you use a public test data set like MIMIC-CXR version 2.0,
you have to prove the model didn't ingest MIMIC-CXR during its pre-training phase prior to the knowledge cutoff.
Right? Because if it did, the evaluation is entirely compromised. The model isn't demonstrating clinical reasoning. It is simply regurgitating the memorized answers
and proving that lack of contamination is incredibly difficult when dealing with closed-source proprietary models. The REFINE guideline explicitly requires authors to assess and report this risk by meticulously checking the publication dates of their test data sets against the model's declared knowledge cutoffs.
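A crude sketch of that screening step, with hypothetical dataset metadata: flag any test set whose public release predates the model's declared knowledge cutoff, since its cases may have been memorized during pre-training. A date comparison like this is only a first-pass screen, not proof that contamination is absent.

```python
from datetime import date

model_knowledge_cutoff = date(2023, 12, 1)  # as declared by the model vendor

# Illustrative release dates; verify against the actual dataset documentation.
test_datasets = {
    "MIMIC-CXR v2.0": date(2019, 9, 19),
    "Institutional 2025 cohort": date(2025, 3, 1),
}

for name, released in test_datasets.items():
    at_risk = released <= model_knowledge_cutoff
    status = "HIGH: report and mitigate" if at_risk else "low"
    print(f"{name}: contamination risk = {status}")
```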
They also mandate transparency around whether synthetic data was utilized. So if a research team augmented their training set with synthetic MR images generated by a diffusion model, that needs to be clearly documented.
Additionally, section 4 requires a detailed analysis of representational bias within the data sets. You must report the sample characteristics: things like age, sex, disease severity, and geographic origin.
And I want to be clear here for everyone listening. The checklist isn't treating representational bias as just a philosophical or ethical talking point. It is treating it as a fundamental data integrity and deployment issue. If a model is trained almost exclusively on data from an urban hospital in North America, it might fail spectacularly when deployed in a rural European clinic due to differences in imaging equipment or population demographics or local disease prevalence.
Documenting those sample characteristics is the only way the broader medical community can understand the limits of the model's generalizability. It is about defining the boundaries of safe deployment,
which flows directly into sections five and six, output evaluation, and real world implementation. For section five, output evaluation, how do we actually score these models? Because the framework pushes researchers far beyond basic accuracy scores.
They require a real hybrid approach to scoring. You have to explicitly list your specific metrics and justify why they're appropriate for a generative model. Traditional task metrics like AUC, which summarizes classification performance, are still important.
But generative models produce text and complex reasoning, not just binary classifications.
Right. So researchers might also need to use text similarity metrics like BLEU or ROUGE to compare the model's generated reports against human gold-standard reports.
But even text similarity has limits, right? A model could generate a report that overlaps 90% with a human report. But if the 10% that differs reverses the actual diagnosis, the text similarity score might look great while the clinical utility is an absolute disaster.
Which is precisely why REFINE requires human Likert ratings for actual clinical usefulness and factual correctness. Alongside those automated metrics, the checklist even sets standards for scenarios where researchers use an LLM as a judge.
Where a more advanced computationally heavy AI is utilized to score the outputs of the smaller model being tested.
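As a toy illustration of why the hybrid approach matters, the sketch below pairs a crude token-overlap score, a stand-in for a proper BLEU or ROUGE computation, with a clinician's Likert rating. A generated report can overlap heavily with the reference and still fail clinically.

```python
def token_overlap(reference: str, generated: str) -> float:
    """Crude unigram overlap; real studies would use established BLEU/ROUGE tooling."""
    ref, gen = set(reference.lower().split()), set(generated.lower().split())
    return len(ref & gen) / max(len(ref), 1)

reference = "no acute cardiopulmonary abnormality heart size normal lungs clear"
generated = "no acute cardiopulmonary abnormality heart size normal lungs show consolidation"

similarity = token_overlap(reference, generated)  # high overlap despite a reversed finding
clinician_likert = 1                              # 1-5 scale: clinically unusable

print(f"Text similarity ~{similarity:.2f}, clinical usefulness rating {clinician_likert}/5")
```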
Yes. And beyond just scoring the successes, section 5 mandates a formal failure analysis. It is no longer acceptable for a paper to just report that a model achieved 92% accuracy and call it a day.
The checklist demands that researchers categorize that 8% of failures.
Was the failure a pure hallucination where the model invented a finding? Was it a logical reasoning error or was it just a formatting issue where it output a paragraph instead of a bulleted list?
Tracking those specific failure modes, especially hallucination rates, is vital for establishing clinical trust.
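A minimal sketch of that kind of failure breakdown, with hypothetical error labels and counts: instead of a single accuracy figure, each failing case gets a category so hallucination rates can be tracked separately from benign formatting issues.

```python
from collections import Counter

# Hypothetical labels assigned during expert review of the cases the model got wrong.
failures = [
    "hallucination",    # invented a finding not present in the input
    "reasoning_error",  # correct facts, wrong conclusion
    "formatting",       # paragraph instead of the requested bulleted list
    "hallucination",
    "reasoning_error",
    "hallucination",
]

total_cases = 75
for category, n in Counter(failures).most_common():
    print(f"{category}: {n} cases ({n / total_cases:.1%} of all evaluated cases)")
```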
Furthermore, researchers must track how performance changes across different model versions over time. Generative models are constantly subjected to silent updates by their vendors.
Right. So testing Llama 3.1 8B version 1.0 might yield a completely different diagnostic breakdown than version 1.1.
And if a study does not record the exact version tested, the findings essentially have an expiration date the moment they are published.
So what does this all mean? We have covered the model specs, the prompting, the data integrity, and the evaluation. Section six, implementation, is where we transition from the controlled environment of the research lab into the actual clinic. This is where the rubber meets the road for patient care.
REFINE demands explicit reporting on clinical workflow integration. A model's performance in a vacuum is meaningless if it cannot seamlessly integrate into a clinician's workflow.
You have to know where the AI sits in the operational process.
Is it running autonomously in the background, embedded directly into the hospital's PACS, the picture archiving and communication system, for pre-reading triage? Or is it a standalone web dashboard that a clinician must actively choose to consult after they have drafted their initial report?
And crucially, the studies must explicitly state the non-use cases. You have to clearly outline the specific clinical scenarios where the model is unsupported, dangerous, and absolutely should not be utilized.
Setting those negative boundaries is arguably more important than highlighting the positive use cases. If a model was exclusively validated for triaging normal versus abnormal chest X-rays, using it to finalize a complex oncology staging diagnosis is a severe non-use case.
The guideline also mandates reporting on safety testing for clinically unsafe outputs and explicit documentation regarding data privacy safeguards. Data privacy is a massive hurdle when using these large cloud-based LLMs.
Right? If a hospital is using a public API to process patient pathology reports, they have to report their deidentification protocols and where that data is geographically routed.
Is the patient data being securely processed within an enterprise cloud environment that complies with regional health regulations or is it bouncing off a public server halfway across the world?
These are critical governance and auditability measures that healthcare IT departments need before they ever approve a deployment. It forces researchers to confront the logistical realities of hospital IT infrastructure from day one.
Exactly.
Now, despite how rigorous and potentially exhausting this 44 item checklist sounds, the authors did something incredibly smart for practical implementation. They didn't just publish a dense PDF and walk away. They built a fully mobile compatible web tool for this framework.
It is an excellent resource.
You can find it at refine checklist.github.io. It features interactive tooltips that explain exactly what each complex item means. It generates automated completion summaries, and it allows researchers to export their completed data directly to Excel for systematic reviews
or export it as a PDF to submit right alongside their manuscript to a medical journal. It drastically lowers the barrier to adoption
and the authors fully acknowledge that the field of generative AI is moving at a breakneck pace.
Right. Oh, absolutely. What is considered standard prompting or standard architectural design today might be entirely obsolete in 18 months. Because of this rapid evolution, REFINE is specifically designed as a living document,
meaning they aren't waiting a decade to issue a version two.
Exactly. The steering committee plans to formally re-evaluate and update the checklist items every two years. They are also actively exploring the potential for domain specific modular add-ons in the future.
So, we might eventually see specific extensions of the checklist tailored purely for text-only clinical documentation
or highly specialized modules for specific imaging modalities. If we connect this to the bigger picture, adopting these standards is going to require a massive cultural shift in medical research.
Journals, peer reviewers, and academic institutions are going to have to forcefully mandate these guidelines to make them the norm. Is this going to slow down the publication of AI research in medicine?
Undoubtedly, it will introduce friction and slow down the rapid-fire publication of lower quality AI studies. But that friction is entirely necessary. The introduction of the REFINE guideline marks a critical maturation point for medical AI research.
The integration of foundation models into healthcare carries enormous potential to alleviate clinician burnout and improve diagnostic accuracy.
But it also carries unprecedented systemic risk. Without this rigorous standardized documentation, any evidence generated by FM and LLM studies will remain impossible to trust, impossible to compare across different institutions, and impossible to reproduce.
It forces the discipline to grow up.
Precisely. REFINE provides the shared scientific vocabulary and the rigorous structural framework we desperately need to separate marketing hype from genuine reproducible clinical utility.
The payoff for that added friction is a solid foundation of evidence that frontline clinicians can actually rely on when treating patients. Well said.
To summarize, the core takeaway for today's deep dive. As foundation models and large language models aggressively move into digital pathology, radiology, and broader clinical workflows, the REFINE checklist is officially the new gold standard for research. By addressing the unique chaotic elements of generative AI like prompt sensitivity, stochastic generation, and the ever-present threat of data set contamination, this framework ensures that medical AI research remains transparent, reproducible, and above all, trustworthy for patient care.
It's a new era for medical AI validation.
It really is. Now, before we wrap up, I want to leave you with a final provocative thought to mull over as you head back to the lab or the clinic. As these large language models become increasingly advanced, multimodal, and perhaps even autonomous in their reasoning capabilities, could we eventually see a scenario where future versions of large language models are actually tasked with automatically evaluating their own research papers against the REFINE checklist before they are ever submitted for peer review?
That is a wild thought.
Imagine an AI acting as the initial rigorous gatekeeper for its own scientific transparency, flagging its own unrecorded temperature settings or reporting flaws before a human reviewer ever sees the manuscript.
The AI policing the AI.
Exactly. Thank you, trailblazers, for joining us on this journal club session of the Digital Pathology Podcast. We highly encourage you to read the full REFINE paper in Diagnostic and Interventional Radiology and definitely bookmark that online checklist tool for your institution's upcoming research. Keep pushing the boundaries, keep demanding transparency, and we will catch you on the next deep dive.