Digital Pathology Podcast

186: Beyond the Glass Slide – Fusing Pathology and Genomics into 64-Bit Barcodes

Subscriber Episode · Aleksandra Zuraw, DVM, PhD · Episode 186

This episode is only available to subscribers.

Digital Pathology Podcast +

AI-powered summaries of the newest digital pathology and AI in healthcare papers


Paper Discussed in this Episode: Multimodal learning for scalable representation of high-dimensional medical data. Alsaafin A, Shafique A, Alfasly S, Kalari KR and Tizhoosh HR (2026). Front. Digit. Health 7:1709277. doi: 10.3389/fdgth.2025.1709277

Episode Overview In this episode, we tackle the infrastructure challenge in digital diagnostics: how do we efficiently store, search, and integrate the overwhelming amount of multimodal data generated by modern medicine? We take a deep dive into a groundbreaking paper from the Kimia Lab at Mayo Clinic that proposes an audacious solution. Learn how researchers are compressing gigapixel whole slide images and complex immune receptor sequences into a tiny, searchable 64-bit binary barcode (a "monogram") to power the next generation of case-based reasoning in oncology.

Key Topics Discussed

The Intrinsic Heterogeneity Problem: Pathologists and computational biologists currently face a "silo" problem where visual whole slide images (WSIs) and textual immunogenomic data (T-cell and B-cell receptor sequences) exist in completely different computational worlds. Integrating them is like comparing a satellite photo of a city to a book of poetry written in that city.

Late vs. Early Fusion: Standard "late fusion" models are computationally heavy because they run two full, distinct pipelines, while "early fusion" often leads to the curse of dimensionality, creating huge continuous vectors that are impossible to search through in real-time.

Introducing MarbliX: We break down Multimodal Association and Retrieval with Binary Latent Indexed matriX (MarbliX), a framework designed to compress gigabytes of multimodal data into an 8x8 binary barcode.

Under the Hood of MarbliX (The 3 Phases):

    ◦ Phase 1 (Unimodal Transformation): The image data is prepped using SPLICE to segment tissue and fed into a DINO ViT vision transformer, while the messy genomic sequences are harmonized using "Seqwash" and fed into a BERT natural language model. Both output 768-dimensional vectors.

    ◦ Phase 2 (Multimodal Latent Association): The AI plays a "translation game" using hybrid autoencoders. One network looks at the tissue image to predict the genetic sequence, and the other looks at the genetics to predict the tissue architecture. This forces the model to learn the shared biological signal connecting phenotype and genotype.

    ◦ Phase 3 (Binarization): Using triplet contrastive learning, the model organizes patients in a mathematical space so similar diseases cluster together, eventually squashing the data into just 64 zeros and ones.

The Binary Trade-off & Hamming Distance: While binarization loses some precision compared to continuous floating-point math, it enables the use of "Hamming distance." This simple bitwise operation counts mismatches, allowing a database of 10 million patients to be searched in milliseconds on standard hardware.

Real-World Results: Tested on TCGA datasets, the MarbliX multimodal approach showed a massive 15% jump in retrieval performance over using histopathology images alone, achieving 85% to 89% accuracy in distinguishing lung cancer subtypes.

AI as a Librarian, Not a Judge: By retrieving the top 10 most similar historical cases based on barcode similarity, MarbliX empowers doctors with context and historical evidence rather than just giving a black-box diagnosis.

Get the "Digital Pathology 101" FREE E-book and join us!

Hello and uh welcome back to the digital pathology podcast.

Hey everyone, great to be here.

Yeah. And if you are tuning in today, you are likely one of the trailblazers we absolutely love to design these sessions for.

Right. The people on the front lines.

Exactly. Pathologists, computational biologists, researchers, people who aren't just, you know, satisfied with the way we've always done diagnostics

because the old ways are hitting a wall.

They really are. And today we are looking beyond the glass slide. We're looking at the uh the infrastructure of the future.

It is great to be back and I have to say infrastructure is the exact right word to start with here.

Yeah.

Yeah. I mean, we talk a lot about accuracy and AI, right? About getting that AUC as high as possible.

Oh, always. It's the obsession,

right? But we rarely talk about the plumbing.

Um how do we actually store, search, and retrieve this massive amount of data we are generating?

Exactly. We are drowning in data, but we're starving for connection. Beautifully put.

Thanks. So today we are doing a deep dive into a paper that proposes a really uh elegant solution to this exact problem.

A very recent one too.

Yes. It was published just recently, January 27, 2026 in Frontiers in Digital Health.

And this is coming out of the Kimia Lab at Mayo Clinic in Rochester.

Right.

The paper is titled, um, "Multimodal learning for scalable representation of high-dimensional medical data."

A bit of a mouthful.

It is, but the authors are Alsaafin, Shafique, Alfasly, and Kalari, and the senior author is HR Tizhoosh,

which if you know the space, Tizhoosh and the Kimia Lab are heavy hitters.

Absolutely.

They've been pushing the idea of hashing in image retrieval for a while now.

They have. And this paper really feels like a I don't know a culmination of that philosophy. They're tackling the silo problem headon.

So, let's set the stage for the trailblazers listening. In digital pathology, we have these two well

massive distinct worlds, right?

On one side, you have the histopathology, the whole slide images, the WSIs.

Yeah, these are gigapixel behemoths. Pure visual data, morphology, texture, architecture,

right? And then on the other side, you have the omics,

completely different beast.

Totally. Specifically in this paper, they are looking at immunogenomics, the T-cell and B-cell receptor sequences,

which isn't a picture at all.

No, it's a sequence. It's text. I mean, it's a language of amino acids.

It's like apples and oranges doesn't even cover it.

No, not at all.

It's more like um comparing a satellite photograph of a city to a book of poetry written in that city.

Oh, that is a fantastic analogy.

They are fundamentally different data types.

And that is the intrinsic heterogeneity problem the authors discuss right up front. Because these data types are so different, they usually live in completely different computational worlds.

Right. We see late fusion a lot.

All the time.

You build a model for the image, you build a model for the genes, and then you just sort of average the scores at the end.

Exactly. And late fusion is fine, but it's computationally heavy because you are running two full, distinct pipelines.

Double the work.

Plus, you miss the interactions. You miss how the satellite photo influences the poetry to borrow your analogy.

Right. So, the other option is early fusion where you smash the data together at the beginning.

But that leads to the curse of dimensionality.

It always does.

It does. Current multimodal models tend to produce what we call high-dimensional continuous embeddings.

Which means what, practically? Imagine representing a single patient not as a simple code but as a vector with thousands of decimal values,

which is fine if you're analyzing 50 patients in a research study.

Sure. But if you are a hospital system with a database of 2 million patients.

Oh boy.


Right. And you want to say, "Find me the patient who looks like this case." You cannot search through two million high-dimensional vectors in real time.


It's too slow.


The math is just too heavy.


So we have a storage problem, a search speed problem, and an integration problem. And this is exactly where the paper introduces MarbliX.


MarbliX. It stands for Multimodal Association and Retrieval with Binary Latent Indexed matriX.


I do love a complex acronym.


It's practically a requirement in this field.


It really is. But the binary part is what caught my eye immediately. The mission here is audacious.


Very.


They want to compress gigabytes of slide data and complex genomic sequencing into a tiny 8x8 binary barcode, a "monogram."


Just 64 bits.


It's wild.


64 zeros and ones. That is the entire representation of the patient's tumor morphology and their immune profile.


That seems intuitively impossible. I mean, a standard icon on my desktop is larger than 64 bits.


Yep.


How can you possibly capture the nuance of a tumor microenvironment and a genetic sequence in something that small?


It sounds like magic, but it is combinatorics.


Okay, lay it on me.


Even with just 64 bits, you have two to the power of 64 possible combinations,


which is a huge number.


That is roughly 1.8 × 10^19.


Wow. Okay. So, plenty of room for unique IDs.


Exactly. You aren't going to run out of codes. The challenge isn't space; the challenge is meaning. How do you generate those 64 bits so that they actually mean something clinically? How do you ensure that similar patients get similar codes?
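That head-room claim is easy to sanity-check in plain Python. Nothing here comes from the paper beyond the 64-bit figure:

```python
# Number of distinct 64-bit barcodes ("monograms") that can exist.
n_codes = 2 ** 64
print(f"{n_codes:,}")    # 18,446,744,073,709,551,616
print(f"{n_codes:.2e}")  # 1.84e+19 -- the "roughly 1.8 x 10^19" quoted above
```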


Well, let's look under the hood, then. This isn't just a compression algorithm like zipping a file.


No, not at all.


This involves some heavy deep learning architecture. They break it down into three phases,


right? Phase one is unimodal transformation. This is the prep work.


The ingredients.


Yeah. You can't just throw raw data into the blender. They need to turn the images and the genes into compatible mathematical vectors first.


Let's start with the images because they don't just dump the whole slide image in.


No, that would be way too noisy


and huge,


right? So, they use a method called SPLICE. Essentially, it segments the tissue from the background.


It ignores the glass.


Ignores the glass, the white space, and focuses only on the tissue. It creates a sort of collage of the most representative patches,


and then they feed that collage into a vision transformer.


Yes. Specifically, DINO ViT.


D-I-N-O.


Exactly. This is a very popular model in our field right now because it is self-supervised


meaning no annotations needed.


Right. It doesn't need a human to circle the cancer cells first. It learns the visual features on its own. The stroma, the nuclei, the texture,


and it outputs a vector.


A 768-dimensional vector. Think of it as a mathematical summary of what the cancer looks like visually.


Okay, so that's the visual side. Now the genomic side. They were using RNA-seq data from TCGA, the Cancer Genome Atlas.


Correct. But they aren't looking at the whole genome.


Too much noise again.


Exactly. They are zooming in on the immune repertoire, the T-cell and B-cell receptors. They use a tool called TRUST4 to extract those sequences.


Got it. Genetic sequences are messy, right? They aren't all the same length.


They are incredibly messy. They vary in length, order, composition. You can't just plug them into a standard neural network.


So, what do they do?


So, the authors use a method called Seqwash.


Seqwash. It sounds like a laundry cycle.


In a way, it is. It cleans and harmonizes the data. It treats the genetic sequences like sentences in a text processing task.


Oh, that's clever.


Yeah. And once the data is washed, they feed it into a BERT model.


BERT. And that's the architecture that revolutionized Google Search and natural language processing.


Exactly. BERT is designed to understand context in language. Here, it's understanding the language of the immune receptors.


And I'm guessing it outputs a vector, too.


Yes. And just like with the images, it outputs a 768-dimensional vector.


Okay. So, end of phase one, we have an image vector and a genomic vector. They're the same size, but they speak completely different languages,


right? And this brings us to phase two, the multimodal latent association.


This is where it gets really cool.


This is where the innovation really happens. They don't just staple these vectors together. They use hybrid autoencoders.


Walk us through this, because I think this is the real mind-meld moment of the paper.


Okay. Typically, an autoencoder takes an input, say an image, compresses it, and then tries to reconstruct the exact same image,


just to prove it kept the important parts,


right? It's a way of teaching the computer to identify the most important features. But MarbliX does something entirely different.


They cross the streams.


They do. They set up two networks. Network one takes the image features as input, but it is tasked with trying to reconstruct the genomic features as the output


and network 2 does the reverse.


Yes. It takes the genomics and tries to reconstruct the image.


That is fascinating. So the AI is effectively playing a translation game.


Exactly.


It has to look at the tissue slide and predict what the T-cells are doing genetically, and then it has to look at the T-cells and predict what the tissue architecture looks like.


Precisely. And the only way that AI can succeed at this game is if there is a shared biological signal,


a connection,


right? It forces the model to learn the intrinsic links between phenotype, what it looks like, and genotype, what it is.


Because if they weren't related, it would just output garbage.


If those two things were unrelated, the model would completely fail.


So by forcing this translation, they're filtering out the noise and keeping only the signal that connects the two worlds.


Exactly. They are extracting the coherence of the disease.
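The translation game described above can be sketched in a few lines of NumPy. This is a toy illustration only, not the paper's actual architecture: the weights are random rather than trained, and the 128-dimensional latent size is an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
D, LATENT = 768, 128  # 768-dim unimodal vectors; latent size is illustrative

# Random stand-in weights; in the real framework these are learned by backprop.
W_enc = rng.normal(0, 0.01, size=(D, LATENT))  # image embedding -> shared latent
W_dec = rng.normal(0, 0.01, size=(LATENT, D))  # shared latent -> genomic embedding

def cross_reconstruct(image_vec):
    """Encode the image embedding, then try to decode the *genomic* embedding."""
    z = np.tanh(image_vec @ W_enc)
    return z @ W_dec

image_vec = rng.normal(size=D)  # stand-in for the DINO ViT output
gene_vec = rng.normal(size=D)   # stand-in for the BERT output

pred = cross_reconstruct(image_vec)
loss = np.mean((pred - gene_vec) ** 2)  # training would minimize this error
print(f"cross-modal reconstruction error: {loss:.3f}")
```

The second network of the pair is the mirror image: swap the roles of `image_vec` and `gene_vec`, and driving both losses down is what forces a shared biological signal into the latent space.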


Okay. So now we have these super smart fused features.


But they're still complex decimal vectors. We promised the trailblazers listening a barcode. How do we actually get to the 64 bits?


That is phase three. They feed these fused features into a final network they call network Q. And this network is trained using triplet contrastive learning.


I love triplet loss. It feels very intuitive to me. You have an anchor, a positive, and a negative.


It is the standard for metric learning. Imagine the anchor is the patient we are currently analyzing.


Okay.


The model looks through the training data to find a positive. Another patient with the exact same diagnosis. Got it.


Then it finds a negative. A patient with a different diagnosis


and it basically treats them like magnets.


That is a great way to visualize it. The math adjusts the parameters to pull the anchor closer to the positive and push it away from the negative.


So it's organizing the patients in mathematical space so that similar diseases physically cluster together.


Exactly. And then the final step is the binarization,


the actual squashing,


right? The final layer of the network forces every single value to be either a zero or a one. It quantizes that complex spatial relationship into the 8x8 monogram.
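A minimal sketch of those two steps, triplet loss followed by sign-threshold binarization. The margin value and the zero threshold are illustrative assumptions; the paper's exact quantization scheme may differ.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Pull the anchor toward the positive, push it away from the negative."""
    d_pos = np.sum((anchor - positive) ** 2)
    d_neg = np.sum((anchor - negative) ** 2)
    return max(0.0, d_pos - d_neg + margin)

def binarize(vec):
    """Quantize a continuous 64-d embedding into the 8x8 binary monogram."""
    return (vec > 0).astype(np.uint8).reshape(8, 8)

rng = np.random.default_rng(1)
anchor = rng.normal(size=64)                     # patient under analysis
positive = anchor + rng.normal(0, 0.1, size=64)  # same diagnosis: nearby
negative = rng.normal(size=64)                   # different diagnosis: far

print("triplet loss:", triplet_loss(anchor, positive, negative))
print(binarize(anchor))  # the 8x8 grid of zeros and ones
```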


See, this is the moment where I get a bit skeptical.


Why is that?


Because you are taking high-fidelity continuous math, where you can have a very precise value like 0.8934, and you are just rounding it off to a one,


right?


Don't you lose a massive amount of information doing that?


You absolutely do lose information. That is what they call the binary trade-off.


Okay.


The question is: do you lose clinically relevant information? And that is what they had to prove with the evidence.


So let's look at the results. They tested this on TCGA datasets, specifically lung cancer and kidney cancer.


Right. For lung cancer, they were distinguishing between adenocarcinoma (LUAD) and squamous cell carcinoma (LUSC),


which is a classic pathology problem. They can look somewhat similar under the microscope sometimes, but the treatment path is very different.


Exactly. And when they used MarbliX to search and retrieve cases, they achieved accuracy and F1 scores between 85% and 89%.


89% is impressive for a pure retrieval system. But how did that compare to just using the modalities alone? Like just the images.


This is the key finding of the whole paper. If you used just histopathology, accuracy hovered around 69% to 71%.


Okay.


If you use just the immunogenomics, it was about 73% to 76%.


Wow. So, combining them via MarbliX gave a 15% jump in performance over the baseline.


Yes,


that is not a marginal gain. That is a massive leap.


It validates the whole hypothesis. The multimodal view provides a signal that neither modality has on its own.


And what about the kidney dataset? They looked at three subtypes there: KIRC, KIRP, and KICH.


The results were even stronger there, actually. The real-valued monograms, meaning before they were fully squashed to binary, hit up to 90% accuracy.


But did the binarization hurt those kidney results? We talked about that trade-off.


It did a little bit. When they forced it to fully binary, the F1 score dropped to about 78% to 82%.


Okay. So, the trade-off is real.


Yes, the trade-off is real. You sacrifice a little bit of precision for massive scalability. But the authors argue that for a retrieval system where you really just want to find the top 10 most similar patients, 82% is still very robust.


There was a visual in the paper, figure six, that I thought was just brilliant. The heat map.


Yes, the visual proof. They wanted to show that the bits weren't just random noise.


So they took the barcodes of different patients and XORed them.


XOR, exclusive or. Basically asking the computer: are these bits different?


Right. And when they compared patients with the same cancer, the intraclass comparison, the map was mostly white,


meaning the codes were nearly identical.


Exactly. Very few differences. But when they compared, say, lung adeno to squamous cell, the map lit up with yellow and purple pixels,


showing that the bits were flipping


showing clear distinction in the binary code. It proved that the structure of the code had physically changed to reflect the different biology. The diagnosis is quite literally encoded in the pattern of the zeros and ones.
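The XOR comparison behind that figure is a one-liner in NumPy. The toy monograms below are random stand-ins, not real patient codes; the bit counts just illustrate the intra- versus inter-class contrast.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy 8x8 monograms: two patients of the same subtype share almost every bit;
# a patient of a different subtype differs in many positions.
base = rng.integers(0, 2, size=(8, 8), dtype=np.uint8)
same_class = base.copy()
same_class[0, 0] ^= 1  # only one bit flipped
other_class = rng.integers(0, 2, size=(8, 8), dtype=np.uint8)

intra = np.bitwise_xor(base, same_class)   # mostly zeros: the "white" map
inter = np.bitwise_xor(base, other_class)  # many ones: the map "lights up"

print("intra-class mismatches:", int(intra.sum()))  # 1
print("inter-class mismatches:", int(inter.sum()))  # around half of the 64 bits
```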


That is incredible. So, bringing it back to reality, why does this matter for the trailblazers listening? Why should a pathologist or a hospital administrator care about binary codes?


It comes down to two big things. Searchability and scalability. We touched on hamming distance earlier.


Let's explain that, actually. Why is Hamming distance better than the way we usually measure distance in AI?


Well, if you have continuous vectors, vectors with decimals, you have to use Euclidean distance or cosine similarity,


which means lots of math.


Exactly. It involves floating-point multiplication, square roots. It's computationally expensive for the processor.


But with binary,


With binary, you use Hamming distance. It's a simple bitwise operation. It just asks: is this bit a zero or a one?


It just counts the mismatches.


Exactly. And computers are incredibly fast at this. You can search a database of 10 million patients in milliseconds on standard hardware. You don't need a supercomputer.


And that capability enables a completely new way of practicing medicine: case-based reasoning.


This is the real paradigm shift. Right now, most AI acts as a classifier. You give it a slide and it acts like a judge.


Diagnosis is cancer. Confidence 99%. Right.


And that's a black box. Doctors often don't trust it because they don't know why it thinks that.


Exactly. MarbliX turns the AI into a librarian instead of a judge. I love that framing.


The pathologist uploads a slide. The system converts it to a monogram and says, "I'm not going to tell you what this is. Instead, here are the 10 most similar patients from our hospital's history who had this exact same barcode.


Wow.


Here is how they were treated. Here's how they survived."


That is so much more powerful. It gives the physician context. It allows the human to make the final call based on real historical evidence.


It empowers the human rather than replacing them. And there is another benefit to this approach, too: stability.


right? The redundancy factor.


We all know pathology slides are messy. Tissue folds, staining artifacts, blur, pen marks from a Sharpie,


the bane of digital pathology.


Exactly. If a model only looks at the image, a bad scan breaks the prediction completely. But with MarbliX, the genomic part of the code acts as a stabilizing anchor.


So even if the image is garbage, the immune signal keeps the retrieval accurate.


Exactly. The multimodal nature makes the system robust against the daily messiness of clinical data. If one modality fails, the other holds the line.


Now, we have to be rigorous here. We can't just blindly praise the paper. What are the technical limitations? What breaks this system?


The authors were very transparent about this, actually. The first limitation is the data source itself. They use TCGA,


and TCGA is the gold standard for research, but it's not perfect.


It's not real-world clinical data. It has variable slide quality, different scanners, different staining protocols across institutions.


So there's a risk of bias.


There's always a risk that the model learns batch effects,


meaning it learns to recognize that, say, hospital A uses a slightly darker pink dye rather than learning what the tumor actually looks like.


Exactly. While the multimodal approach helps mitigate that because it has to correlate with the genes, you really need to validate this on external independent data sets to be absolutely sure. Human in the loop oversight is still required.


And what about the collision problem? If we only have 64 bits, Eventually, two very different patients will get the same code just by bad luck.


It is a mathematical certainty. They call it a hash collision. In this study, with a few thousand patients, it wasn't a major issue. But if you scale this to a national database of 100 million patients, you will get collisions.


How do you handle that in the real world?


You would need a tiered system. Maybe the 64-bit code is the initial global filter to get you down to 100 candidates instantly. Okay?


And then you use a more precise continuous model to rank those top 100. It's a coarse-to-fine search strategy,
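That tiered strategy can be sketched as a two-stage function. Everything below is an illustrative assumption: random data, a 100-candidate shortlist, and cosine similarity as the precise re-ranker.

```python
import numpy as np

rng = np.random.default_rng(3)
N, D = 10_000, 64

embeddings = rng.normal(size=(N, D))       # precise continuous representations
codes = (embeddings > 0).astype(np.uint8)  # cheap 64-bit binary filter codes

def coarse_to_fine(query_emb, k_coarse=100, k_final=10):
    """Stage 1: Hamming filter over every code. Stage 2: cosine re-ranking."""
    q_code = (query_emb > 0).astype(np.uint8)
    dists = np.count_nonzero(codes != q_code, axis=1)  # bitwise mismatch counts
    candidates = np.argsort(dists)[:k_coarse]          # fast global shortlist
    cand = embeddings[candidates]
    sims = cand @ query_emb / (
        np.linalg.norm(cand, axis=1) * np.linalg.norm(query_emb)
    )
    return candidates[np.argsort(-sims)[:k_final]]     # precise final ranking

query = embeddings[42] + rng.normal(0, 0.05, size=D)  # noisy copy of patient 42
top = coarse_to_fine(query)
print(top[:3])  # patient 42 should rank first
```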


That makes total sense. Use the binary speed for the heavy lifting and the precision math for the final mile. So, to summarize the whole MarbliX approach for you trailblazers out there: we are looking at the images and reading the sequences, we are fusing them through a mind-meld of autoencoders, and we are printing out a tiny searchable barcode that represents the patient's holistic state.


That is it in a nutshell. It paves the way for large-scale, real-time decision-support systems in hospitals. It moves us from AI as a classifier to AI as a retrieval system of medical wisdom.


It really makes you think about how we store data today. We are so used to keeping radiology in one PACS system, pathology in a separate LIS, and genomics in a text file somewhere else.


The data silos,


right? We build walls around our data types. Which brings up a provocative thought to leave you with today.


Let's hear it.


If we can compress a patient's tumor morphology and their immune system into 64 bits, what else are we missing, simply because we keep our data types in separate folders?


That is the big question.


The future of medicine might not be about collecting more data. We are drowning in data already. It might just be about how we connect the data we already have.


Connection over collection. I like that.


Trailblazers, thank you for listening to the deep dive on the Digital Pathology Podcast. If you want the nitty-gritty math on the triplet loss or the specific details on the autoencoder architecture, definitely pull up the paper in Frontiers in Digital Health.


It's a dense read, but absolutely worth your time.


It is. It's a true glimpse into where the field is going. Until next time, keep connecting the dots.


Goodbye, everyone.