Toward Universal Embeddings in Digital Pathology through Multimodal LLMs
Abstract
Pathology plays a critical role in diagnosing a wide range of diseases, yet existing approaches often rely heavily on task-specific models trained on extensive, well-labeled datasets. These methods face sustainability challenges due to the diversity of pathologies and the labor-intensive nature of data collection. To address these limitations, we highlight the need for universal multimodal embeddings that can support multiple downstream tasks. Previous approaches involve fine-tuning CLIP-based models, which handle images and texts separately, limiting their ability to capture complex multimodal relationships. Additionally, these models are evaluated across diverse datasets without a unified benchmark. In this paper, we explore the possibility of applying Multimodal Large Language Models (MLLMs) to generate pathology universal embeddings to address these challenges. Our contributions can be summarized in the following aspects: 1) We propose MLLM4PUE, a novel framework that leverages MLLMs to generate embeddings for various pathology downstream tasks. 2) We further introduce the Pathology Multimodal Embedding Benchmark (PMEB), a comprehensive benchmark designed to assess the quality of pathology multimodal embeddings, which comprises 16 original tasks drawn from 15 datasets. 3) Extensive experimental results demonstrate the superiority of MLLM4PUE, illustrating that MLLM-based models can effectively support a wide range of downstream tasks and unify the research direction for foundation models in pathology.
1 Introduction

Pathology remains the gold standard for diagnosing a wide range of diseases [41, 32]. With advancements in artificial intelligence (AI), digital pathology has emerged as a promising field, using AI to address challenges across various tasks such as cancer diagnosis [29, 38], metastasis detection [43, 21], mutation prediction [8, 33], and survival analysis [40, 39, 44]. These achievements, however, rely on models trained on large, well-labeled datasets. Given the hundreds of tumor types cataloged in the WHO classification (tumourclassification.iarc.who.int) and the labor-intensive nature of data collection and annotation [11], developing separate models for each pathology task is impractical, especially for rare diseases with limited data [23, 30]. Furthermore, while many efforts focus on images, pathology often involves text, including diagnostic reports and research papers [6, 7, 31, 13]. Thus, there is a growing need to shift toward task-agnostic models that generate universal multimodal embeddings capable of supporting different pathology tasks.
Due to the success of the CLIP [34] model, several studies [30, 15, 37, 16, 31] have adapted CLIP-based models for pathology, training on extensive image-text pairs to extract embeddings that enable zero-shot evaluation across various pathology tasks. However, these methods encounter two significant challenges. First, CLIP [34] processes images and text independently, limiting its ability to capture complex interactions between modalities [18]. Additionally, CLIP [34] can only handle text inputs with a maximum length of 74 tokens, which restricts its ability to leverage longer and more detailed textual descriptions. Second, evaluations of these models have mostly centered on isolated tasks, such as classification or retrieval [30, 37, 16], which separate image and text components and do not fully assess multimodal integration capabilities. Moreover, inconsistencies in datasets and sample selection criteria across studies lead to non-reproducible and incomparable results. These challenges emphasize the need for a more integrated model and a standardized benchmarking framework to advance multimodal embeddings in pathology.
To tackle the first challenge, we propose MLLM4PUE, a framework that uses an MLLM to generate universal multimodal embeddings for pathology. Unlike CLIP [34], which processes images and text separately, an MLLM employs a transformer-based architecture to fully integrate these modalities, allowing it to learn from complex image-text relationships [28, 27]. Moreover, an MLLM can handle longer textual inputs than CLIP, overcoming the 74-token limitation and enabling richer contextual understanding, which is important in the pathology domain. To harness the capability of MLLMs in generating embeddings, we take the following steps. First, we employ summarization prompts to guide the MLLM to output embeddings. Next, we fine-tune the MLLM with a LoRA module using pathology image-text pairs, adapting it specifically for the pathology domain.
To address the second challenge, we introduce the Pathology Multimodal Embedding Benchmark (PMEB) to establish a standardized evaluation platform for multimodal embeddings in pathology. This benchmark encompasses three meta-task categories: retrieval, composed retrieval, and classification. To ensure the tasks effectively assess multimodal embeddings, we have reformulated them to better suit this purpose. Each task provides the model with prompts to process a query and identify the correct target from a set of candidates. These queries and targets can consist of images, text, or a combination of both, enabling a flexible and comprehensive evaluation of multimodal capabilities. PMEB offers a standardized and reproducible approach for assessing multimodal embeddings in pathology. Overall, our contributions are summarized as follows:
• We introduce MLLM4PUE, a framework that converts pathology images and texts into robust embeddings within a unified model. To the best of our knowledge, this is the first work to explore repurposing MLLMs for extracting embeddings in the pathology domain.
• We provide PMEB, a comprehensive benchmark for evaluating multimodal embeddings in pathology, encompassing a broad range of tasks to support robust and generalizable assessment of these embeddings.
• MLLM4PUE effectively represents multimodal information, consistently surpassing baseline models across the majority of evaluated tasks, as demonstrated by the quantitative results in Fig. 1.
• Our approach offers new insights into digital pathology by adopting MLLMs for universal embeddings. Future research can build on this work to unify the development of models for diverse pathology tasks.
Table 1: Meta-tasks, data sources, and sample counts in PMEB (Query/Target: I = image, T = text).

| Meta-Task | Data source | Query | Target | Original Tasks | Original samples | Selected samples |
|---|---|---|---|---|---|---|
| Retrieval | Arch-book [12] | I | T | Image-text retrieval | 1,306 | 1,306 |
| Retrieval | Arch-book [12] | T | I | Text-image retrieval | 1,306 | 1,306 |
| Retrieval | Arch-pubmed [12] | I | T | Image-text retrieval | 1,923 | 1,923 |
| Retrieval | Arch-pubmed [12] | T | I | Text-image retrieval | 1,923 | 1,923 |
| Composed retrieval | Quilt-VQA [35] | I+T | T | Histopathology VQA | 985 | 724 |
| Composed retrieval | Quilt-VQA-RED [35] | I+T | T | Histopathology VQA | 335 | 252 |
| Classification | Bach [2] | I | T | Breast tissue | 399 | 399 |
| Classification | CRC-100k [19] | I | T | Colorectal cancer | 100,000 | 1,000 |
| Classification | Databiox [19] | I | T | Colorectal cancer | 922 | 922 |
| Classification | Digestpath [9] | I | T | Signet ring cell | 455 | 455 |
| Classification | Digestpath [9] | I | T | Colonoscopy tissue | 660 | 660 |
| Classification | Kathercolon [20] | I | T | Colorectal cancer tissue | 7,180 | 1,000 |
| Classification | LC25000 [4] | I | T | Colon adenocarcinoma | 10,000 | 1,000 |
| Classification | LC25000 [4] | I | T | Lung adenocarcinoma | 15,000 | 1,000 |
| Classification | Osteo [3] | I | T | Osteosarcoma | 1,144 | 1,144 |
| Classification | Renalcell [5] | I | T | Renal tissue | 36,687 | 1,000 |
| Classification | Sicap [36] | I | T | Gleason pattern | 12,081 | 1,000 |
| Classification | Skincancer [22] | I | T | Skin cancer tissue | 129,369 | 1,000 |
| Classification | Wsss4luad [14] | I | T | LUAD tissue | 10,091 | 1,000 |
2 Related Work
Vision-language Models in Digital Pathology. In digital pathology, recent studies have adopted the CLIP [34] model, leveraging paired image-text data within a contrastive learning framework to align similar image-text embeddings while separating dissimilar ones [30, 15, 37, 16, 17]. For instance, PLIP [15] fine-tunes CLIP on large collections of image-text pairs sourced from platforms like Twitter and other public datasets. Similarly, PathCLIP [37] and QuiltNet [16] scale up pathology-specific data for training. CONCH [31] employs over 1.17 million image-caption pairs to fine-tune the CoCa [42] model. Despite these advancements, such methods fine-tune either CLIP [34] or CoCa [42] using pathology image-caption pairs and rely on separate encoders for processing images and text. In contrast, we propose an MLLM-based framework to capture universal multimodal embeddings for pathology by integrating image and text data within a unified model. This integrated framework enhances performance across a wide range of digital pathology tasks, offering a more holistic and effective solution.
Multimodal Embedding Learning. Creating effective multimodal embeddings remains a challenging area of research. CLIP [34] addresses this by employing separate encoders for images and text, aligning them in a common space through contrastive learning. This approach has influenced many subsequent models, such as BLIP [25] and CoCa [42]. While these models excel in a variety of tasks, the use of distinct encoders restricts their capacity to fully integrate visual and textual data, which is a key requirement for tasks that involve combined visual-language inputs, such as search queries involving both images and text.
With the rise of large language models (LLMs), MLLMs have extended LLMs to process multiple data modalities, achieving notable progress in understanding and reasoning across diverse input types [28, 27, 24]. Although MLLMs exhibit strong capabilities in interpreting multimodal content and following complex instructions, there is still limited research on their application in creating embeddings. The recent work E5-V [18] fine-tunes an LLM with summarization prompts and text-only data to extract embeddings, then integrates a vision module to obtain multimodal embeddings for zero-shot multimodal retrieval tasks. However, for pathological adaptation, E5-V [18] requires a domain-specific LLM pretrained on pathology instruction data, which demands detailed and standardized annotations. In contrast, our framework, MLLM4PUE, fine-tunes the MLLM with a LoRA module on existing pathology image-text pairs, bypassing these prohibitive requirements.

3 Methodology
In this section, we first introduce the Pathology Multimodal Embedding Benchmark (PMEB), a comprehensive benchmark for evaluating multimodal embeddings in diverse pathology tasks. Then, we present the MLLM4PUE framework, which integrates multimodal embeddings, contrastive learning, and zero-shot transfer methods to align and adapt pathology image-text data effectively, as illustrated in Fig. 2.
3.1 Pathology Multimodal Embedding Benchmark
We propose PMEB, a comprehensive benchmark designed to evaluate pathology multimodal embeddings. PMEB includes cancer tissue classification [2, 4, 22], Gleason pattern grading [36], and cell identification [9], among other tasks. As shown in Table 1, PMEB comprises 15 original tasks from 14 datasets, organized into three meta-tasks: retrieval, classification, and composed retrieval. All tasks are reformulated to assess multimodal embeddings, where the model receives an instructional prompt and a query, which may consist of text, images, or both, and selects the correct target from a set of candidates.
In the retrieval task, the query can be either an image or text, with the target being the corresponding paired text or image. For the classification task, queries are images with targets as text labels representing various classes. The composed retrieval task involves queries that combine an image with text formatted as a question, with the target being the expected answer. This composed retrieval task, specifically designed for multimodal embedding evaluation, incorporates integrated visual and language inputs and has not been applied in previous methods.
Due to the large size of some datasets, we standardize their sizes. For datasets containing more than 1,000 samples (except Osteo [3]), we choose a random sample of 1,000, ensuring the original category distribution is preserved. For the composed retrieval task, we select only open-ended questions from the data source, where answers are more complex than a simple “yes” or “no”. Additionally, since there is an overlap between questions and answers in the VQA data source, we utilize ChatGPT [1] for data cleaning to ensure clarity and accuracy in task evaluation. Further details of each dataset within PMEB are provided in Supplementary Materials.
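For illustration, the stratified subsampling step described above could be implemented as in the following sketch, which uses scikit-learn to preserve the class distribution while capping a dataset at 1,000 samples; the function and variable names are illustrative, not the released preprocessing code.

```python
# Minimal sketch: stratified subsampling that keeps the original class proportions.
from sklearn.model_selection import train_test_split

def subsample_dataset(image_paths, labels, n_keep=1000, seed=42):
    if len(image_paths) <= n_keep:
        return image_paths, labels
    # `stratify=labels` preserves per-class proportions in the kept subset.
    kept_paths, _, kept_labels, _ = train_test_split(
        image_paths, labels,
        train_size=n_keep,
        stratify=labels,
        random_state=seed,
    )
    return kept_paths, kept_labels
```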
3.2 Multimodal Embeddings with MLLM
We first introduce how our proposed MLLM4PUE leverages MLLMs for universal multimodal embedding learning. Unlike CLIP, which directly generates embeddings for multimodal inputs, MLLMs are inherently generative models and require a different approach to produce embeddings. We use a prompt-based strategy specifically designed to guide MLLMs in embedding multimodal data.
Given an MLLM $\Phi$ and a query $q$, we use the prompt <query>\n Summarize the above <modality> in one word: to generate the embedding of $q$. In this setup, <query> serves as a placeholder for the query content, while <modality> indicates the type of the query's content. For instance, if the query is an image, the prompt would be <image>\n Summarize the above H&E image in one word:; if the query is a sentence, then the prompt would be <text>\n Summarize the above sentence in one word:.
The prompt design incorporates two components: first, the word Summarize directs the MLLM to distill the information of the multimodal input. The second part, in one word:, instructs the MLLM to compress this distilled information into a single token, thereby facilitating a unified multimodal embedding. Once the extended input is constructed, it is processed by the MLLM to generate the embedding $e_q = \Phi(\mathrm{prompt}(q))$, where $e_q$ is the hidden state of the last token obtained from the MLLM output, as illustrated in Fig. 3. This approach tailors the generation capabilities of MLLMs to effectively produce and utilize multimodal embeddings, expanding their application potential beyond traditional generative outputs.
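As a concrete illustration, the prompt-and-last-token scheme can be sketched with the Hugging Face transformers API. The checkpoint identifier and the `embed` helper below are assumptions for illustration rather than the released implementation, and the raw prompts omit the model's chat template for brevity.

```python
# Sketch: single-vector embeddings from an MLLM via the "Summarize ... in one word:" prompt.
import torch
from PIL import Image
from transformers import LlavaNextForConditionalGeneration, LlavaNextProcessor

MODEL_ID = "llava-hf/llama3-llava-next-8b-hf"  # assumed LLaVA-NeXT-8B checkpoint
processor = LlavaNextProcessor.from_pretrained(MODEL_ID)
model = LlavaNextForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype=torch.float16, device_map="auto"
)

@torch.no_grad()
def embed(image: Image.Image | None = None, text: str | None = None) -> torch.Tensor:
    """Return one embedding vector for an H&E image or a sentence."""
    if image is not None:
        prompt = "<image>\nSummarize the above H&E image in one word:"
    else:
        prompt = f"{text}\nSummarize the above sentence in one word:"
    inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
    outputs = model(**inputs, output_hidden_states=True)
    # The embedding is the last-layer hidden state of the final prompt token.
    return outputs.hidden_states[-1][:, -1, :]
```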
3.3 Pathology Contrastive Learning
Next, we describe the approach for projecting pathological data into a latent embedding space using contrastive learning. As illustrated in Fig. 2, given a batch of $N$ paired image-text samples $\{(x_i^{img}, x_i^{txt})\}_{i=1}^{N}$ and an MLLM $\Phi$, for each sample pair we feed the image and the text into $\Phi$ to obtain the image embedding $e_i^{img}$ and text embedding $e_i^{txt}$ as described in Sec. 3.2.
The objective is to build a latent space for pathology image-text embeddings that maximizes the similarity between embeddings of the paired samples in the batch, while minimizing the similarity between embeddings of the incorrect pairs. We optimize an InfoNCE contrastive loss to train our model:

$$\mathcal{L} = -\frac{1}{2N}\sum_{i=1}^{N}\left[\log\frac{\exp\big(s(e_i^{img}, e_i^{txt})/\tau\big)}{\sum_{j=1}^{N}\exp\big(s(e_i^{img}, e_j^{txt})/\tau\big)} + \log\frac{\exp\big(s(e_i^{txt}, e_i^{img})/\tau\big)}{\sum_{j=1}^{N}\exp\big(s(e_i^{txt}, e_j^{img})/\tau\big)}\right] \tag{1}$$

where $\tau$ is a temperature parameter and $s(\cdot,\cdot)$ denotes cosine similarity; the two terms correspond to the image-to-text and text-to-image directions of contrastive learning.
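A minimal PyTorch sketch of the symmetric objective in Eq. (1) is given below; it assumes the embeddings are L2-normalized so that dot products equal cosine similarities.

```python
# Sketch of the symmetric InfoNCE loss used for pathology contrastive learning.
import torch
import torch.nn.functional as F

def info_nce(img_emb: torch.Tensor, txt_emb: torch.Tensor, tau: float = 0.02) -> torch.Tensor:
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / tau                      # (N, N) cosine similarities / temperature
    targets = torch.arange(img.size(0), device=img.device)
    # Image-to-text and text-to-image cross-entropy terms, averaged.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))
```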

3.4 Zero-shot Transfer Evaluation
Our model, trained primarily on retrieval tasks focused on image-text pairs, requires adaptation to handle zero-shot classification and composed retrieval tasks. The details of how we modify the model for these different zero-shot downstream tasks are introduced as follows.
For zero-shot classification, we adopt a prompt-based method inspired by the CLIP model [34]. In this approach, each class name is expanded into a sentence using a specific template. For instance, the class name “Colon adenocarcinoma” is converted into the sentence “An H&E image of Colon adenocarcinoma” using the template “An H&E image of { }.” We apply this method to create sentences for all class names. Our model then computes embeddings for these sentences and the test images, extracting the last token of the MLLM output as described in Section 3.2. The similarity between these embeddings is calculated as outlined in Section 3.3, and test image labels are assigned based on the highest similarity scores.
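A sketch of this classification protocol, reusing a generic `embed` function as in Sec. 3.2 (names are illustrative), might look as follows.

```python
# Sketch: zero-shot classification via template prompts and cosine similarity.
import torch
import torch.nn.functional as F

def zero_shot_classify(test_images, class_names, embed):
    prompts = [f"An H&E image of {name}" for name in class_names]
    class_emb = F.normalize(torch.cat([embed(text=p) for p in prompts]), dim=-1)      # (C, D)
    img_emb = F.normalize(torch.cat([embed(image=im) for im in test_images]), dim=-1)  # (N, D)
    sims = img_emb @ class_emb.t()   # (N, C) cosine similarities
    return sims.argmax(dim=-1)       # predicted class index for each test image
```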
For the composed retrieval task, where the query consists of both an image and a question, we need to convert this multimodal information into a unified embedding. Thus, we use the prompt <image>\n <question>\n Summarize the above H&E image and question in one word: to integrate the image and the question. Our model then generates embeddings for this combined input, as explained in Section 3.2, which are used to retrieve the embeddings of the corresponding answers.
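Assuming the same processor and model as in the sketch of Sec. 3.2, the composed query could be embedded as follows; the helper name is hypothetical.

```python
# Sketch: fusing an image and a question into one embedding with a single prompt.
import torch

@torch.no_grad()
def embed_composed(image, question, processor, model):
    prompt = f"<image>\n{question}\nSummarize the above H&E image and question in one word:"
    inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
    outputs = model(**inputs, output_hidden_states=True)
    return outputs.hidden_states[-1][:, -1, :]  # last-token embedding of the fused query
```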
Table 2: Zero-shot cross-modal retrieval results on Arch-PubMed [12] and Arch-book [12] (i2t: image-to-text; t2i: text-to-image).

| Task | Model | Arch-PubMed Recall@5 | Arch-PubMed Recall@10 | Arch-PubMed Recall@50 | Arch-book Recall@5 | Arch-book Recall@10 | Arch-book Recall@50 |
|---|---|---|---|---|---|---|---|
| i2t | E5-V [18] | 0.006 | 0.011 | 0.052 | 0.009 | 0.018 | 0.067 |
| i2t | PLIP [15] | 0.037 | 0.067 | 0.185 | 0.096 | 0.152 | 0.393 |
| i2t | MLLM4PUE-O | 0.105 | 0.160 | 0.366 | 0.185 | 0.264 | 0.575 |
| i2t | PathCLIP [37] | 0.275 | 0.388 | 0.680 | 0.152 | 0.234 | 0.482 |
| i2t | MLLM4PUE-P | 0.372 | 0.495 | 0.782 | 0.192 | 0.283 | 0.603 |
| i2t | QuiltNet [16] | 0.069 | 0.111 | 0.273 | 0.116 | 0.168 | 0.384 |
| i2t | MLLM4PUE-Q | 0.122 | 0.177 | 0.407 | 0.182 | 0.248 | 0.502 |
| t2i | E5-V [18] | 0.006 | 0.011 | 0.052 | 0.009 | 0.018 | 0.067 |
| t2i | PLIP [15] | 0.037 | 0.067 | 0.181 | 0.112 | 0.164 | 0.419 |
| t2i | MLLM4PUE-O | 0.114 | 0.173 | 0.385 | 0.193 | 0.280 | 0.600 |
| t2i | PathCLIP [37] | 0.236 | 0.348 | 0.630 | 0.137 | 0.196 | 0.445 |
| t2i | MLLM4PUE-P | 0.297 | 0.399 | 0.688 | 0.185 | 0.277 | 0.555 |
| t2i | QuiltNet [16] | 0.056 | 0.092 | 0.237 | 0.100 | 0.152 | 0.389 |
| t2i | MLLM4PUE-Q | 0.135 | 0.193 | 0.433 | 0.210 | 0.302 | 0.584 |
4 Experiments
Training datasets. We leverage three public datasets, Openpath [15], PathCap [37], and Quilt1M [16], to construct a robust training foundation of pathology image-text pairs. The Openpath dataset [15] is assembled by gathering data from Twitter and LAION using popular pathology hashtags. Since PLIP [15] provided only Twitter IDs, we retrieve the corresponding data from Twitter and LAION using these IDs, resulting in a dataset of 138,874 image-text pairs after denoising. The PathCap dataset [37] comprises 207K high-quality samples from authoritative sources, and we use all provided image-text samples. Similarly, the Quilt1M dataset [16] aggregates pathology image-text pairs from several public sources. As many images in Quilt1M are associated with multiple captions, we concatenate the captions for each image to create comprehensive entries. Through these careful preparations, we compile an extensive training dataset consisting of 593,838 image-text pairs.
Baseline Methods and Metrics. We evaluate the performance of our proposed MLLM4PUE against several recent baseline models across various downstream tasks. Specifically, we compare it with three pathology CLIP-based models: PLIP [15], PathCLIP [37], and QuiltNet [16], all of which are trained on the same dataset as our method. Additionally, we compare our approach with E5-V [18], a method that fine-tunes an LLM with summarization prompts and text-only data to extract embeddings, then integrates a vision module to obtain multimodal embeddings for zero-shot multimodal retrieval tasks. For zero-shot classification tasks, we also compare our method with CONCH [31], a model trained on over 1.17 million private image-caption pairs, whereas our method is trained on fewer than 0.6 million pairs.
To evaluate the performance of retrieval and composed retrieval tasks, we use the Recall@K metric, which quantifies the proportion of relevant items contained within the top K retrieved results. For classification tasks, due to the imbalanced distribution of classes, we use the weighted F1 (wF1) score to assess performance. The weighted F1 is calculated by averaging the F1 scores for each class, with each class’s score weighted by its frequency of occurrence.
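The two metrics can be sketched as follows; the Recall@K helper assumes the correct candidate for query i sits at index i of the candidate set, and the weighted F1 is computed with scikit-learn.

```python
# Sketch of the evaluation metrics: Recall@K over ranked candidates and weighted F1.
import numpy as np
from sklearn.metrics import f1_score

def recall_at_k(sim_matrix: np.ndarray, k: int) -> float:
    # sim_matrix[i, j] = similarity between query i and candidate j;
    # the ground-truth candidate for query i is assumed to be at index i.
    top_k = np.argsort(-sim_matrix, axis=1)[:, :k]
    hits = (top_k == np.arange(sim_matrix.shape[0])[:, None]).any(axis=1)
    return float(hits.mean())

def weighted_f1(y_true, y_pred) -> float:
    # Per-class F1 scores averaged with weights given by class frequency.
    return f1_score(y_true, y_pred, average="weighted")
```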
Implementation Details. Our framework is implemented in PyTorch, and we use LLaVA-NeXT-8B [24] as our backbone MLLM. To save GPU memory, we use QLoRA [10] and gradient checkpointing with DeepSpeed ZeRO-2 to fine-tune the MLLM, and all images are resized to 336 × 336 pixels. The temperature $\tau$ is set to 0.02. All experiments are run on six H100 GPUs with a gradient accumulation batch size of 144 and a learning rate of 4e-4.
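A hedged sketch of such a QLoRA setup with peft and bitsandbytes is shown below; the LoRA rank, target modules, and checkpoint identifier are illustrative assumptions, not reported hyperparameters.

```python
# Sketch: 4-bit quantized base model with a LoRA adapter and gradient checkpointing.
import torch
from transformers import BitsAndBytesConfig, LlavaNextForConditionalGeneration
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = LlavaNextForConditionalGeneration.from_pretrained(
    "llava-hf/llama3-llava-next-8b-hf",   # assumed backbone checkpoint
    quantization_config=bnb_config,
    device_map="auto",
)
model.gradient_checkpointing_enable()

lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,                 # illustrative values
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the LoRA parameters are trainable
```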
Table 3: Zero-shot patch classification results across the 13 classification tasks in PMEB; column order follows the classification tasks listed in Table 1.

| Model | Bach | CRC-100k | Databiox | Digestpath (Signet ring cell) | Digestpath (Colonoscopy) | Kathercolon | LC25000 (Colon) | LC25000 (Lung) | Osteo | Renalcell | Sicap | Skincancer | Wsss4luad |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| E5-V [18] | 0.100 | 0.162 | 0.197 | 0.049 | 0.224 | 0.206 | 0.333 | 0.209 | 0.300 | 0.051 | 0.230 | 0.017 | 0.132 |
| CONCH [31] | 0.511 | 0.413 | 0.835 | 0.824 | 0.680 | 0.625 | 0.890 | 0.650 | 0.382 | 0.393 | 0.359 | 0.593 | 0.772 |
| PLIP [15] | 0.124 | 0.194 | 0.696 | 0.252 | 0.503 | 0.504 | 0.697 | 0.743 | 0.554 | 0.413 | 0.080 | 0.311 | 0.394 |
| MLLM4PUE-O | 0.561 | 0.437 | 0.784 | 0.480 | 0.642 | 0.509 | 0.958 | 0.744 | 0.581 | 0.507 | 0.284 | 0.411 | 0.636 |
| PathCLIP [37] | 0.528 | 0.282 | 0.785 | 0.707 | 0.611 | 0.492 | 0.937 | 0.887 | 0.668 | 0.359 | 0.267 | 0.332 | 0.648 |
| MLLM4PUE-P | 0.577 | 0.388 | 0.799 | 0.900 | 0.763 | 0.783 | 0.937 | 0.886 | 0.751 | 0.543 | 0.502 | 0.504 | 0.754 |
| QuiltNet [16] | 0.403 | 0.318 | 0.491 | 0.564 | 0.564 | 0.591 | 0.884 | 0.801 | 0.578 | 0.435 | 0.322 | 0.467 | 0.364 |
| MLLM4PUE-Q | 0.437 | 0.434 | 0.868 | 0.851 | 0.686 | 0.668 | 0.886 | 0.866 | 0.613 | 0.516 | 0.440 | 0.507 | 0.758 |
| E5-V [18] | 0.251 | 0.324 | 0.367 | 0.169 | 0.299 | 0.306 | 0.500 | 0.348 | 0.385 | 0.166 | 0.337 | 0.053 | 0.267 |
| CONCH [31] | 0.559 | 0.444 | 0.832 | 0.802 | 0.703 | 0.644 | 0.890 | 0.665 | 0.439 | 0.447 | 0.405 | 0.607 | 0.783 |
| PLIP [15] | 0.251 | 0.306 | 0.708 | 0.253 | 0.532 | 0.541 | 0.721 | 0.749 | 0.551 | 0.488 | 0.179 | 0.290 | 0.450 |
| MLLM4PUE-O | 0.569 | 0.448 | 0.793 | 0.429 | 0.666 | 0.546 | 0.958 | 0.775 | 0.624 | 0.518 | 0.426 | 0.453 | 0.671 |
| PathCLIP [37] | 0.556 | 0.386 | 0.784 | 0.666 | 0.610 | 0.489 | 0.937 | 0.888 | 0.698 | 0.350 | 0.362 | 0.378 | 0.688 |
| MLLM4PUE-P | 0.599 | 0.386 | 0.812 | 0.892 | 0.767 | 0.784 | 0.937 | 0.886 | 0.744 | 0.576 | 0.509 | 0.496 | 0.760 |
| QuiltNet [16] | 0.441 | 0.413 | 0.628 | 0.503 | 0.607 | 0.586 | 0.904 | 0.802 | 0.588 | 0.434 | 0.338 | 0.458 | 0.449 |
| MLLM4PUE-Q | 0.479 | 0.441 | 0.866 | 0.849 | 0.695 | 0.661 | 0.887 | 0.872 | 0.609 | 0.494 | 0.444 | 0.520 | 0.761 |
5 Results
5.1 Zero-shot Cross-modal Retrieval
We first evaluate our methods on the retrieval task on the Arch-PubMed [12] and Arch-book [12] datasets, where the query is an image or text and the target is the corresponding paired text or image. The Recall@k metric is employed to assess the model's ability to retrieve relevant pathology image-text pairs among the top k results. As shown in Table 2, our models consistently outperform the baseline models, especially at higher recall values. Notably, in the image-to-text (i2t) task on the Arch-book dataset [12], our MLLM4PUE-O, trained on the Openpath dataset [15], achieves a Recall@5 of 0.185, an approximate 9% improvement over PLIP [15]. Moreover, our PathCap-pretrained model (MLLM4PUE-P) surpasses PathCLIP [37] on both benchmark datasets. A similar enhancement is observed in the text-to-image (t2i) retrieval task. For instance, on the Arch-PubMed dataset [12], MLLM4PUE-Q achieves a Recall@10 of 0.193, a 10% increase over QuiltNet [16]. These results collectively demonstrate that our framework captures multimodal embeddings more effectively than CLIP-based models trained on the same data. Furthermore, it is important to note that the E5-V model [18] yields significantly lower recall scores across all tasks and datasets due to the lack of fine-tuning with pathology-specific data.
Table 4: Zero-shot composed retrieval results on Quilt-VQA [35] and Quilt-VQA-RED [35].

| Model | Quilt-VQA Recall@5 | Quilt-VQA Recall@10 | Quilt-VQA Recall@50 | Quilt-VQA-RED Recall@5 | Quilt-VQA-RED Recall@10 | Quilt-VQA-RED Recall@50 |
|---|---|---|---|---|---|---|
| E5-V [18] | 0.214 | 0.323 | 0.495 | 0.351 | 0.441 | 0.668 |
| PLIP [15] | 0.191 | 0.224 | 0.349 | 0.250 | 0.286 | 0.500 |
| MLLM4PUE-O | 0.239 | 0.334 | 0.583 | 0.373 | 0.460 | 0.790 |
| PathCLIP [37] | 0.192 | 0.242 | 0.413 | 0.262 | 0.337 | 0.603 |
| MLLM4PUE-P | 0.218 | 0.329 | 0.541 | 0.363 | 0.468 | 0.790 |
| QuiltNet [16] | 0.180 | 0.209 | 0.336 | 0.222 | 0.274 | 0.460 |
| MLLM4PUE-Q | 0.475 | 0.598 | 0.823 | 0.675 | 0.754 | 0.948 |
5.2 Zero-shot Patch Classification
We then evaluate the performance of zero-shot classification on patch-level pathology images across 13 classification tasks from our PMEB. In this downstream task, each query is an image, and the target is an extended sentence of the class name as described in Section 3.4. Table 3 shows that our MLLM4PUE consistently surpasses baseline models on most datasets. Notably, with the same training data, MLLM4PUE-O outperforms PLIP [15] across all datasets. Similarly, MLLM4PUE-Q consistently outperforms QuiltNet [16], demonstrating the effectiveness of our pretraining strategy on the same training dataset. While CONCH [31] attains the highest score on Skincancer [22] and Wsss4luad [14], its superior performance stems from utilizing a much larger private dataset (1.17M image-caption pairs). In contrast, our model, MLLM4PUE-P, outperforms all baseline models across seven datasets with only 0.2 million training samples, and is only 0.02 below CONCH [31] on the Wsss4luad dataset [14]. This underscores the advantages of leveraging vision-language alignment in pathology representation learning. These findings collectively highlight the effectiveness of our approach, especially considering that our models were trained on fewer samples, further demonstrating their data efficiency in pathology-specific zero-shot classification. The accuracy metric shows a similar trend, with MLLM4PUE-P outperforming the baselines on most datasets.
5.3 Zero-shot Composed Retrieval
We further evaluate the performance of composed retrieval on two datasets, Quilt-VQA [35] and Quilt-VQA-RED [35], where the task involves using an image and a question as the query to retrieve the correct answer. For our model and E5-V [18], we apply the prompt <image>\n <question>\n Summarize the above H&E image and question in one word: to integrate the image and the question. For the other three CLIP-based baseline models, we add the image and question embeddings directly, since these models handle images and text separately. The Recall@k metric is employed to assess how well the model can identify the correct answer within the top k candidates. As shown in Table 4, our proposed MLLM4PUE achieves better results than the baseline methods. Specifically, on the Quilt-VQA dataset [35], MLLM4PUE improves Recall@5 by approximately 4%, 3%, and 30% over PLIP [15], PathCLIP [37], and QuiltNet [16], respectively. Similarly, on the Quilt-VQA-RED dataset [35], MLLM4PUE-Q, pre-trained on Quilt1M [16], achieves a Recall@5 of 0.675, the best performance. Interestingly, the E5-V model [18] outperforms all CLIP-based models despite lacking fine-tuning on pathology-specific data. This demonstrates the robust multimodal fusion capacity of MLLM-based embeddings and their proficiency in handling diverse data inputs.
5.4 Visualization of Modality Gap
To illustrate the modality gap, we employ t-SNE visualization to compare the embeddings of our MLLM4PUE-P model with those of PathCLIP [37], both trained on the PathCap dataset [37]. As shown in Fig. 4, the qualitative analysis clearly shows that the embeddings produced by MLLM4PUE-P form tighter clusters in the low-dimensional space, whereas those from PathCLIP [37] are more dispersed. This difference reflects a significant improvement in the cross-modal alignment of our method.

To further quantify the modality gap, we adopt the modality gap metric proposed by Liang et al. [26]. Our experiments reveal that our method reduces the gap from 0.9284 to 0.2587 on the Arch-PubMed dataset [12], and from 1.0408 to 0.3739 on the Arch-book dataset [12]. These results provide strong evidence that our model can significantly narrow the gap between image and text embeddings.
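For reference, the modality gap of Liang et al. [26] can be sketched as the Euclidean distance between the centroids of the L2-normalized image and text embeddings; the helper name below is illustrative.

```python
# Sketch: modality gap as the distance between normalized image and text centroids.
import torch
import torch.nn.functional as F

def modality_gap(img_emb: torch.Tensor, txt_emb: torch.Tensor) -> float:
    img_center = F.normalize(img_emb, dim=-1).mean(dim=0)
    txt_center = F.normalize(txt_emb, dim=-1).mean(dim=0)
    return (img_center - txt_center).norm().item()
```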
5.5 Ablation Studies
Table 5: Ablation on embedding fusion methods for zero-shot composed retrieval.

| Model | Fusion | Quilt-VQA [35] Recall@5 | Recall@10 | Recall@50 | Quilt-VQA-RED [35] Recall@5 | Recall@10 | Recall@50 |
|---|---|---|---|---|---|---|---|
| QuiltNet [16] | only question | 0.170 | 0.189 | 0.275 | 0.187 | 0.234 | 0.440 |
| QuiltNet [16] | add embedding | 0.180 | 0.209 | 0.336 | 0.222 | 0.274 | 0.460 |
| MLLM4PUE | only question | 0.291 | 0.362 | 0.539 | 0.429 | 0.488 | 0.694 |
| MLLM4PUE | add embedding | 0.394 | 0.500 | 0.746 | 0.520 | 0.619 | 0.909 |
| MLLM4PUE | prompt | 0.475 | 0.598 | 0.823 | 0.675 | 0.754 | 0.948 |
Robustness of PMEB. Since some datasets in our PMEB are very large, we choose a random sample of 1,000. To assess the robustness of our benchmark, we conduct an ablation study by computing the wF1 scores using 1,000, 3,000, 5,000, and all available samples for each dataset. As shown in Fig. 5, the performance change between 1,000 samples and the full dataset remains within a small margin across all datasets, confirming that using only 1,000 samples provides a reliable approximation of the complete dataset's performance. Given that the performance difference is typically within 0.01 to 0.04 across different sample sizes, using 1,000 samples significantly reduces computational costs while maintaining reliable performance. This makes PMEB highly efficient for large-scale experiments without sacrificing meaningful evaluation. We also report the accuracy metric and the results for the CONCH [31] and PathCLIP [37] models in the Supplementary Materials, which show the same trend.

Embedding Fusion Methods. This experiment is designed to validate the quality of our benchmark and to highlight the effectiveness of MLLM4PUE in modality fusion compared to CLIP-based models on composed retrieval tasks. As shown in Table 5, MLLM4PUE outperforms QuiltNet [16] in every setting. First, comparing the “only question” results, our MLLM4PUE shows a significant improvement over QuiltNet [16], while both models gain performance when incorporating visual information via “add embedding”. This consistent pattern across both models confirms that our benchmark has been properly curated, requiring true integration of visual and textual information rather than allowing models to infer answers solely from the questions.
Second, the results reveal that QuiltNet [16], a CLIP-based model, shows little difference between using only the question text and adding the image and text embeddings. This indicates that CLIP’s method lacks robust modality integration, as the inclusion of visual information fails to significantly enhance its effectiveness. Conversely, our model MLLM4PUE demonstrates substantial improvements by effectively fusing image and text information. The strength of our approach becomes most evident with the prompt-based fusion, which leads to significantly higher recall scores. Specifically, it achieves Recall@5 values of 0.475 on Quilt-VQA and 0.675 on Quilt-VQA-RED, representing approximately 8% and 15% improvements over the simple addition method, respectively, and 18% and 25% improvements over using the question text alone. These findings indicate that our model’s ability to integrate visual and textual modalities yields significant gains in retrieval accuracy, surpassing both simple modality addition and the question-only configuration, and highlighting the effectiveness of our modality fusion technique for composed retrieval tasks. Our approach provides advantages that cannot be achieved through separate handling or limited integration of text and image data.
6 Conclusion
In this paper, we have explored a novel paradigm for applying MLLMs to generate universal multimodal embeddings for a broad range of pathology downstream tasks. By applying summarization prompts and inserting lightweight LoRA modules, our proposed framework, MLLM4PUE, effectively integrates image and text data and addresses the limitations of previous CLIP-based methods. We also introduce PMEB, which provides a standardized and comprehensive benchmark for evaluating multimodal capabilities, promoting reproducibility and comparability in the field. Comprehensive experimental results across 15 diverse zero-shot retrieval and classification tasks underscore the versatility and effectiveness of our method. We anticipate that our framework will provide valuable insights into pathology vision-language foundation models and inspire further research in this area.
References
- Achiam et al. [2023] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
- Aresta et al. [2019] Guilherme Aresta, Teresa Araújo, Scotty Kwok, Sai Saketh Chennamsetty, Mohammed Safwan, Varghese Alex, Bahram Marami, Marcel Prastawa, Monica Chan, Michael Donovan, et al. Bach: Grand challenge on breast cancer histology images. Medical image analysis, 56:122–139, 2019.
- Arunachalam et al. [2019] Harish Babu Arunachalam, Rashika Mishra, Ovidiu Daescu, Kevin Cederberg, Dinesh Rakheja, Anita Sengupta, David Leonard, Rami Hallac, and Patrick Leavey. Viable and necrotic tumor assessment from whole slide images of osteosarcoma using machine-learning and deep-learning models. PloS one, 14(4):e0210706, 2019.
- Borkowski et al. [2019] Andrew A Borkowski, Marilyn M Bui, L Brannon Thomas, Catherine P Wilson, Lauren A DeLand, and Stephen M Mastorides. Lung and colon cancer histopathological image dataset (lc25000). arXiv preprint arXiv:1912.12142, 2019.
- Brummer et al. [2022] Otso Brummer, Petri Pölönen, Satu Mustjoki, and Oscar Brück. Integrative analysis of histological textures and lymphocyte infiltration in renal cell carcinoma using deep learning. bioRxiv, pages 2022–08, 2022.
- Chen et al. [2024] Pingyi Chen, Honglin Li, Chenglu Zhu, Sunyi Zheng, Zhongyi Shui, and Lin Yang. WsiCaption: Multiple instance generation of pathology reports for gigapixel whole-slide images. In Proceedings of Medical Image Computing and Computer Assisted Intervention – MICCAI 2024. Springer Nature Switzerland, 2024.
- Chen et al. [2025] Pingyi Chen, Chenglu Zhu, Sunyi Zheng, Honglin Li, and Lin Yang. Wsi-vqa: Interpreting whole slide images by generative visual question answering. In European Conference on Computer Vision, pages 401–417. Springer, 2025.
- Coudray et al. [2018] Nicolas Coudray, Paolo Santiago Ocampo, Theodore Sakellaropoulos, Navneet Narula, Matija Snuderl, David Fenyö, Andre L Moreira, Narges Razavian, and Aristotelis Tsirigos. Classification and mutation prediction from non–small cell lung cancer histopathology images using deep learning. Nature medicine, 24(10):1559–1567, 2018.
- Da et al. [2022] Qian Da, Xiaodi Huang, Zhongyu Li, Yanfei Zuo, Chenbin Zhang, Jingxin Liu, Wen Chen, Jiahui Li, Dou Xu, Zhiqiang Hu, et al. Digestpath: A benchmark dataset with challenge review for the pathological detection and segmentation of digestive-system. Medical Image Analysis, 80:102485, 2022.
- Dettmers et al. [2024] Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. Qlora: Efficient finetuning of quantized llms. Advances in Neural Information Processing Systems, 36, 2024.
- Erickson et al. [2022] Lori A Erickson, Ozgur Mete, C Christofer Juhlin, Aurel Perren, and Anthony J Gill. Overview of the 2022 who classification of parathyroid tumors. Endocrine Pathology, 33(1):64–89, 2022.
- Gamper and Rajpoot [2021] Jevgenij Gamper and Nasir Rajpoot. Multiple instance captioning: Learning representations from histopathology textbooks and articles. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 16549–16559, 2021.
- Guo et al. [2024] Zhengrui Guo, Jiabo Ma, Yingxue Xu, Yihui Wang, Liansheng Wang, and Hao Chen. Histgen: Histopathology report generation via local-global feature encoding and cross-modal context interaction. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 189–199. Springer, 2024.
- Han et al. [2022] Chu Han, Xipeng Pan, Lixu Yan, Huan Lin, Bingbing Li, Su Yao, Shanshan Lv, Zhenwei Shi, Jinhai Mai, Jiatai Lin, et al. Wsss4luad: Grand challenge on weakly-supervised tissue semantic segmentation for lung adenocarcinoma. arXiv preprint arXiv:2204.06455, 2022.
- Huang et al. [2023] Zhi Huang, Federico Bianchi, Mert Yuksekgonul, Thomas J Montine, and James Zou. A visual–language foundation model for pathology image analysis using medical twitter. Nature medicine, 29(9):2307–2316, 2023.
- Ikezogwo et al. [2024] Wisdom Ikezogwo, Saygin Seyfioglu, Fatemeh Ghezloo, Dylan Geva, Fatwir Sheikh Mohammed, Pavan Kumar Anand, Ranjay Krishna, and Linda Shapiro. Quilt-1m: One million image-text pairs for histopathology. Advances in neural information processing systems, 36, 2024.
- Javed et al. [2024] Sajid Javed, Arif Mahmood, Iyyakutti Iyappan Ganapathi, Fayaz Ali Dharejo, Naoufel Werghi, and Mohammed Bennamoun. Cplip: Zero-shot learning for histopathology with comprehensive vision-language alignment. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11450–11459, 2024.
- Jiang et al. [2024] Ting Jiang, Minghui Song, Zihan Zhang, Haizhen Huang, Weiwei Deng, Feng Sun, Qi Zhang, Deqing Wang, and Fuzhen Zhuang. E5-v: Universal embeddings with multimodal large language models. arXiv preprint arXiv:2407.12580, 2024.
- Kather et al. [2018] Jakob Nikolas Kather, Niels Halama, and Alexander Marx. 100,000 histological images of human colorectal cancer and healthy tissue. Zenodo10, 5281(9), 2018.
- Kather et al. [2019] Jakob Nikolas Kather, Johannes Krisam, Pornpimol Charoentong, Tom Luedde, Esther Herpel, Cleo-Aron Weis, Timo Gaiser, Alexander Marx, Nektarios A Valous, Dyke Ferber, et al. Predicting survival from colorectal cancer histology slides using deep learning: A retrospective multicenter study. PLoS medicine, 16(1):e1002730, 2019.
- Khaliliboroujeni et al. [2022] Sepideh Khaliliboroujeni, Xiangjian He, Wenjing Jia, and Saeed Amirgholipour. End-to-end metastasis detection of breast cancer from histopathology whole slide images. Computerized Medical Imaging and Graphics, 102:102136, 2022.
- Kriegsmann et al. [2022] Katharina Kriegsmann, Frithjof Lobers, Christiane Zgorzelski, Jörg Kriegsmann, Charlotte Janßen, Rolf Rüdinger Meliß, Thomas Muley, Ulrich Sack, Georg Steinbuss, and Mark Kriegsmann. Deep learning for the detection of anatomical tissue structures and neoplasms of the skin on scanned histopathological tissue sections. Frontiers in Oncology, 12:1022967, 2022.
- Kundra et al. [2021] Ritika Kundra, Hongxin Zhang, Robert Sheridan, Sahussapont Joseph Sirintrapun, Avery Wang, Angelica Ochoa, Manda Wilson, Benjamin Gross, Yichao Sun, Ramyasree Madupuri, et al. Oncotree: a cancer classification system for precision oncology. JCO clinical cancer informatics, 5:221–230, 2021.
- Li et al. [2024] Feng Li, Renrui Zhang, Hao Zhang, Yuanhan Zhang, Bo Li, Wei Li, Zejun Ma, and Chunyuan Li. Llava-next-interleave: Tackling multi-image, video, and 3d in large multimodal models. arXiv preprint arXiv:2407.07895, 2024.
- Li et al. [2022] Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In International conference on machine learning, pages 12888–12900. PMLR, 2022.
- Liang et al. [2022] Victor Weixin Liang, Yuhui Zhang, Yongchan Kwon, Serena Yeung, and James Y Zou. Mind the gap: Understanding the modality gap in multi-modal contrastive representation learning. Advances in Neural Information Processing Systems, 35:17612–17625, 2022.
- Liu et al. [2024a] Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26296–26306, 2024a.
- Liu et al. [2024b] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in neural information processing systems, 36, 2024b.
- Lu et al. [2021] Ming Y Lu, Drew FK Williamson, Tiffany Y Chen, Richard J Chen, Matteo Barbieri, and Faisal Mahmood. Data-efficient and weakly supervised computational pathology on whole-slide images. Nature biomedical engineering, 5(6):555–570, 2021.
- Lu et al. [2023] Ming Y Lu, Bowen Chen, Andrew Zhang, Drew FK Williamson, Richard J Chen, Tong Ding, Long Phi Le, Yung-Sung Chuang, and Faisal Mahmood. Visual language pretrained multiple instance zero-shot transfer for histopathology images. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 19764–19775, 2023.
- Lu et al. [2024] Ming Y Lu, Bowen Chen, Drew FK Williamson, Richard J Chen, Ivy Liang, Tong Ding, Guillaume Jaume, Igor Odintsov, Long Phi Le, Georg Gerber, et al. A visual-language foundation model for computational pathology. Nature Medicine, 30(3):863–874, 2024.
- Ma et al. [2024] Jiabo Ma, Zhengrui Guo, Fengtao Zhou, Yihui Wang, Yingxue Xu, Yu Cai, Zhengjie Zhu, Cheng Jin, Yi Lin, Xinrui Jiang, Anjia Han, et al. Towards a generalizable pathology foundation model via unified knowledge distillation. arXiv preprint arXiv:2407.18449, 2024.
- Qu et al. [2021] Hui Qu, Mu Zhou, Zhennan Yan, He Wang, Vinod K Rustgi, Shaoting Zhang, Olivier Gevaert, and Dimitris N Metaxas. Genetic mutation and biological pathway prediction based on whole slide images in breast carcinoma using deep learning. NPJ precision oncology, 5(1):87, 2021.
- Radford et al. [2021] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
- Seyfioglu et al. [2024] Mehmet Saygin Seyfioglu, Wisdom O Ikezogwo, Fatemeh Ghezloo, Ranjay Krishna, and Linda Shapiro. Quilt-llava: Visual instruction tuning by extracting localized narratives from open-source histopathology videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13183–13192, 2024.
- Silva-Rodríguez et al. [2020] Julio Silva-Rodríguez, Adrián Colomer, María A Sales, Rafael Molina, and Valery Naranjo. Going deeper through the gleason scoring scale: An automatic end-to-end system for histology prostate grading and cribriform pattern detection. Computer methods and programs in biomedicine, 195:105637, 2020.
- Sun et al. [2024a] Yuxuan Sun, Chenglu Zhu, Sunyi Zheng, Kai Zhang, Lin Sun, Zhongyi Shui, Yunlong Zhang, Honglin Li, and Lin Yang. Pathasst: A generative foundation ai assistant towards artificial general intelligence of pathology. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 5034–5042, 2024a.
- Sun et al. [2024b] Yuxuan Sun, Chenglu Zhu, Sunyi Zheng, Yunlong Zhang, Honglin Li, and Lin Yang. Context-aware text-assisted multimodal framework for cervical cytology cell diagnosis and chatting. In 2024 IEEE International Conference on Multimedia and Expo (ICME), pages 1–6. IEEE, 2024b.
- Wang et al. [2024] Zhikang Wang, Jiani Ma, Qian Gao, Chris Bain, Seiya Imoto, Pietro Liò, Hongmin Cai, Hao Chen, and Jiangning Song. Dual-stream multi-dependency graph neural network enables precise cancer survival analysis. Medical Image Analysis, 97:103252, 2024.
- Xiong et al. [2024] Conghao Xiong, Hao Chen, Hao Zheng, Dong Wei, Yefeng Zheng, Joseph JY Sung, and Irwin King. Mome: Mixture of multimodal experts for cancer survival prediction. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 318–328. Springer, 2024.
- Xu et al. [2024] Yingxue Xu, Yihui Wang, Fengtao Zhou, Jiabo Ma, Shu Yang, Huangjing Lin, Xin Wang, Jiguang Wang, Li Liang, Anjia Han, et al. A multimodal knowledge-enhanced whole-slide pathology foundation model. arXiv preprint arXiv:2407.15362, 2024.
- Yu et al. [2022] Jiahui Yu, Zirui Wang, Vijay Vasudevan, Legg Yeung, Mojtaba Seyedhosseini, and Yonghui Wu. Coca: Contrastive captioners are image-text foundation models. arXiv preprint arXiv:2205.01917, 2022.
- Zhang et al. [2023] Shichuan Zhang, Sunyi Zheng, Zhongyi Shui, Honglin Li, and Lin Yang. Multi-modal learning with missing modality in predicting axillary lymph node metastasis. In 2023 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pages 2395–2400. IEEE, 2023.
- Zhou et al. [2024] Huajun Zhou, Fengtao Zhou, and Hao Chen. Cohort-individual cooperative learning for multimodal cancer survival analysis. arXiv preprint arXiv:2404.02394, 2024.