
MM-Retinal V2: Transfer an Elite Knowledge Spark into Fundus Vision-Language Pretraining

Ruiqi Wu, Na Su, Chenran Zhang, Tengfei Ma, Tao Zhou, Zhiting Cui, Nianfeng Tang, Tianyu Mao,
Yi Zhou, Wen Fan, Tianxing Wu, Shenqi Jing, Huazhu Fu
R. Wu, C. Zhang, T. Ma, Y. Zhou, and T. Wu are with the School of Computer Science and Engineering, Southeast University, Nanjing, China. N. Su, Z. Cui, N. Tang, T. Mao, W. Fan, and S. Jing are with the Department of Ophthalmology, The First Affiliated Hospital of Nanjing Medical University, Nanjing, China. T. Zhou is with the School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, China. H. Fu is with the Institute of High-Performance Computing, Agency for Science, Technology and Research, Singapore.
Abstract

Vision-language pretraining (VLP) has been investigated to generalize across diverse downstream tasks for fundus image analysis. Although recent methods showcase promising achievements, they rely heavily on large-scale private image-text data but pay less attention to the pretraining manner, which limits their further advancement. In this work, we introduce MM-Retinal V2, a high-quality image-text paired dataset comprising the CFP, FFA, and OCT image modalities. Then, we propose a novel fundus vision-language pretraining model, namely KeepFIT V2, which is pretrained by integrating knowledge from this elite data spark into categorical public datasets. Specifically, a preliminary textual pretraining step equips the text encoder with primarily ophthalmic textual knowledge. Moreover, a hybrid image-text knowledge injection module is designed for knowledge transfer, built on a combination of global semantic concepts from contrastive learning and local appearance details from generative learning. Extensive experiments across zero-shot, few-shot, and linear probing settings highlight the generalization and transferability of KeepFIT V2, delivering performance competitive with state-of-the-art fundus VLP models trained on large-scale private image-text datasets. Our dataset and model are publicly available at https://github.com/lxirich/MM-Retinal.

Index Terms:
Fundus image analysis, multi-modality, knowledge-enhanced vision-language pretraining.
These authors contributed equally to this work. Corresponding authors: Yi Zhou (yizhou.szcn@gmail.com) and Wen Fan.

I Introduction

Fundus imaging serves as a pivotal tool for the examination and diagnosis of ocular diseases. Traditional fundus image analysis models [13][38][27][45] are usually tailored to specific diseases through categorically-labeled training without knowledge integration. These models tend to exhibit limited generalization and transferability. With the considerable advancements in vision-language pretraining, increasing efforts have been put toward the development of fundus foundation models[7][40][37], aiming to build a model capable of diagnosing a wide range of fundus diseases.

To pretrain fundus foundation models, image-text paired data serve as a critical basis, enabling models to learn aligned vision-language representations that capture the features of numerous ocular diseases. However, fundus image-text data are highly scarce. Although a few retinal foundation models[22][29][39] have been proposed, the training data they utilized are not released, and most existing works focus on only a single modality, especially color fundus photography. In real-world clinical diagnosis, different fundus imaging modalities, such as color fundus photography (CFP), fundus fluorescein angiography (FFA), and optical coherence tomography (OCT), are equally important, but collecting large-scale image-text paired data for all fundus modalities is difficult. Therefore, such limited availability of image-text pairs constrains the development and application of fundus foundation models.

Early fundus foundation models are trained without image-text datasets. RETFound [46] is trained exclusively on unlabeled images. FLAIR[33] attempts to expand the categorical labels of public datasets using templates to generate image-text data. However, both exhibit limited performance. Because of the deficiency of public fundus image-text paired data, recent works[7][41][37][31] tend to focus on building fundus foundation models by collecting large-scale private image-text pairs. Despite achieving some success, the inherent constraints of these approaches are evident. On the one hand, most of them perform vision-language pretraining in a brute-force way and rarely study the learning manner itself. On the other hand, the majority of these models are tailored to the CFP modality and trained on private datasets rather than making full use of the public datasets accumulated over the past decades, thereby limiting their contribution to the research community.

To address the above problems, we first construct MM-Retinal V2, a high-quality public image-text dataset comprising CFP, FFA, and OCT modalities with around 5K pairs for each modality, covering over 96 fundus diseases and abnormalities. Enabled by combining MM-Retinal V2 with existing public categorically-labeled datasets, we propose KeepFIT V2, an effective fundus vision-language pretraining method that requires only limited image-text data resources. The key idea of KeepFIT V2 consists of a preliminary textual knowledge pretraining and a hybrid image-text knowledge injection. In particular, the latter module is realized through a hybrid visual feature matching method that combines high-level semantic features derived from contrastive learning with low-level appearance features from generative learning. The two matching schemes complementarily attend to global semantic concepts and local appearance details when referencing knowledge from MM-Retinal V2. Finally, an expert knowledge refinement loss is proposed to complete the knowledge injection. Through this training manner, KeepFIT V2 achieves competitive performance with foundation models pretrained on large-scale private image-text datasets, by leveraging only a small amount of elite image-text data as a knowledge spark. It transfers professional fundus knowledge from MM-Retinal V2 into public datasets that only have categorical labels, and enhances feature alignment and learning during pretraining.

Compared to the previous work MM-Retinal V1 [40], this paper makes extensions in three major aspects.

  • For the aspect of dataset construction: MM-Retinal V2 significantly expands the CFP and FFA modalities to more than 5K image-text pairs each. We also introduce a new OCT modality at a 5K data scale, since the OCT data in [40] are negligible. Moreover, an MM-Retinal-Text subset drawn primarily from the ophthalmology domain is added for text pretraining.

  • For the aspect of model design: First, compared to KeepFIT V1, preliminary textual knowledge pretraining is employed to enhance the KeepFIT V2 text encoder so that it better encodes retinal knowledge. More importantly, we propose a new hybrid image-text knowledge injection method that combines semantic representation from contrastive learning and appearance representation from generative learning, leading to stronger knowledge transfer. This highlights an effective approach for pretraining a retinal foundation model with limited data resources, delivering performance competitive with models trained on large-scale private image-text datasets.

  • For the aspect of experiment study: Compared to [40], more comprehensive and solid experimental evaluations have been conducted, covering various experimental settings, comparisons with state-of-the-art methods, ablation studies, and additional analyses. These show that KeepFIT V2 achieves significant improvements. Moreover, experimental results for the new OCT modality are presented in this work.

II Related Work

II-A Retinal Datasets for Disease Diagnosis

With the growing interest in retinal image diagnosis models, efforts have been made to construct retinal datasets, which can be broadly categorized into two types. (1) Unimodal categorically-labeled datasets[6][28][25] are the most prevalent, focusing on disease-specific tasks such as diabetic retinopathy, glaucoma, AMD, and pathologic myopia. These datasets primarily serve as specialized datasets for training disease-specific models. While valuable for image-level classification, they provide limited textual descriptions of the images, restricting their applicability to foundation model pretraining. (2) Image-text paired datasets are crucial for pretraining vision-language models. However, such datasets are scarce due to the challenges of acquiring large-scale paired data. Most existing image-text paired datasets[7][41][37] are private and limited in imaging modality, failing to comprehensively support vision-language research. To address this gap, we publicly release the MM-Retinal V2 dataset, which provides high-quality image-text pairs covering the CFP, FFA, and OCT modalities, as detailed in Section III.

II-B Vision-Language Pre-training

Vision-language pre-training aims to build relationships between images and texts. There are two mainstream approaches. The first category leverages Transformer-based multi-modal encoders to model the interaction between images and texts[5][19][23][15]. The second category employs unimodal encoders for images and texts and utilizes contrastive learning to align their representations[30][18][20]. In the biomedical domain, PubMedCLIP[8], BiomedCLIP[43], and BioViL[4] have been proposed as generalist foundation models. Nevertheless, the broad diversity of their training data impedes their capability to excel in specialized domains such as fundus imaging.

II-C Retinal Foundational Models

In addition to generalist biomedical VLP, the rapid advancement of deep learning has produced numerous retinal foundation models. RETFound[46] learns from unlabeled retinal images in a self-supervised paradigm. FLAIR[33] utilizes 37 categorical public datasets with textual prompts for foundation model pre-training. However, these works lack image-text paired training data, resulting in limited performance. Recently, RET-CLIP[7] proposed a binocular pretraining model and ViLRef[41] presented a Chinese vision-language retinal pretraining model; these two models were trained on over 190K and 450K private clinical retinal images and diagnostic reports, respectively. VisionUnite[22] is tuned on a large private multi-modal fundus dataset that includes over 296K private image-text pairs. RetiZero[37] develops a private image-text dataset from three sources with ophthalmologists' manual data collection and cleaning. Nonetheless, imaging modalities such as FFA and OCT are also prevalent and of critical importance in real-world clinical practice. All the aforementioned VLP models only support the single CFP modality, and their pre-training datasets are not publicly available. Although some fundus foundation models are designed for multiple imaging modalities[31][29], they have not been publicly released. Therefore, in this work, in addition to releasing MM-Retinal V2, we also publish pre-trained foundation models for the CFP, FFA, and OCT modalities. We aim to provide a method for training VLP models that combines small-scale, high-quality image-text paired data with public categorical data in a knowledge-enhanced learning manner via hybrid knowledge injection.

III MM-Retinal V2 Dataset

As mentioned in Section II, the lack of high-quality image-text paired retinal datasets limits the rapid development of foundation models for fundus imaging. Therefore, we curated MM-Retinal V2, a multi-modal dataset comprising retinal image-text pairs in the CFP, FFA, and OCT modalities, sourced from retinal diagram books and professional assessments provided by ophthalmology experts. Meanwhile, to enhance the pretraining of the text encoder with comprehensive medical knowledge, particularly in ophthalmology, we constructed a fundus-centric text-only subset named MM-Retinal-Text.

III-A Dataset Construction

III-A1 Image-Text Pairs of CFP and FFA Modalities


Figure 1: The construction pipeline of MM-Retinal V2 and randomly selected examples from MM-Retinal V2. For CFP and FFA modalities, we propose a semi-automated method consisting of image-text pair collection, image-text alignment, and modality classification. For OCT modality, experienced ophthalmologists are invited for assessment. Each image is annotated twice and finally validated by senior ophthalmologists. To primarily enrich the text encoder with extensive ophthalmic knowledge, we also constructed a text-only subset from fundus diagram books, ophthalmology textbooks, and public datasets for medical LLMs pretraining.

All image-text pairs in the CFP and FFA modalities were sourced from four fundus diagram books. As depicted in the upper left of Fig. 1, a semi-automated three-stage pipeline was implemented to construct the dataset. First, we manually captured all image-text pairs from the diagram books, ensuring each pair was recorded in a single screenshot with an image resolution of no less than 800×800 pixels. These screenshots were then processed using a custom program that integrates Adobe tools for image extraction and OCR technology for text extraction. Then, given the challenge of aligning extracted images and texts, especially when multiple sub-figures share a single caption, we applied regular expressions to achieve precise caption separation. Finally, the images were classified into the CFP and FFA modalities using a color histogram-based method. We manually corrected OCR recognition errors and translated the text into English and Chinese reports to ensure language consistency. To further expand our dataset, we refined the DEN dataset[14] through a systematic filtering and cleaning process. This involved manually removing images that did not belong to the CFP and FFA modalities, as well as excluding collages composed of multiple images. Subsequently, the color histogram-based classification method was used to automatically separate the two modalities, and the label files were reorganized accordingly.
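The modality split in this pipeline relies on a simple color cue: FFA frames are essentially grayscale, whereas CFP images are strongly colored. The snippet below is a minimal sketch of such a rule; the saturation statistic and the threshold are illustrative assumptions, not the exact histogram criterion used to build MM-Retinal V2.

```python
# Hypothetical color-statistics rule for separating CFP from FFA screenshots.
import numpy as np
from PIL import Image

def classify_modality(path: str, sat_threshold: float = 0.08) -> str:
    """Return 'CFP' for colorful fundus photographs, 'FFA' for near-grayscale angiograms."""
    img = np.asarray(Image.open(path).convert("RGB"), dtype=np.float32) / 255.0
    r, g, b = img[..., 0], img[..., 1], img[..., 2]
    # Per-pixel spread between color channels approximates saturation;
    # grayscale-like FFA frames have almost identical R, G, and B histograms.
    spread = np.maximum(np.maximum(r, g), b) - np.minimum(np.minimum(r, g), b)
    return "CFP" if float(spread.mean()) > sat_threshold else "FFA"
```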

III-A2 Image-Text Pairs of OCT Modality

Due to the scarcity of electronic OCT diagram books with rich image-text paired data, we collaborated with a key provincial hospital to establish the OCT modality part of MM-Retinal V2. We initially collected 5,587 OCT images from 3,403 patients. To control data quality, images that could not be diagnosed solely based on OCT were excluded. An ophthalmologist provided detailed definitions for the full range of abnormalities observable in OCT images. Then, each image was independently assessed by two ophthalmologists. Two senior ophthalmologists reviewed all captions and provided final verification and decisions, ensuring the correctness and consistency of the dataset. As a result, each OCT image is accompanied by a detailed textual description capturing both the diagnosed diseases and the pathological features observed in the image.

III-A3 Ophthalmic Knowledge Texts

When diagnosing ocular diseases, ophthalmologists usually rely not only on ophthalmic images but also on their knowledge of ophthalmology and other medical specialties. To support this, we also constructed the MM-Retinal-Text subset, which primarily integrates knowledge from the ophthalmic field. The textual data were sourced from three main origins: (1) four fundus diagram books, (2) three ophthalmology textbooks, and (3) twelve public datasets for medical LLM pretraining. Texts from books were digitized using OCR, with irrelevant elements such as names and figure sequence numbers removed. For the public datasets used in medical LLM pretraining, which were originally designed for comprehensive medical areas, we filtered them using ophthalmology-related keywords derived from the contents of the diagram books. Although this filtering ensures a focus on ophthalmic knowledge, it may also retain content from other medical specialties. We opted not to clean this data further, as a certain proportion of knowledge from other fields can improve the model's capability to generalize on common medical terms. The detailed experimental results are presented in Table VII. More details on these datasets can be found on our project page.
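A minimal sketch of the keyword filtering step is shown below; the keyword list is a small hypothetical sample rather than the full vocabulary mined from the diagram books.

```python
# Keep only ophthalmology-related utterances from a general medical LLM corpus.
OPHTH_KEYWORDS = {
    "retina", "macula", "fundus", "optic disc", "choroid", "fluorescein",
    "glaucoma", "diabetic retinopathy", "drusen", "vitreous",
}

def keep_utterance(text: str, keywords=OPHTH_KEYWORDS) -> bool:
    lowered = text.lower()
    return any(k in lowered for k in keywords)

corpus = [
    "Drusen beneath the retinal pigment epithelium are an early sign of AMD.",
    "Left ventricular ejection fraction was within normal limits.",
]
filtered = [t for t in corpus if keep_utterance(t)]  # keeps only the first sentence
```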

III-B Dataset Statistics


Figure 2: Statistical overview of MM-Retinal V2. (a) highlights a part of the most frequently occurring terms in the CFP, FFA, and OCT modalities. (b) plots the distribution of caption length in three modalities, respectively. (c) illustrates the comparison with related public fundus datasets in the aspect of vocabulary. The average vocabulary size refers to the total number of unique words throughout all captions in the dataset, while lexical diversity measures the average number of unique words used in each individual caption.

Building upon MM-Retinal V1 [40], MM-Retinal V2 consists of 6,720 CFP cases, 5,119 FFA cases, and 5,502 OCT cases, each containing an image paired with corresponding descriptions in both Chinese and English. The dataset is further enriched by a text-only subset with a total of 452K utterances, providing a substantial repository of textual information.

III-B1 Diverse Modalities

MM-Retinal V2 is the first high-quality dataset to simultaneously encompass image-text pairs of CFP, FFA, and OCT modalities with ophthalmic text data. In the diagnostic process, different imaging modalities offer unique perspectives. CFP highlights the structure of the fundus, FFA captures vascular changes, while OCT reveals details of the retinal layers. By integrating these modalities with high-quality textual descriptions, MM-Retinal V2 provides a solid base for the development of advanced retinal foundational models.

III-B2 Comprehensive Categories

As the MM-Retinal V2 dataset is derived from comprehensive ocular diagram books and clinical sources, it comprises over 96 fundus abnormalities and disease categories, including both common and rare diseases, such as retinal vascular diseases, macular diseases, vitreous diseases, optic nerve diseases, congenital anomalies, and inflammatory diseases, among others. Fig. 2 (a) shows the words that appear most frequently. More detailed retinal categories are presented in the supplementary material, providing a broader perspective on the extensive coverage of the dataset.

III-B3 Detailed Captions

Fig. 2 (b) shows the distribution of caption lengths in MM-Retinal V2. In the CFP modality, 46.8% and 35.0% of captions range from 1 to 25 words and from 26 to 50 words, respectively. In the FFA modality, 57.8% of captions exceed 25 words. In the OCT modality, 91.3% of captions range from 1 to 50 words. Moreover, a small percentage of texts exceed 50 words, reaching nearly 101 words. This demonstrates that the textual captions in MM-Retinal V2 are detailed and comprehensive, providing accurate and rich descriptions that effectively convey the information present in the images.

III-B4 Extensive Vocabulary

MM-Retinal V2 captions encompass disease diagnoses, detailed lesion attributes (such as color, shape, and appearance), clinical symptoms, and post-treatment efficacy, using a rich vocabulary for comprehensive descriptions. Fig. 2(c) compares vocabulary size and lexical diversity across various public datasets and languages. Notably, since MM-Retinal V2 does not generate captions by merely expanding category names with fixed templates as in [33], its vocabulary is exceptionally rich and diverse, exhibiting high lexical diversity.
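For reference, the two statistics plotted in Fig. 2(c) can be computed as below; the simple letters-only tokenizer is an assumption made for illustration.

```python
# Vocabulary size (unique words over all captions) and lexical diversity
# (average number of unique words per caption), as defined for Fig. 2(c).
import re

def vocab_stats(captions):
    tokenize = lambda s: re.findall(r"[a-zA-Z]+", s.lower())
    token_sets = [set(tokenize(c)) for c in captions]
    vocabulary_size = len(set().union(*token_sets))
    lexical_diversity = sum(len(s) for s in token_sets) / len(captions)
    return vocabulary_size, lexical_diversity

size, diversity = vocab_stats([
    "Scattered hard exudates and microaneurysms in the posterior pole.",
    "Cystoid macular edema with subretinal fluid on the macular scan.",
])
```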

IV KeepFIT V2


Figure 3: The architecture of the proposed KeepFIT V2. KeepFIT V2 generally complies with the vision-language pretraining paradigm and introduces four specific parts, including a preliminary textual knowledge pretraining, a semantics-oriented knowledge extraction module, an appearance-oriented knowledge extraction module, and an expert knowledge refinement module. The image encoder and the text encoder extract features and encourage modality alignment by contrastive learning. The semantics-oriented and appearance-oriented knowledge extraction modules are introduced to distill expert knowledge from MM-Retinal V2. Subsequently, the text refinement module injects the obtained knowledge into public datasets to enhance model pretraining.

In this section, we introduce the proposed KeepFIT V2, a new knowledge-enhanced multi-modal foundation model designed for retinal image analysis. Compared to the previous version [40], KeepFIT V2 firmly follows the vision-language pretraining paradigm and adopts a more effective hybrid image-text knowledge injection approach by leveraging the high-quality MM-Retinal V2 dataset. Therefore, such an elite knowledge spark can be transferred into general vision-language pretraining to enhance model performance. The framework of KeepFIT V2 is illustrated in Fig. 3.

IV-A Vision-Language Pretraining Framework

KeepFIT V2 is trained on MM-Retinal V2 $m$ and public retinal datasets $p$ that only encompass category-level labels. Due to the effectiveness of the vision-language pretraining paradigm, CLIP[30] is applied as the backbone of KeepFIT V2 for multi-modal learning. Specifically, KeepFIT V2 comprises two encoders. Provided with a set of image-text pairs $\{X_i, Y_i\}_{i=1}^{N}$, where $X_i$ represents the image and $Y_i$ is the corresponding text, the image is processed by the vision encoder $E_v$ to extract the visual feature $V_i$, while the text is fed into the text encoder $E_t$ to obtain the textual feature $T_i$. For categorical public datasets, a template is needed to convert the category label into text, such as "A fundus photograph of [class name]" for the CFP modality, and a label augmentation is applied following FLAIR[33]. Then, the features are projected into a shared space by modality-specific projectors $P_v$ and $P_t$ to ensure that the feature dimension $d$ is consistent across modalities for contrastive learning. Let $\theta$ and $\phi$ denote the parameters of the image encoder and text encoder, respectively. The generated image feature $V_i$ and text feature $T_i$ can be formulated as follows:

V_{i}=P_{v}\circ E_{v}(X_{i};\theta)\in\mathbb{R}^{d},\quad T_{i}=P_{t}\circ E_{t}(Y_{i};\phi)\in\mathbb{R}^{d}.   (1)

To eliminate the modality gap within the shared space after projection, we use a contrastive loss to align the two modalities. Given an image-text pair, the image-to-text and text-to-image similarities after applying a softmax function are calculated by:

U_{v2t}(V_{i})=\frac{\exp(S(V_{i},T_{i})/\tau)}{\sum_{j=1}^{\mathcal{B}}\exp(S(V_{i},T_{j})/\tau)},   (2)
U_{t2v}(T_{i})=\frac{\exp(S(T_{i},V_{i})/\tau)}{\sum_{j=1}^{\mathcal{B}}\exp(S(T_{i},V_{j})/\tau)},   (3)

where $S(\cdot,\cdot)$ refers to cross-modality similarity, $\tau$ is a temperature parameter, and $\mathcal{B}$ is the batch size. Then, the image-text contrastive loss is defined as:

\mathcal{L}_{itc}=\frac{1}{2}\,\mathbb{E}_{(V,T)\sim\mathcal{B}}\big[CE(G_{v2t}(V),U_{v2t}(V))+CE(G_{t2v}(T),U_{t2v}(T))\big].   (4)

To achieve better image-text alignment, Eq. (4) is employed on both MM-Retinal V2 and the categorical public datasets. For MM-Retinal V2, the matching labels $G$ are identity matrices of dimension $|\mathcal{B}|\times|\mathcal{B}|$. On the other hand, for public datasets that only provide category labels, negative samples for the contrastive task come from different categories. Thus, the matching labels are symmetric matrices of size $|\mathcal{B}|\times|\mathcal{B}|$.
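A minimal PyTorch sketch of this objective is given below; the soft-target cross-entropy and the tensor names are assumptions that mirror Eqs. (2)-(4) rather than the released implementation.

```python
import torch
import torch.nn.functional as F

def itc_loss(V, T, G, tau: float = 0.07):
    """V, T: (B, d) L2-normalized image/text features; G: (B, B) matching-label matrix."""
    logits = V @ T.t() / tau                       # cross-modality similarities S(.,.)/tau
    def soft_ce(logit, target):
        target = target / target.sum(dim=1, keepdim=True)  # row-normalized soft labels
        return -(target * F.log_softmax(logit, dim=1)).sum(dim=1).mean()
    return 0.5 * (soft_ce(logits, G) + soft_ce(logits.t(), G.t()))  # Eq. (4)

B, d = 8, 512
V = F.normalize(torch.randn(B, d), dim=-1)
T = F.normalize(torch.randn(B, d), dim=-1)
G_mm = torch.eye(B)                                   # MM-Retinal V2: one-to-one pairs
labels = torch.randint(0, 3, (B,))                    # public data: category labels only
G_pub = (labels[:, None] == labels[None, :]).float()  # same-category samples are positives
loss = itc_loss(V, T, G_mm) + itc_loss(V, T, G_pub)
```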

IV-B Preliminary Textual Knowledge Pretraining

Following FLAIR[33], we adopt ResNet50[12] initialized with ImageNet pre-trained parameters as the image encoder and choose the architecture of BioClinicalBERT[3] as the text encoder. As mentioned in Section III-A3, extensive professional texts in the medical field, especially ophthalmology, serve as a preliminary expert knowledge source for the diagnosis of fundus diseases. Hence, we propose to exploit such abundant and profound knowledge from MM-Retinal-Text to better adapt the text encoder of KeepFIT V2. Specifically, we pretrain the text encoder on MM-Retinal-Text in a masked language modeling (MLM) fashion, and the resulting parameters are used to initialize the text encoder. Experiments in Table VII show that this training step yields stronger performance and enhances the text encoder's capability to capture intricate medical terminology and context, thus facilitating better alignment with the image encoder in our KeepFIT V2 framework.
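A hedged sketch of this MLM warm-up with Hugging Face Transformers follows; the corpus file name, the hyperparameters, and the particular BioClinicalBERT checkpoint are placeholders rather than the paper's exact configuration.

```python
from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)
from datasets import load_dataset

checkpoint = "emilyalsentzer/Bio_ClinicalBERT"           # assumed BioClinicalBERT weights
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForMaskedLM.from_pretrained(checkpoint)

corpus = load_dataset("text", data_files={"train": "mm_retinal_text.txt"})  # placeholder file
tokenized = corpus.map(lambda b: tokenizer(b["text"], truncation=True, max_length=256),
                       batched=True, remove_columns=["text"])
collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="mlm_ckpt", num_train_epochs=3,
                           per_device_train_batch_size=32, learning_rate=5e-5),
    train_dataset=tokenized["train"],
    data_collator=collator,
)
trainer.train()   # the resulting encoder weights initialize the KeepFIT V2 text encoder
```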

IV-C Hybrid Image-Text Knowledge Injection

In light of the MM-Retinal V2 dataset incorporating a wealth of fundus image-text expert knowledge, we consider injecting expert knowledge from MM-Retinal V2 into public datasets to promote vision-language pretraining. By jointly training the model on MM-Retinal V2 and public datasets, we seek to enhance the model’s understanding of fundus images and related textual information.

To accomplish this, three key obstacles must be addressed. The first is what transferable knowledge should be extracted and where the extracted knowledge should be injected. The second is how to extract the knowledge effectively. The third is how to inject the extracted knowledge. In the following subsections, we systematically address these problems.

IV-C1 Elite Knowledge Spark

By comparing the images from the two data sources, namely the public datasets and our MM-Retinal V2 dataset, we observe that MM-Retinal V2 exhibits a high degree of similarity with the public ones and covers almost all the common retinal disease categories. However, the textual content of the public datasets consists of simple expansions of category labels based on fixed text templates, while the texts in MM-Retinal V2 are extensive and lexically diverse. This textual discrepancy reveals where knowledge dissemination is most needed. Consequently, to address the first obstacle, the image-guided texts of MM-Retinal V2 are regarded as the source for knowledge extraction, and the textual content of the public datasets as the destination where knowledge is injected. This allows the knowledge from MM-Retinal V2 to act as a spark, spreading to and enriching the public datasets so that both jointly contribute to the pretraining of KeepFIT V2.

IV-C2 Semantics-Oriented Expert Knowledge Extraction

We next investigate how to extract the expert knowledge from MM-Retinal V2. As mentioned above, MM-Retinal V2 shows minimal domain differences with public datasets and encompasses their fundus disease categories, making it ideal for knowledge extraction. Hence, we propose a semantics-oriented expert knowledge extraction module based on multi-head cross attention[35] to equip the model with high-level semantic retrieval ability. In particular, we perform visual matching on the images from two data sources and weight the corresponding text from the MM-Retinal V2 as expert knowledge based on the matching score.

Given the image features $V_p$, $V_m$ from the image encoder $E_v$ and the text features $T_p$, $T_m$ from the text encoder $E_t$, we use the image features of the public datasets as the $query$ of the cross attention, the image features of MM-Retinal V2 as the $key$, and the text features of MM-Retinal V2 as the $value$. Subsequently, matrix multiplication followed by the softmax function is applied to compute the attention weights $\Psi^{h}$ for head $h\in[H]$, which represent similarity scores between images:

Q_{i}^{h}=V_{p,i}\cdot W_{Q}^{h}\in\mathbb{R}^{d/H},   (5)
K_{j}^{h}=V_{m,j}\cdot W_{K}^{h}\in\mathbb{R}^{d/H},   (6)
\Psi^{h}_{ij}=\mathrm{softmax}\!\left(\frac{Q_{i}^{h}{K_{j}^{h}}^{T}}{\sqrt{d/H}}\right),   (7)

where $p$ denotes the public datasets that only have category labels, $m$ denotes the MM-Retinal V2 dataset, and $W_{Q}^{h}$, $W_{K}^{h}$ are learnable projection matrices.

Next, the similarity scores are used to reweight the text features of MM-Retinal V2, assigning different levels of attention to the text features based on the semantic similarity between image features. The semantics-level expert knowledge $EK_{s}$ from MM-Retinal V2 can be formulated as:

V_{j}^{h}=T_{m,j}\cdot W_{V}^{h}\in\mathbb{R}^{d/H},   (8)
EK_{s}^{h}=\Psi^{h}_{ij}\cdot V^{h}_{j},   (9)
EK_{s}=\mathrm{concat}(EK_{s}^{h})_{h=1}^{H}\cdot W_{O},   (10)

where $W_{V}^{h}$ and $W_{O}$ are learnable projection matrices.

When images from the public datasets and the MM-Retinal V2 demonstrate high similarity, their corresponding texts are expected to align more closely. As a result, higher similarity scores lead to greater text retention from MM-Retinal V2, thereby enriching the public datasets with relevant and complementary textual knowledge in a semantic visual matching way.
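A minimal sketch of this matching step is shown below, implemented with a standard multi-head attention layer; treating the batch axis as the attention sequence and the module shapes are assumptions for illustration.

```python
import torch
import torch.nn as nn

class SemanticKnowledgeExtractor(nn.Module):
    """Public image features query MM-Retinal V2 image features (keys) to reweight
    the paired MM-Retinal V2 text features (values), as in Eqs. (5)-(10)."""
    def __init__(self, d: int = 512, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim=d, num_heads=heads, batch_first=True)

    def forward(self, V_p, V_m, T_m):
        # V_p: (Bp, d) public image features; V_m, T_m: (Bm, d) MM-Retinal image/text features
        q, k, v = V_p.unsqueeze(0), V_m.unsqueeze(0), T_m.unsqueeze(0)
        EK_s, _ = self.attn(q, k, v)   # image-similarity-weighted expert text knowledge
        return EK_s.squeeze(0)

extractor = SemanticKnowledgeExtractor()
EK_s = extractor(torch.randn(16, 512), torch.randn(16, 512), torch.randn(16, 512))
```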

IV-C3 Appearance-Oriented Expert Knowledge Extraction

In addition to high-level semantic visual matching, we further explore matching the visual features between the public datasets and MM-Retinal V2 at the level of low-level appearance. Detailed appearance features can be derived from image generative learning. Therefore, combining semantic representation from contrastive learning with appearance representation from generative learning is a complementary way to achieve this goal. Specifically, we employ a vector quantization (VQ) approach to convert semantic visual features into discrete tokens for appearance-oriented knowledge extraction.

Fully representing continuous retinal features from the image encoder of KeepFIT V2 by quantized discrete tokens is challenging, as these retinal image features are highly semantic and contain limited low-level visual information like detailed lesion appearance. To address this, image tokenization in appearance-oriented knowledge extraction is inspired by index backpropagation quantization (IBQ) [32], which leverages a large-scale, high-dimensional codebook with efficient utilization. Moreover, IBQ allows the joint optimization of all codebook embeddings, effectively preventing codebook collapse. In this work, the codebook is trained in advance, following [32], using the same training data as KeepFIT V2, which includes images from MM-Retinal V2 and public retinal datasets.

As illustrated in the lower right of Fig. 3, the image features $V$ from the image encoder comprise two parts: the flattened features processed through the projector $P_v$, and the unflattened feature maps that bypass the projector. The visual feature maps are first projected by a convolutional layer to achieve dimensional consistency. Then a quantization process tokenizes the contiguous visual feature maps into discrete tokens using a fixed codebook $C\in\mathbb{R}^{K\times D}$, where $K$ is the codebook size and $D$ is the code dimension. First, the dot product between the visual feature and all code embeddings $C_k$ is calculated, followed by the softmax and one-hot functions to obtain probabilities:

logits=[V^{T}C_{1},V^{T}C_{2},\ldots,V^{T}C_{K}]^{T}\in\mathbb{R}^{K},   (11)
Ind_{soft}=\mathrm{softmax}(logits),   (12)
Ind_{hard}=\mathrm{OneHot}(\mathrm{argmax}(Ind_{soft})).   (13)

Afterward, the gradients of the soft one-hot distribution are transferred to the hard one-hot index:

Ind=Ind_{hard}-sg[Ind_{soft}]+Ind_{soft},   (14)

where $sg[\cdot]$ denotes the stop-gradient operation.

After acquiring the index, the discrete code $\mathcal{Q}$ obtained by IBQ is:

\mathcal{Q}=Ind^{T}C.   (15)
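The quantization step of Eqs. (11)-(15) can be sketched as follows; the codebook size and dimension follow Section V-C, while everything else is an illustrative assumption.

```python
import torch
import torch.nn.functional as F

def ibq_quantize(feat: torch.Tensor, codebook: torch.Tensor) -> torch.Tensor:
    """feat: (N, D) continuous visual features; codebook: (K, D) frozen code embeddings."""
    logits = feat @ codebook.t()                          # dot products V^T C_k, Eq. (11)
    ind_soft = F.softmax(logits, dim=-1)                  # Eq. (12)
    ind_hard = F.one_hot(ind_soft.argmax(dim=-1), codebook.size(0)).float()  # Eq. (13)
    ind = ind_hard - ind_soft.detach() + ind_soft         # straight-through trick, Eq. (14)
    return ind @ codebook                                 # discrete tokens Q, Eq. (15)

codebook = torch.randn(16384, 256)    # K = 16,384 codes of dimension D = 256 (Sec. V-C)
tokens = ibq_quantize(torch.randn(64, 256), codebook)
```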

Similarly, multi-head cross-attention is leveraged for knowledge extraction in an appearance-oriented way. The quantized vector $\mathcal{Q}_{p}$ from the public datasets is used as the $query$, $\mathcal{Q}_{m}$ from MM-Retinal V2 as the $key$, and the text feature $T_{m}$ from MM-Retinal V2 as the $value$. Consequently, Eqs. (5) and (6) become:

Q_{i}^{h}=\mathcal{Q}_{p,i}\cdot W_{Q}^{h}\in\mathbb{R}^{d^{\prime}/H},   (16)
K_{j}^{h}=\mathcal{Q}_{m,j}\cdot W_{K}^{h}\in\mathbb{R}^{d^{\prime}/H}.   (17)

The final extracted appearance-level expert knowledge $EK_{a}$ is computed by:

EK_{a}^{h}=\sigma_{\tau}(Q^{h}_{i}\cdot K^{h}_{j})\cdot V^{h}_{j},   (18)
EK_{a}=\mathrm{concat}(EK_{a}^{h})_{h=1}^{H}\cdot W_{O},   (19)

where $\sigma_{\tau}$ denotes the softmax function with temperature $\tau$.

The hybrid expert knowledge extraction module empowers KeepFIT V2 with both global semantic representation ability from contrastive learning and local appearance representation ability from generative learning, which is one of the major improvements compared to KeepFIT V1. Through this module, the cross-modality alignment capability and representation capability of KeepFIT V2 are significantly enhanced, resulting in more precise visual matching and knowledge injection.

IV-C4 Expert Knowledge Refinement

The last significant problem is how to inject the obtained knowledge into the text of the public datasets to assist vision-language pretraining. Considering that the primary distinction between the public datasets and MM-Retinal V2 lies in the depth and granularity of their texts, we propose the expert knowledge refinement losses $\mathcal{L}_{EK}^{s}$ for semantics-oriented knowledge refinement and $\mathcal{L}_{EK}^{a}$ for appearance-oriented knowledge refinement. These losses encourage the text features of the public datasets to closely resemble the knowledge extracted from MM-Retinal V2 that corresponds to their image features. To achieve this, the mean squared error (MSE) loss is utilized as the basis:

\mathcal{L}_{EK}^{s}=\frac{1}{\mathcal{B}}\sum_{i=1}^{\mathcal{B}}(EK_{s}-T_{p})^{2},   (20)
\mathcal{L}_{EK}^{a}=\frac{1}{\mathcal{B}}\sum_{i=1}^{\mathcal{B}}(EK_{a}-T_{p})^{2}.   (21)

The above formulas use the knowledge extracted at two levels to refine the text of the public datasets, creating a complementary and synergistic effect that makes the text refinement more comprehensive.

TABLE I: Zero-Shot Classification Performance in CFP Modality across Four Datasets (%). Each cell reports ACC / AUC / AUPR / AVG.
Model | REFUGE | ODIR200×3 | Retina | iChallenge-AMD
VLPs with large-scale image-text paired data:
ViLRef | 64.3 / 76.7 / 69.7 / 70.2 | 88.3 / 96.2 / 92.9 / 92.5 | 54.6 / 80.7 / 62.6 / 66.0 | 84.3 / 95.2 / 94.0 / 91.2
RET-CLIP | 81.0 / 94.6 / 90.0 / 88.5 | 88.2 / 96.6 / 93.0 / 92.6 | 61.7 / 89.6 / 76.6 / 76.0 | 87.2 / 94.5 / 93.2 / 91.6
RetiZero | 53.3 / 82.5 / 70.5 / 68.8 | 71.3 / 97.9 / 96.0 / 88.4 | 42.4 / 78.6 / 58.6 / 65.0 | 66.6 / 87.9 / 87.0 / 80.5
VLPs with small-scale elite image-text paired data / public categorical data:
FLAIR | 84.7 / 92.6 / 90.5 / 89.3 | 40.3 / 87.5 / 76.9 / 68.2 | 33.8 / 69.9 / 45.9 / 49.9 | 69.5 / 79.5 / 75.7 / 74.9
KeepFITV1 | 84.9 / 94.1 / 89.3 / 89.4 | 81.2 / 92.9 / 87.5 / 87.2 | 42.9 / 77.4 / 52.0 / 57.4 | 76.5 / 88.6 / 86.2 / 83.8
KeepFITV2 | 89.6 / 96.2 / 92.7 / 92.8 | 80.8 / 93.1 / 87.6 / 87.2 | 43.6 / 80.8 / 58.8 / 61.1 | 80.4 / 90.3 / 87.1 / 85.9
KeepFITV2L | 86.2 / 96.9 / 94.8 / 92.6 | 85.2 / 94.8 / 90.3 / 90.1 | 45.1 / 78.1 / 57.9 / 60.4 | 82.7 / 91.2 / 90.2 / 88.0

IV-C5 Overall Training Objective

As depicted in Fig. 3, the pretrained codebook is frozen, and we simultaneously optimize the parameters of the image encoder, text encoder, and hybrid knowledge extraction modules. Finally, the overall training objective is defined as:

\mathcal{L}=\mathcal{L}_{itc}^{p}+\mathcal{L}_{itc}^{m}+\lambda_{1}\mathcal{L}_{EK}^{s}+\lambda_{2}\mathcal{L}_{EK}^{a},   (22)

where $\lambda_{1}$ and $\lambda_{2}$ are hyperparameters, set to 100 and $1\times 10^{4}$ in our implementation, which achieves the best performance.
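The overall objective can be assembled as in the short sketch below; the helper takes precomputed contrastive losses, and the loss weights follow the values stated above.

```python
import torch

lambda_1, lambda_2 = 100.0, 1e4   # weights reported for Eq. (22)

def total_loss(L_itc_p, L_itc_m, EK_s, EK_a, T_p):
    """Combine the Eq. (22) terms; L_itc_p / L_itc_m are the Eq. (4) losses on the
    public datasets and on MM-Retinal V2, respectively."""
    L_ek_s = ((EK_s - T_p) ** 2).mean()   # semantics-oriented refinement, Eq. (20)
    L_ek_a = ((EK_a - T_p) ** 2).mean()   # appearance-oriented refinement, Eq. (21)
    return L_itc_p + L_itc_m + lambda_1 * L_ek_s + lambda_2 * L_ek_a

loss = total_loss(torch.tensor(1.2), torch.tensor(0.9),
                  torch.randn(16, 512), torch.randn(16, 512), torch.randn(16, 512))
```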

V Experiments

TABLE II: Few-Shot Classification Performance in CFP Modality across Four Datasets. (%)
Model Clip-Adapter Tip-Adapter Tip-Adapter-F AVG
1 5 10 1 5 10 1 5 10
ACC AUC AUPR ACC AUC AUPR ACC AUC AUPR ACC AUC AUPR ACC AUC AUPR ACC AUC AUPR ACC AUC AUPR ACC AUC AUPR ACC AUC AUPR
REFUGE
VLPs with large-scale image-text paired data
\hdashlineViLRef 78.1 88.1 83.6 81.2 89.0 87.5 83.5 90.1 88.0 62.9 69.5 64.1 66.0 72.2 65.3 69.3 75.2 67.6 60.3 67.7 62.8 68.4 76.4 65.5 77.9 82.2 73.4 70.5
RET-CLIP 84.0 91.5 88.2 80.7 90.2 88.5 84.6 90.1 89.3 75.3 86.8 82.3 79.2 86.9 83.8 81.3 87.2 85.6 78.5 88.1 84.4 79.8 90.7 86.6 76.0 87.9 85.2 84.1
RetiZero 72.1 82.5 73.2 76.2 85.0 81.9 78.7 86.5 84.7 58.7 76.4 68.1 60.3 78.2 71.4 61.3 79.7 75.5 56.9 79.0 72.2 61.9 74.6 70.8 63.6 76.2 72.9 80.1
VLPs with small-scale elite image-text paired data / public categorical data
\hdashlineFLAIR 82.9 90.1 87.9 83.6 89.8 88.5 83.3 89.4 88.2 82.2 89.7 87.5 82.5 88.6 87.1 82.7 87.1 86.6 82.0 90.2 88.1 82.5 91.2 88.9 82.5 91.5 89.5 87.4
KeepFITV1 74.4 93.6 89.0 81.0 89.9 87.6 83.3 91.3 89.2 82.2 93.5 87.9 80.0 93.1 87.9 80.8 92.8 87.7 81.2 93.4 87.4 83.3 94.0 89.0 81.9 92.0 87.6 87.8
KeepFITV2 88.8 95.6 90.1 82.7 91.4 87.7 85.0 96.0 91.4 86.0 95.0 90.3 85.8 94.8 90.0 86.5 94.6 90.5 85.1 93.6 88.2 87.8 95.0 89.7 86.0 95.2 90.5 90.1
KeepFITV2L 90.0 96.2 93.4 81.1 90.3 88.9 91.1 94.5 94.4 86.8 94.6 91.4 86.2 94.4 91.5 87.9 94.5 91.7 86.9 95.7 91.4 89.0 95.4 92.2 88.3 93.3 91.4 91.5
ODIR200×\times3
VLPs with large-scale image-text paired data
\hdashlineViLRef 88.0 96.9 94.6 88.3 97.5 95.4 88.5 97.6 95.5 82.2 94.3 88.1 82.3 94.4 88.0 82.3 94.5 88.2 82.8 93.8 87.6 80.8 93.1 85.7 82.2 93.7 87.0 74.7
RET-CLIP 89.7 98.0 96.3 90.0 98.0 96.3 90.3 98.2 96.8 83.3 94.4 90.2 84.5 95.8 92.9 85.7 96.7 94.7 82.8 94.6 90.8 85.8 96.5 93.9 88.8 97.1 95.0 84.9
RetiZero 89.7 98.1 96.5 92.2 98.3 96.8 91.5 98.4 97.0 71.8 96.7 93.7 73.0 97.0 94.1 75.5 97.2 94.6 71.7 96.4 93.0 76.3 96.8 93.7 80.0 97.6 95.1 95.4
VLPs with small-scale elite image-text paired data / public categorical data
\hdashlineFLAIR 72.0 89.4 83.2 83.2 92.9 88.4 87.0 95.7 92.4 39.7 83.7 69.9 41.2 84.4 71.3 42.7 85.1 72.4 40.7 83.9 70.0 45.5 85.7 73.3 53.8 88.0 77.5 73.8
KeepFITV1 84.2 96.3 93.9 87.7 96.9 94.9 89.7 97.4 95.5 81.7 93.4 88.2 83.3 94.2 89.6 84.2 95.0 91.0 81.3 93.9 89.3 84.2 94.8 90.8 86.3 95.9 92.8 90.6
KeepFITV2 81.2 94.3 90.7 86.5 95.8 93.0 87.7 96.5 94.2 84.3 94.2 90.2 86.0 95.1 91.7 86.3 95.6 92.8 84.2 94.0 89.9 85.1 95.1 91.8 87.0 95.7 92.9 90.8
KeepFITV2L 85.0 95.9 93.2 87.5 96.9 94.8 89.3 97.4 95.4 84.2 94.3 90.1 87.2 95.2 91.9 87.7 95.8 92.9 85.0 94.5 89.7 87.3 95.6 92.7 87.8 96.2 93.3 91.7
Retina
VLPs with large-scale image-text paired data
\hdashlineViLRef 61.2 84.9 70.2 61.9 84.8 68.4 66.0 86.3 72.3 50.2 77.4 58.0 50.7 77.6 58.2 50.9 77.8 58.6 52.0 76.1 56.0 52.7 77.0 56.6 52.2 77.3 57.3 65.7
RET-CLIP 65.2 88.3 75.3 65.7 86.3 72.1 67.4 87.1 72.6 57.9 83.4 66.3 57.3 83.9 66.5 58.2 84.7 67.9 57.0 83.6 66.6 59.6 83.9 67.2 65.5 84.8 68.6 72.0
RetiZero 58.4 83.7 67.2 59.4 82.0 64.4 62.6 84.4 68.0 41.3 75.7 54.4 42.0 75.7 54.7 42.9 75.9 55.5 41.9 77.0 55.7 41.3 76.6 54.6 45.9 77.9 58.3 70.0
VLPs with small-scale elite image-text paired data / public categorical data
\hdashlineFLAIR 41.1 66.8 46.6 42.9 71.2 50.0 53.6 78.8 59.0 33.9 67.3 45.2 34.1 67.5 45.3 34.3 67.8 45.6 35.8 66.3 45.2 36.4 66.3 43.8 39.2 68.5 47.5 51.9
KeepFITV1 57.6 82.1 65.2 62.7 83.1 69.3 66.0 85.5 72.7 41.8 76.3 54.4 42.3 77.2 55.4 43.4 78.1 56.8 41.9 77.0 55.6 43.3 79.3 58.2 45.2 78.9 59.4 63.3
KeepFITV2 59.2 82.2 66.1 56.4 80.7 61.7 64.7 85.2 70.0 43.4 79.3 59.1 44.7 79.6 59.7 44.7 80.4 61.1 43.2 79.9 59.5 43.4 79.9 61.6 47.8 81.5 63.3 64.4
KeepFITV2L 58.2 81.2 65.9 58.0 81.0 66.0 62.4 82.2 66.4 44.8 77.0 57.8 45.8 78.3 59.6 46.1 78.8 60.8 44.9 77.6 59.1 49.5 81.6 63.6 51.0 80.2 64.0 64.5
iChallenge-AMD
VLPs with large-scale image-text paired data
\hdashlineViLRef 83.9 94.1 92.6 83.0 93.4 92.3 89.2 95.1 93.5 76.9 88.3 84.8 77.5 88.3 84.8 77.5 88.4 84.8 78.1 87.1 83.4 77.9 87.8 84.4 79.6 86.4 83.1 85.8
RET-CLIP 82.4 94.0 82.8 84.1 94.0 92.6 85.9 95.7 94.1 80.7 88.6 87.0 79.7 88.6 87.0 77.7 89.0 87.3 77.9 87.9 86.3 80.1 90.1 89.4 81.2 95.9 90.6 87.1
RetiZero 81.6 87.5 86.9 79.7 89.5 88.2 81.9 92.2 90.9 66.9 81.4 79.5 68.1 81.8 79.7 68.4 82.5 80.3 69.1 84.1 82.3 68.8 81.1 79.2 70.1 87.2 85.0 86.5
VLPs with small-scale elite image-text paired data / public categorical data
\hdashlineFLAIR 65.3 79.1 75.1 77.5 85.6 81.8 81.3 89.8 85.9 68.9 80.1 75.0 68.9 80.7 75.6 69.1 81.5 76.2 70.7 79.9 74.3 69.2 79.7 74.3 73.9 84.6 79.6 77.1
KeepFITV1 69.4 88.6 86.9 78.8 91.3 88.5 79.4 92.3 89.7 76.0 87.9 86.4 76.1 88.6 86.9 75.5 89.7 87.8 75.7 89.3 87.7 76.4 89.6 87.7 78.9 90.6 88.8 84.6
KeepFITV2 78.2 88.9 86.9 81.6 91.7 88.7 84.0 92.2 88.8 80.9 88.0 87.4 82.1 87.5 86.5 81.5 87.9 86.9 81.5 88.7 87.3 81.9 87.8 86.4 82.0 89.7 87.9 86.0
KeepFITV2L 80.7 90.4 89.6 79.3 89.7 88.8 82.7 92.9 90.7 82.6 89.2 88.8 82.4 89.1 88.7 82.0 89.2 88.7 80.7 89.9 88.8 79.8 88.7 87.9 81.7 90.8 89.5 86.8

V-A Datasets

V-A1 CFP Modality

KeepFIT V2 in the CFP modality is trained on the proposed image-text MM-Retinal V2 and the categorical public retinal datasets from FLAIR[33], which consist of over 190K images across 96 categories. Besides, we additionally collect over 80K samples from public categorical retinal datasets. Using all 270K public images along with MM-Retinal V2, we train KeepFIT V2L. For evaluation, we utilize REFUGE[24], ODIR200×3[34], iChallenge-AMD[9], Retina[1], FIVES[16], and APTOS[2] to perform various downstream classification tasks. These evaluation datasets encompass several common fundus diseases, including glaucoma, pathologic myopia, cataract, diabetic retinopathy, age-related macular degeneration, and other retinal disorders.

V-A2 FFA Modality

The pretraining for FFA modality is conducted on MM-Retinal V2 and FFA-IR[21], an image-text paired dataset comprising 10,790 reports and 1,048,584 FFA images, spanning 46 retinal lesion categories. We extensively evaluate KeepFIT V2 using two FFA public datasets. MPOS[36] includes 600 images across four fundus disease categories. AngioReport (APTOS2023)[44] contains a total of over 50K images, covering 24 distinct categories.

V-A3 OCT Modality

Besides MM-Retinal V2, we also collect eleven public OCT datasets with only category labels for KeepFIT V2 pretraining, totaling over 181K images. In addition, the OCTID[11] and OCTDL[17] datasets are used for evaluation, covering 9 common fundus diseases.

V-B Methods for Comparison

For the CFP modality, we execute the downstream tasks across six methods to ensure a comprehensive comparison. Specifically, the methods for comparison can be divided into two groups. The first group is VLPs trained with categorical public datasets / small-scale image-text paired data, which includes FLAIR[33], KeepFIT V1[40], KeepFIT V2, and KeepFIT V2L. In contrast, the second group is VLPs trained with large-scale private image-text paired data, including RET-CLIP[7], ViLReF[41], and RetiZero[37]. These models are trained on 193,865, 451,956, and 341,896 image-text pairs, respectively. Compared to RET-CLIP, ViLReF, and RetiZero, KeepFIT V1 and V2 utilize only 1% of the image-text paired data. Moreover, RETFound[46] is an exception: it is trained using a masked image modeling (MIM) approach, and owing to the absence of a text encoder, RETFound can only be evaluated in the linear probing setting.

Due to the scarcity of publicly released comparable models for the FFA and OCT modalities, we selected two generalist medical vision-language models for biomedical understanding: BiomedCLIP[43] and PubMedCLIP[8]. BiomedCLIP is trained on 15M biomedical image-text pairs from 4.4M scientific articles in PMC. PubMedCLIP is pre-trained on the ROCO dataset[26], which comprises over 80K samples spanning a wide range of medical imaging modalities, including ultrasound, X-rays, MRI, angiography, etc. Additionally, we also include CLIP as a comparison model, trained on the same public retinal datasets as our models rather than using its original pretrained weights.

V-C Implementation Details

Following the design of KeepFIT V1, we adopt the same image and text encoder architectures and hyperparameter settings. Specifically, all images are resized to a resolution of 512×512 and the texts are in English with a maximum length of 256 tokens. The multi-head cross-attention modules are trained from scratch with a feature dimension of 512. Our model is trained with the AdamW optimizer with a learning rate of $1\times 10^{-4}$ and a weight decay of $1\times 10^{-5}$. A cosine scheduler is used with a warm-up during the first epoch. The visual tokenizer is trained with a codebook size of 16,384 and a code dimension of 256, the default setting in [32]. All reported metrics are averaged across five cross-validation folds.
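A sketch of this optimization setup is given below; the model stand-in and the schedule lengths are placeholders, while the optimizer hyperparameters follow the values above.

```python
import math
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

model = torch.nn.Linear(512, 512)           # stands in for the full KeepFIT V2 model
steps_per_epoch, epochs = 1000, 15          # placeholder schedule lengths
optimizer = AdamW(model.parameters(), lr=1e-4, weight_decay=1e-5)

def lr_lambda(step: int) -> float:
    warmup = steps_per_epoch                 # warm up over the first epoch
    total = steps_per_epoch * epochs
    if step < warmup:
        return step / max(1, warmup)
    progress = (step - warmup) / max(1, total - warmup)
    return 0.5 * (1.0 + math.cos(math.pi * progress))   # cosine decay

scheduler = LambdaLR(optimizer, lr_lambda)
```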

V-D Evaluation of the CFP Modality



Figure 4: Linear probing classification performance in CFP modality across datasets (each row represents the results of a dataset).

V-D1 Zero-shot

Table I reports the zero-shot experimental results on four unseen downstream datasets in terms of ACC, AUC, and AUPR. In addition, ODIR200×3 includes two categories unseen during training, pathologic myopia and cataract. From an overall perspective, KeepFIT V2 and KeepFIT V2L surpass KeepFIT V1 on all the downstream datasets, achieving average score improvements of up to 3.4%, 2.9%, 3.7%, and 4.2% on REFUGE, ODIR200×3, Retina, and iChallenge-AMD, respectively. With the expansion of training data volume, although KeepFIT V2L shows a slight performance decline on REFUGE and Retina, it achieves an improvement of 2.9% on ODIR200×3 and 2.1% on iChallenge-AMD compared to KeepFIT V2, indicating that a larger volume of training data can lead to performance gains.

It is worth noting that ViLRef, RET-CLIP, and RetiZero were trained on private datasets of 200K to 450K image-text pairs, which have not been publicly released. This may explain the performance gap between VLP models trained on large-scale and small-scale image-text paired data. Nonetheless, KeepFIT V2 consistently outperforms all comparison models on the REFUGE dataset, achieving improvements of 22.6%, 4.3%, and 24.0%, respectively. These results highlight that, even with a modest high-quality image-text dataset, KeepFIT V2 can effectively incorporate expert knowledge from MM-Retinal V2 into foundational vision-language pretraining by spreading such an elite spark over public retinal datasets, demonstrating its exceptional generalization capability.

TABLE III: Zero-Shot Classification Performance in FFA Modality across Two Datasets (%). Each cell reports ACC / AUC / AUPR / AVG.
Model | Angiographic | MPOS
Generalist VLPs for biomedical understanding:
BiomedCLIP | 4.7 / 49.9 / 4.7 / 19.8 | 20.6 / 65.6 / 34.5 / 40.2
PubMedCLIP | 4.7 / 50.5 / 4.8 / 20.0 | 18.5 / 48.8 / 19.8 / 29.0
Specialist VLPs for retinal understanding:
CLIP | 14.2 / 59.5 / 5.7 / 26.5 | 31.5 / 62.8 / 32.3 / 42.2
KeepFITV1 | 11.3 / 62.0 / 6.1 / 26.5 | 31.2 / 64.7 / 33.4 / 43.1
KeepFITV2 | 15.1 / 69.0 / 9.1 / 31.1 | 64.8 / 89.7 / 73.8 / 76.1

V-D2 Few-Shot

Next, we assess the performance of the proposed KeepFIT V2 in low-data regimes by conducting few-shot classification experiments. These experiments are conducted by varying the number of shots (images per category) used for adaptation with Clip-Adapter [10] and Tip-Adapter [42]. From Table II, KeepFIT V2 and KeepFIT V2L consistently outperform KeepFIT V1, improving the average score over the ACC, AUC, and AUPR metrics by up to 3.7% on REFUGE, 1.1% on ODIR200×3, 1.2% on Retina, and 2.2% on iChallenge-AMD.
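For context, a hedged sketch of a Tip-Adapter-style inference step is given below; alpha, beta, and all tensors are illustrative placeholders, and the formula follows the published cache-model design rather than this paper's code.

```python
import torch
import torch.nn.functional as F

def tip_adapter_logits(test_feat, text_weights, cache_keys, cache_values,
                       alpha: float = 1.0, beta: float = 5.5):
    """test_feat: (N, d); text_weights: (d, C); cache_keys: (K, d); cache_values: (K, C)."""
    zero_shot = 100.0 * test_feat @ text_weights              # CLIP-style zero-shot logits
    affinity = test_feat @ cache_keys.t()                     # similarity to few-shot cache
    cache_logits = torch.exp(-beta * (1.0 - affinity)) @ cache_values
    return zero_shot + alpha * cache_logits

d, C, shots = 512, 3, 5
test_feat = F.normalize(torch.randn(8, d), dim=-1)
text_weights = F.normalize(torch.randn(d, C), dim=0)
cache_keys = F.normalize(torch.randn(C * shots, d), dim=-1)   # few-shot image features
cache_values = F.one_hot(torch.arange(C).repeat_interleave(shots), C).float()
logits = tip_adapter_logits(test_feat, text_weights, cache_keys, cache_values)
```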

Our KeepFIT V2 and KeepFIT V2L achieve top-2 performance on the REFUGE and iChallenge-AMD datasets, as well as top-3 performance on the ODIR200×3 dataset. Notably, they not only outperform FLAIR and KeepFIT V1 but also surpass models trained on large-scale image-text paired datasets. These results point to the superior representation and vision-language alignment capability of KeepFIT V2, underscoring its generalizability under limited image-text data resources.

V-D3 Linear Probing

To further evaluate the effectiveness and transferability of the proposed KeepFIT V2, we conduct linear probing experiments. Concretely, we use the frozen image encoder of KeepFIT V2 as the image feature extractor and add a linear layer as the classifier, whose parameters are fine-tuned on the downstream datasets.
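A minimal sketch of this protocol follows; the ResNet50 backbone stands in for the released KeepFIT V2 image encoder, and the batch is a placeholder.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

encoder = resnet50(weights=None)             # placeholder for the pretrained image encoder
encoder.fc = nn.Identity()                   # expose 2048-d pooled features
for p in encoder.parameters():
    p.requires_grad = False                  # keep the encoder frozen

num_classes = 2
classifier = nn.Linear(2048, num_classes)    # only this linear probe is trained
optimizer = torch.optim.AdamW(classifier.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

images = torch.randn(4, 3, 512, 512)         # placeholder downstream batch
labels = torch.randint(0, num_classes, (4,))
with torch.no_grad():
    feats = encoder(images)
loss = criterion(classifier(feats), labels)
loss.backward()
optimizer.step()
```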

The experimental results are presented in Fig. 4. Compared to VLPs trained on small-scale elite image-text paired data / public categorical data (i.e., FLAIR, RETFound, and KeepFIT V1), our KeepFIT V2 and KeepFIT V2L achieve significant performance improvements across all metrics. Furthermore, when compared to VLPs trained on large-scale image-text paired data (i.e., ViLRef, RET-CLIP, and RetiZero), KeepFIT V2 and KeepFIT V2L still maintain comparable or even superior performance. From the above experiments in the CFP modality, we conclude that no foundation model performs best across all datasets in every evaluation setting. This indicates that building a retinal foundation model that performs well across diverse scenarios and various disease types remains highly challenging and calls for further exploration.

V-E Evaluation of the FFA Modality

TABLE IV: Few-Shot Classification Performance in FFA Modality across Two Datasets. (%)
Model Clip-Adapter Tip-Adapter Tip-Adapter-F AVG
1 5 10 1 5 10 1 5 10
ACC AUC AUPR ACC AUC AUPR ACC AUC AUPR ACC AUC AUPR ACC AUC AUPR ACC AUC AUPR ACC AUC AUPR ACC AUC AUPR ACC AUC AUPR
Angiographic
Generalist VLPs for biomedical understanding
\hdashlineBiomedCLIP 12.3 63.0 9.8 25.3 73.2 16.4 34.6 78.6 22.2 5.6 56.2 6.5 6.7 58.7 6.8 7.3 61.7 7.2 5.4 56.5 6.6 8.9 60.0 7.7 15.0 66.1 9.9 29.2
PubMedCLIP 15.7 60.7 10.6 27.4 70.9 17.1 35.8 75.7 21.6 5.5 52.7 5.6 6.1 60.2 6.3 9.2 63.9 8.4 5.6 52.3 5.7 9.3 61.8 8.7 12.2 67.1 10.3 29.1
Specialist VLPs for retinal understanding
\hdashlineCLIP 20.6 67.4 13.3 41.8 79.0 30.7 53.2 84.1 42.0 7.2 59.2 6.3 7.6 60.1 6.5 8.2 61.1 6.9 7.3 59.3 6.5 9.0 62.7 7.7 15.4 67.1 10.5 33.4
KeepFITV1 24.0 70.8 16.4 42.2 79.0 31.8 51.9 85.2 41.9 10.9 64.6 6.8 13.1 66.4 7.6 14.4 68.2 9.8 10.9 65.0 6.9 17.4 69.8 12.6 29.4 76.4 22.2 37.6
KeepFITV2 24.1 72.8 16.5 39.1 79.2 27.0 48.2 83.9 35.6 13.2 71.3 10.3 17.5 73.2 12.9 22.5 75.3 16.6 13.5 71.7 10.3 25.9 76.8 18.5 39.4 82.4 30.4 41.0
MPOS
Generalist VLPs for biomedical understanding
\hdashlineBiomedCLIP 38.7 68.5 41.7 53.0 78.3 53.0 58.2 83.3 62.3 21.1 62.9 30.7 21.1 64.1 31.6 21.5 65.5 33.0 20.5 59.1 28.4 21.6 63.4 31.1 24.4 67.0 37 46.0
PubMedCLIP 28.1 57.8 31.2 42.9 72.0 43.7 49.4 77.0 52.0 16.4 47.1 20.4 17.7 50.0 22.5 21.7 54.2 25.7 17.1 49.3 21.6 24.0 56.2 27.7 24.7 61.8 32.5 38.7
Specialist VLPs for retinal understanding
\hdashlineCLIP 46.0 75.1 47.6 66.7 90.4 75.1 77.1 94.3 83.7 30.3 61.1 32.5 31.0 62.0 33.0 31.7 63.1 34.3 28.2 60.7 32.5 29.9 62.8 33.9 36.3 67.9 38.1 52.8
KeepFITV1 58.0 85.3 64.4 80.2 94.1 87.3 85.8 95.8 92.8 32.7 64.8 35.5 39.0 69.9 42.6 43.5 75.7 50.7 33.7 65.8 36.7 44.5 73.8 49.5 62.3 85.2 67.6 63.6
KeepFITV2 73.4 92.6 80.4 78.3 94.3 84.8 82.2 95.8 88.5 68.4 90.3 75.3 69.4 91.4 78.2 72.0 92.4 81.2 66.6 90.2 75.6 69.5 91.9 80.2 75.4 93.6 84.5 82.1



Figure 5: Linear probing classification performance in FFA and OCT modalities across datasets (each row represents the results of a dataset).

This subsection evaluates the performance of KeepFIT V2 across zero-shot, few-shot, and linear probing settings in FFA modality. Table III, Table IV, and Fig. 5 show that KeepFIT V2 achieves substantial improvements over all the compared methods. Under the zero-shot classification setting, it achieves overall performance increases of 4.6% and 33.0% for the Angiographic and MPOS datasets, respectively. Additionally, under the few-shot setting, it clearly outperforms the previous KeepFIT V1 by 3.4% and 18.5% for the Angiographic and MPOS datasets, respectively.

The performance improvement from KeepFIT V1 to KeepFIT V2 indicates that in the FFA modality, our hybrid image-text knowledge injection module, particularly the appearance-oriented component, effectively utilizes the detailed features learned by the image tokenizer to achieve precise image retrieval between MM-Retinal V2 and public datasets. This enables the effective injection of expert knowledge from the elite MM-Retinal V2 into the pretraining process of KeepFIT V2. Furthermore, it underscores the essential role of detailed features in the understanding and analysis of FFA images.

V-F Evaluation of the OCT Modality

TABLE V: Zero-Shot Classification Performance in OCT Modality across Two Datasets (%). Each cell reports ACC / AUC / AUPR / AVG.
Model | OCTDL | OCTID
Generalist VLPs for biomedical understanding:
BiomedCLIP | 21.3 / 61.6 / 21.7 / 34.9 | 20.8 / 68.4 / 43.5 / 44.2
PubMedCLIP | 14.9 / 51.0 / 16.2 / 27.4 | 11.9 / 48.2 / 21.1 / 27.1
Specialist VLPs for retinal understanding:
CLIP | 29.2 / 54.6 / 29.9 / 37.9 | 66.6 / 93.5 / 87.9 / 82.7
KeepFITV1 | 37.6 / 70.6 / 35.0 / 47.7 | 63.9 / 97.9 / 94.4 / 85.4
KeepFITV2 | 38.5 / 72.0 / 33.8 / 48.1 | 70.7 / 97.3 / 93.0 / 87.0
TABLE VI: Few-Shot Classification Performance in OCT Modality across Two Datasets. (%)
Model Clip-Adapter Tip-Adapter Tip-Adapter-F AVG
1 5 10 1 5 10 1 5 10
ACC AUC AUPR ACC AUC AUPR ACC AUC AUPR ACC AUC AUPR ACC AUC AUPR ACC AUC AUPR ACC AUC AUPR ACC AUC AUPR ACC AUC AUPR
OCTDL
Generalist VLPs for biomedical understanding
\hdashlineBiomedCLIP 27.4 65.8 25.1 39.2 75.0 33.4 48.6 82.6 43.0 18.5 56.4 18.8 19.1 58.3 19.5 19.7 60.7 20.5 17.5 55.9 19.1 19.5 60.6 20.3 21.7 63.7 23.2 38.3
PubMedCLIP 23.1 60.8 22.5 34.4 70.5 31.2 43.7 77.2 38.8 15.9 54.1 17.5 16.0 58.1 18.8 16.2 62.4 21.0 15.1 53.6 18.2 17.0 61.4 21.2 24.7 69.6 25.7 36.6
Specialist VLPs for retinal understanding
\hdashlineCLIP 40.4 79.8 40.4 55.4 88.1 54.1 58.4 90.9 58.9 29.6 55.0 27.5 30.3 57.3 28.9 31.2 59.5 30.5 30.0 56.1 27.6 32.3 63.4 33.4 41.9 68.7 40.8 48.5
KeepFITV1 41.3 79.7 40.5 52.3 86.8 54.1 55.7 90.5 59.6 37.8 68.8 34.3 38.4 70.5 25.1 39.2 72.1 36.3 38.5 69.3 34.6 38.7 72.6 36.2 41.1 78.2 40.1 53.0
KeepFITV2 46.2 81.3 46.6 57.3 89.4 59.8 61.8 92.3 66.8 37.8 70.0 33.5 38.5 71.7 34.2 38.8 73.5 35.0 37.6 70.3 33.3 41.6 77.0 37.6 43.2 82.6 43.0 55.6
OCTID
Generalist VLPs for biomedical understanding
\hdashlineBiomedCLIP 54.1 81.1 58.0 62.2 89.2 68.5 68.6 91.8 73.5 18.6 59.0 32.9 19.9 59.6 33.7 22.9 61.1 35.7 18.4 61.0 34.9 23.2 60.5 36.0 26.5 64.9 39.0 50.2
PubMedCLIP 31.0 69.5 41.8 58.8 86.7 64.9 64.9 89.8 71.3 21.5 53.0 24.8 26.9 61.6 29.3 31.6 71.1 37.1 19.2 55.4 26.5 33.3 68.8 36.8 43.8 79.1 51.7 50.0
Specialist VLPs for retinal understanding
\hdashlineCLIP 91.8 97.8 95.5 92.4 99.3 96.4 92.5 98.4 96.7 58.2 93.6 77.9 65.0 92.4 81.6 72.9 94.3 84.6 59.8 91.6 77.7 70.3 95.7 86.3 80.0 97.4 90.6 86.3
KeepFITV1 91.0 97.9 94.7 92.8 98.4 95.6 93.6 97.9 95.3 67.3 97.3 93.5 73.9 97.6 94.1 80.9 97.9 94.6 68.3 98.0 95.0 79.2 97.5 94.3 88.9 98.4 95.5 91.5
KeepFITV2 89.6 97.8 94.6 92.4 98.8 96.4 92.8 98.5 96.6 76.4 97.0 92.7 82.4 97.9 94.2 85.2 98.7 95.4 76.5 97.6 93.5 86.5 98.6 95.8 90.2 98.9 96.6 93.0

Finally, we examine the generalization capability and transferability of different vision-language pretraining models under zero-shot, few-shot, and linear probing settings in OCT modality. Table V, Table VI, and Fig. 5 demonstrate the comparison results. Similarly, KeepFIT V2 achieves the highest average score on OCTDL and OCTID in three scenarios. The enhancement can be attributed to the combination of high-level semantics-oriented and low-level appearance-oriented knowledge injection, which facilitates the vision-language alignment and feature understanding.

Experiments across three different modalities illustrate that the high-quality expert knowledge contained within the MM-Retinal V2 dataset significantly benefits the training of foundation models in fundus image analysis. In addition, these experiments also validate the effectiveness of the proposed KeepFIT V2, which successfully uses only a minimal amount of elite image-text data as a spark to achieve comparable performance to those vision-language pretraining models trained on large-scale private image-text pairs.

V-G Ablation Study

TABLE VII: Ablation Study in CFP Modality across Zero-Shot, Few-Shot, and Linear Probing. For each dataset, the average value of all metrics under each setting is reported. KI refers to knowledge injection; rows correspond to different combinations of Semantic KI, Textual Pretraining, and Appearance KI, with the last row (Ours) denoting the full KeepFIT V2. (%)

                  Zero-Shot                              Few-Shot                               Linear Probing
Setting      REFUGE  ODIR  Retina  AMD   AVG       REFUGE  ODIR  Retina  AMD   AVG       REFUGE  ODIR  APTOS  FIVES  AVG
             89.3    68.2  49.9    74.9  70.6      87.4    73.8  51.9    77.1  72.6      88.3    94.3  83.0   91.9   89.4
             89.4    87.2  57.4    83.8  79.5      87.8    90.6  63.3    84.6  81.6      90.5    95.7  82.4   92.5   90.3
             89.7    91.3  58.7    80.3  80.0      88.7    92.2  66.2    80.3  81.9      91.0    96.3  83.0   92.4   90.7
(Ours)       92.8    87.2  61.1    85.9  81.8      90.1    90.8  64.4    86.0  82.8      91.5    95.6  88.3   95.3   92.7

In this section, we validate the effectiveness of each module in KeepFIT V2. Table VII presents the average score on each downstream dataset under zero-shot, few-shot, and linear probing settings in the CFP modality. Semantic KI and Appearance KI denote semantics-oriented and appearance-oriented expert knowledge extraction, respectively, together with their associated expert knowledge refinements; these components constitute the hybrid image-text knowledge injection. Textual Pretraining refers to the preliminary textual knowledge pretraining in Section IV-B. A primary observation is that removing any module leads to a performance decline of varying degree, indicating that every module contributes to the overall improvement. Notably, comparing the second and the last rows demonstrates the crucial role of the hybrid image-text knowledge injection module in transferring knowledge from MM-Retinal V2 to the categorical public datasets. Although slight declines are observed in a few individual cases, the averaged performance improves across all datasets and settings.
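To make the roles of the two knowledge injection branches concrete, the hybrid objective can be read as a weighted combination of a semantics-oriented contrastive term and an appearance-oriented generative (text-reconstruction) term. The minimal sketch below illustrates this reading; the function name, loss weights, temperature, decoder interface, and padding index are assumptions for exposition and do not reproduce the exact KeepFIT V2 formulation.

```python
# Illustrative sketch of combining a contrastive (global semantics) term with a
# generative caption-reconstruction (local appearance) term. All names, weights,
# and the pad index are assumptions, not the paper's exact objective.
import torch
import torch.nn.functional as F

def hybrid_knowledge_loss(img_emb, txt_emb, caption_logits, caption_tokens,
                          temperature=0.07, alpha=1.0, beta=1.0):
    # Semantics-oriented term: symmetric InfoNCE aligning paired image/text
    # embeddings at the global level.
    img_emb = F.normalize(img_emb, dim=-1)           # (B, D)
    txt_emb = F.normalize(txt_emb, dim=-1)           # (B, D)
    logits = img_emb @ txt_emb.t() / temperature     # (B, B)
    targets = torch.arange(logits.size(0), device=logits.device)
    contrastive = 0.5 * (F.cross_entropy(logits, targets) +
                         F.cross_entropy(logits.t(), targets))

    # Appearance-oriented term: token-level cross-entropy of a text decoder that
    # reconstructs the expert description from visual features, encouraging the
    # model to retain local appearance details.
    generative = F.cross_entropy(
        caption_logits.reshape(-1, caption_logits.size(-1)),   # (B*T, V)
        caption_tokens.reshape(-1),                             # (B*T,)
        ignore_index=0)                                         # assumed pad id

    return alpha * contrastive + beta * generative
```

Under this reading, dropping either term in the ablation corresponds to removing Semantic KI (the contrastive part) or Appearance KI (the generative part) from the combined loss.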

VI Conclusion

In this work, we construct MM-Retinal V2, a high-quality image-text dataset that encompasses the CFP, FFA, and OCT modalities and covers over 96 fundus diseases and abnormalities. Enabled by MM-Retinal V2 and public categorically-labeled datasets, we propose KeepFIT V2, a vision-language foundation model for retinal image analysis. KeepFIT V2 incorporates expert knowledge from MM-Retinal V2 into foundation model pretraining through preliminary textual pretraining and hybrid image-text knowledge injection, which combines high-level semantic features from contrastive learning with low-level appearance features from generative learning. Moreover, KeepFIT V2 offers a novel route to building a retinal foundation model from the elite MM-Retinal V2 spark instead of relying on large-scale private image-text data, while still delivering competitive performance. The proposed knowledge spark spreading pretraining scheme is not only effective for retinal foundation model pretraining but also broadly applicable to other medical foundation models facing the same shortage of image-text data, highlighting its versatility and offering a promising direction for advancing vision-language pretraining across diverse medical imaging domains.

Acknowledgments

This work was supported by the National Natural Science Foundation of China (Nos. 62476054 and 62172228).

References

  • [1] Retina dataset, https://www.kaggle.com/datasets/jr2ngb/cataractdataset/data
  • [2] Aptos 2019 blindness detection (2019), https://www.kaggle.com/c/aptos2019-blindness-detection
  • [3] Alsentzer, E., Murphy, J.R., Boag, W., Weng, W.H., Jin, D., Naumann, T., McDermott, M.: Publicly available clinical bert embeddings. arXiv preprint arXiv:1904.03323 (2019)
  • [4] Boecking, B., Usuyama, N., Bannur, S., Castro, D.C., Schwaighofer, A., Hyland, S., Wetscherek, M., Naumann, T., Nori, A., Alvarez-Valle, J., et al.: Making the most of text semantics to improve biomedical vision–language processing. In: European conference on computer vision. pp. 1–21. Springer (2022)
  • [5] Chen, D., Wu, Z., Liu, F., Yang, Z., Zheng, S., Tan, Y., Zhou, E.: Protoclip: Prototypical contrastive language image pretraining. IEEE Transactions on Neural Networks and Learning Systems (2023)
  • [6] Decencière, E., Zhang, X., Cazuguel, G., Lay, B., Cochener, B., Trone, C., Gain, P., Ordóñez-Varela, J.R., Massin, P., Erginay, A., et al.: Feedback on a publicly distributed image database: the messidor database. Image Analysis & Stereology pp. 231–234 (2014)
  • [7] Du, J., Guo, J., Zhang, W., Yang, S., Liu, H., Li, H., Wang, N.: Ret-clip: A retinal image foundation model pre-trained with clinical diagnostic reports. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 709–719. Springer (2024)
  • [8] Eslami, S., Meinel, C., De Melo, G.: Pubmedclip: How much does clip benefit visual question answering in the medical domain? In: Findings of the Association for Computational Linguistics: EACL 2023. pp. 1181–1193 (2023)
  • [9] Fu, H., Li, F., Orlando, J.I., Bogunović, H., Sun, X., Liao, J., Xu, Y., Zhang, S., Zhang, X.: Adam: Automatic detection challenge on age-related macular degeneration (2020). https://doi.org/10.21227/dt4f-rt59
  • [10] Gao, P., Geng, S., Zhang, R., Ma, T., Fang, R., Zhang, Y., Li, H., Qiao, Y.: Clip-adapter: Better vision-language models with feature adapters. International Journal of Computer Vision 132(2), 581–595 (2024)
  • [11] Gholami, P., Roy, P., Parthasarathy, M.K., Lakshminarayanan, V.: Octid: Optical coherence tomography image database. Computers & Electrical Engineering 81, 106532 (2020)
  • [12] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 770–778 (2016)
  • [13] He, X., Zhou, Y., Wang, B., Cui, S., Shao, L.: Dme-net: Diabetic macular edema grading by auxiliary task learning. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 788–796. Springer (2019)
  • [14] Huang, J.H., Yang, C.H.H., Liu, F., Tian, M., Liu, Y.C., Wu, T.W., Lin, I., Wang, K., Morikawa, H., Chang, H., et al.: Deepopht: medical report generation for retinal images via deep models and visual explanation. In: Proceedings of the IEEE/CVF winter conference on applications of computer vision. pp. 2442–2452 (2021)
  • [15] Huang, Z., Zeng, Z., Huang, Y., Liu, B., Fu, D., Fu, J.: Seeing out of the box: End-to-end pre-training for vision-language representation learning. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 12976–12985 (2021)
  • [16] Jin, K., Huang, X., Zhou, J., Li, Y., Yan, Y., Sun, Y., Zhang, Q., Wang, Y., Ye, J.: Fives: A fundus image dataset for artificial intelligence based vessel segmentation. Scientific data 9(1),  475 (2022)
  • [17] Kulyabin, M., Zhdanov, A., Nikiforova, A., Stepichev, A., Kuznetsova, A., Ronkin, M., Borisov, V., Bogachev, A., Korotkich, S., Constable, P.A., et al.: Octdl: Optical coherence tomography dataset for image-based deep learning methods. Scientific Data 11(1),  365 (2024)
  • [18] Lavoie, S., Kirichenko, P., Ibrahim, M., Assran, M., Wilson, A.G., Courville, A., Ballas, N.: Modeling caption diversity in contrastive vision-language pretraining. In: Forty-first International Conference on Machine Learning (2024)
  • [19] Li, H., Zhu, J., Jiang, X., Zhu, X., Li, H., Yuan, C., Wang, X., Qiao, Y., Wang, X., Wang, W., et al.: Uni-perceiver v2: A generalist model for large-scale vision and vision-language tasks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 2691–2700 (2023)
  • [20] Li, J., Selvaraju, R., Gotmare, A., Joty, S., Xiong, C., Hoi, S.C.H.: Align before fuse: Vision and language representation learning with momentum distillation. Advances in neural information processing systems 34, 9694–9705 (2021)
  • [21] Li, M., Cai, W., Liu, R., Weng, Y., Zhao, X., Wang, C., Chen, X., Liu, Z., Pan, C., Li, M., et al.: Ffa-ir: Towards an explainable and reliable medical report generation benchmark. In: Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2) (2021)
  • [22] Li, Z., Song, D., Yang, Z., Wang, D., Li, F., Zhang, X., Kinahan, P.E., Qiao, Y.: Visionunite: A vision-language foundation model for ophthalmology enhanced with clinical knowledge. arXiv preprint arXiv:2408.02865 (2024)
  • [23] Lu, J., Clark, C., Zellers, R., Mottaghi, R., Kembhavi, A.: Unified-io: A unified model for vision, language, and multi-modal tasks. In: The Eleventh International Conference on Learning Representations (2022)
  • [24] Orlando, J.I., Fu, H., Breda, J.B., Van Keer, K., Bathula, D.R., Diaz-Pinto, A., Fang, R., Heng, P.A., Kim, J., Lee, J., et al.: Refuge challenge: A unified framework for evaluating automated methods for glaucoma assessment from fundus photographs. Medical image analysis 59, 101570 (2020)
  • [25] Pachade, S., Porwal, P., Thulkar, D., Kokare, M., Deshmukh, G., Sahasrabuddhe, V., Giancardo, L., Quellec, G., Mériaudeau, F.: Retinal fundus multi-disease image dataset (rfmid): a dataset for multi-disease detection research. Data 6(2),  14 (2021)
  • [26] Pelka, O., Koitka, S., Rückert, J., Nensa, F., Friedrich, C.M.: Radiology objects in context (roco): a multimodal image dataset. In: Intravascular Imaging and Computer Assisted Stenting and Large-Scale Annotation of Biomedical Data and Expert Label Synthesis: 7th Joint International Workshop, CVII-STENT 2018 and Third International Workshop, LABELS 2018, Held in Conjunction with MICCAI 2018, Granada, Spain, September 16, 2018, Proceedings 3. pp. 180–189. Springer (2018)
  • [27] Peng, Y., Zhu, W., Chen, Z., Wang, M., Geng, L., Yu, K., Zhou, Y., Wang, T., Xiang, D., Chen, F., et al.: Automatic staging for retinopathy of prematurity with deep feature fusion and ordinal classification strategy. IEEE Transactions on Medical Imaging 40(7), 1750–1762 (2021)
  • [28] Porwal, P., Pachade, S., Kokare, M., Deshmukh, G., Son, J., Bae, W., Liu, L., Wang, J., Liu, X., Gao, L., et al.: Idrid: Diabetic retinopathy–segmentation and grading challenge. Medical image analysis 59, 101561 (2020)
  • [29] Qiu, J., Wu, J., Wei, H., Shi, P., Zhang, M., Sun, Y., Li, L., Liu, H., Liu, H., Hou, S., et al.: Visionfm: a multi-modal multi-task vision foundation model for generalist ophthalmic artificial intelligence. arXiv preprint arXiv:2310.04992 (2023)
  • [30] Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International conference on machine learning. pp. 8748–8763. PMLR (2021)
  • [31] Shi, D., Zhang, W., Yang, J., Huang, S., Chen, X., Yusufu, M., Jin, K., Lin, S., Liu, S., Zhang, Q., et al.: Eyeclip: A visual-language foundation model for multi-modal ophthalmic image analysis. arXiv preprint arXiv:2409.06644 (2024)
  • [32] Shi, F., Luo, Z., Ge, Y., Yang, Y., Shan, Y., Wang, L.: Taming scalable visual tokenizer for autoregressive image generation. arXiv preprint arXiv:2412.02692 (2024)
  • [33] Silva-Rodriguez, J., Chakor, H., Kobbi, R., Dolz, J., Ayed, I.B.: A foundation language-image model of the retina (flair): Encoding expert knowledge in text supervision. arXiv preprint arXiv:2308.07898 (2023)
  • [34] Peking University: International competition on ocular disease intelligent recognition (2019), https://odir2019.grand-challenge.org/
  • [35] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017)
  • [36] Wang, H., Xing, Z., Wu, W., Yang, Y., Tang, Q., Zhang, M., Xu, Y., Zhu, L.: Non-invasive to invasive: Enhancing ffa synthesis from cfp with a benchmark dataset and a novel network. In: Proceedings of the 1st International Workshop on Multimedia Computing for Health and Medicine. pp. 7–15 (2024)
  • [37] Wang, M., Lin, T., Lin, A., Yu, K., Peng, Y., Wang, L., Chen, C., Zou, K., Liang, H., Chen, M., et al.: Common and rare fundus diseases identification using vision-language foundation model with knowledge of over 400 diseases. arXiv preprint arXiv:2406.09317 (2024)
  • [38] Wang, M., Tichelaar, J., Pasquale, L.R., Shen, L.Q., Boland, M.V., Wellik, S.R., De Moraes, C.G., Myers, J.S., Ramulu, P., Kwon, M., et al.: Characterization of central visual field loss in end-stage glaucoma by unsupervised artificial intelligence. JAMA ophthalmology 138(2), 190–198 (2020)
  • [39] Wei, H., Liu, B., Zhang, M., Shi, P., Yuan, W.: Visionclip: An med-aigc based ethical language-image foundation model for generalizable retina image analysis. arXiv preprint arXiv:2403.10823 (2024)
  • [40] Wu, R., Zhang, C., Zhang, J., Zhou, Y., Zhou, T., Fu, H.: Mm-retinal: Knowledge-enhanced foundational pretraining with fundus image-text expertise. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 722–732. Springer (2024)
  • [41] Yang, S., Du, J., Guo, J., Zhang, W., Liu, H., Li, H., Wang, N.: Vilref: A chinese vision-language retinal foundation model. arXiv e-prints pp. arXiv–2408 (2024)
  • [42] Zhang, R., Zhang, W., Fang, R., Gao, P., Li, K., Dai, J., Qiao, Y., Li, H.: Tip-adapter: Training-free adaption of clip for few-shot classification. In: European Conference on Computer Vision. pp. 493–510. Springer (2022)
  • [43] Zhang, S., Xu, Y., Usuyama, N., Xu, H., Bagga, J., Tinn, R., Preston, S., Rao, R., Wei, M., Valluri, N., et al.: Biomedclip: a multimodal biomedical foundation model pretrained from fifteen million scientific image-text pairs. arXiv preprint arXiv:2303.00915 (2023)
  • [44] Zhang, W., Chotcomwongse, P., Chen, X., Chung, F.H., Song, F., Zhang, X., He, M., Shi, D., Ruamviboonsuk, P.: Angiographic report generation for the 3rd aptos’s competition: Dataset and baseline methods. medRxiv pp. 2023–11 (2023)
  • [45] Zhang, X., Xiao, Z., Wu, X., Chen, Y., Zhao, J., Hu, Y., Liu, J.: Pyramid pixel context adaption network for medical image classification with supervised contrastive learning. IEEE Transactions on Neural Networks and Learning Systems (2024)
  • [46] Zhou, Y., Chia, M.A., Wagner, S.K., Ayhan, M.S., Williamson, D.J., Struyven, R.R., Liu, T., Xu, M., Lozano, M.G., Woodward-Court, P., et al.: A foundation model for generalizable disease detection from retinal images. Nature 622(7981), 156–163 (2023)