
GCS-M3VLT: Guided Context Self-Attention based Multi-modal Medical Vision Language Transformer for Retinal Image Captioning

Teja Krishna Cherukuri†∗  Nagur Shareef Shaik†∗  Jyostna Devi Bodapati‡  Dong Hye Ye†

†Department of Computer Science, Georgia State University, Atlanta, GA, USA
‡Vignan’s Foundation for Science, Technology & Research University, Guntur, Andhra Pradesh, India
∗Authors contributed equally. Corresponding Author: dongye@gsu.edu
Abstract

Retinal image analysis is crucial for diagnosing and treating eye diseases, yet generating accurate medical reports from images remains challenging due to variability in image quality and pathology, especially with limited labeled data. Previous Transformer-based models struggled to integrate visual and textual information under limited supervision. In response, we propose a novel vision-language model for retinal image captioning that combines visual and textual features through a guided context self-attention mechanism. This approach captures both intricate details and the global clinical context, even in data-scarce scenarios. Extensive experiments on the DeepEyeNet dataset demonstrate a 0.023 BLEU@4 improvement, along with significant qualitative advancements, highlighting the effectiveness of our model in generating comprehensive medical captions.

Index Terms:
Retinal Image Captioning, Vision Language Model, Guided Context Attention, Self-Attention, Transformer.

I Introduction

The rising incidence of retinal diseases like Diabetic Retinopathy (DR) and Diabetic Macular Edema (DME) poses a global health challenge, with DR alone affecting nearly one-third of diabetic individuals and leading to vision-threatening complications in about 10% of cases [1, 2, 3]. Early detection is crucial, yet traditional diagnostic methods using Color Fundus Photography and Optical Coherence Tomography are resource-intensive, relying heavily on ophthalmologists [4]. Automating medical report generation from retinal images offers a potential solution, but remains difficult due to the complexity of retinal pathologies, image variability, and limited annotated data [5]. Current approaches often borrow from general image captioning models [6, 7], yet struggle to meet the high accuracy and interpretability demands of medical applications [8, 9].

Models like DeepOpht [5] and Deep Context-Encoding [10] made early progress in retinal image captioning but struggled due to their simplistic architectures, limiting their ability to capture complex visual-clinical interactions. The integration of attention mechanisms in models like Non-local Attention [11], Contextualized GPT [12], and Expert Transformer [13] enhanced coherence and knowledge integration but still struggled with multi-modal fusion in complex medical scenarios. More recent advancements, such as the Gated Contextual Transformer [14] and M3 Transformer [15], have improved context-based medical image captioning, particularly in multi-modal fusion [16, 17].

Parallel developments in Vision-Language Models (VLMs) have also contributed to this field. The ME Transformer [18] introduced learnable expert tokens for better modality adaptation, yet its high computational cost limits practical deployment. Similarly, VisionGPT [19] and LLaVA [20] offer robust vision-language integration but require significant computational resources. To overcome these limitations, we propose the Guided Context Self-Attention based Multi-modal Medical Vision Language Transformer (GCS-M3VLT), aimed at addressing modality integration and computational efficiency in retinal image report generation.


Figure 1: Architecture of the proposed Guided Context Self-Attention-based Multi-modal Medical Vision Language Transformer (GCS-M3VLT); Vision Encoder – Learns attention-based representations from retinal images to capture visual features crucial for diagnosis; Language Encoder – Learns self-attention-based clinical-context embeddings from diagnostic keywords, enabling the model to understand the semantic context of medical terms; Vision-Language TransFusion Encoder – Integrates visual attention features and clinical-context embeddings, leveraging both visual and semantic information to provide comprehensive understanding; Language Generation Decoder – Generates coherent and meaningful medical descriptions by attending to relevant visual and semantic cues, ensuring contextually appropriate and diagnostically relevant outputs.

II Methodology

This research aims to derive clinical context from multi-modal retinal images and diagnostic keywords to generate accurate medical captions. We introduce the Guided Context Self-Attention based Multi-modal Medical Vision Language Transformer (GCS-M3VLT), which fuses visual and textual data for comprehensive medical descriptions. Figure 1 illustrates the model architecture, with technical details provided in the following subsections.

II-A Vision Encoder

The Vision Encoder converts pre-processed retinal images into visual features using a Convolutional base and a Guided Context Attention mechanism. The Convolutional base, employing EfficientNetV2B0 [21], extracts initial spatial features from the images. Given a retinal scan image $X_{R}$ of dimensions $(356\times 356\times 3)$, the Convolutional base produces visual features $F_{R}$ of dimensions $(12\times 12\times 1280)$.
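As a concrete illustration, the snippet below sketches this convolutional base in TensorFlow/Keras; the use of ImageNet weights and the exact wiring are illustrative assumptions rather than the authors' released implementation.

```python
import tensorflow as tf


def build_conv_base(input_shape=(356, 356, 3)):
    """Minimal sketch of the convolutional base: EfficientNetV2B0 without its
    classification head, mapping a retinal scan X_R to spatial features F_R."""
    backbone = tf.keras.applications.EfficientNetV2B0(
        include_top=False, weights="imagenet", input_shape=input_shape)
    x_r = tf.keras.Input(shape=input_shape)      # retinal scan X_R: (356, 356, 3)
    f_r = backbone(x_r)                          # visual features F_R: (12, 12, 1280)
    return tf.keras.Model(x_r, f_r, name="vision_conv_base")
```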

II-A1 Guided Context Attention

The Guided Context Attention (GCA) block enhances retinal image analysis by integrating spatial and channel contexts. This approach is essential for capturing lesion-specific details that traditional attention mechanisms may miss [22]. We start with spatial features from the convolutional base, formulated as Context Query $(Q_{c})$, Context Key $(K_{c})$, and Context Value $(V_{c})$. The GCA block refines these features through two main stages: spatial and channel context formulation.

Spatial Context

Spatial context is computed using global attention pooling:

F_{s}=\sum\limits_{j=1}^{d}\frac{e^{W_{r}f_{r,j}}}{\sum\limits_{m=1}^{d}e^{W_{r}f_{r,m}}}\qquad\forall f_{r,j}\in F_{R} (1)

where $F_{s}$ represents the spatial context features, and $W_{r}$ are the point-wise convolution parameters. This operation normalizes the spatial features, emphasizing essential global information.

Channel Context

Channel context is incorporated as follows:

F_{c}=F_{R}\oplus\sum\limits_{i=1}^{d}W_{2}\left(\text{LN}\left(\Gamma\left(\sum\limits_{j=1}^{k}W_{1}f_{s_{j}}\right)\right)\right)_{i}\qquad\forall f_{s_{j}}\in F_{s} (2)

Here, $F_{c}$ combines the spatial context with channel-wise features, where $W_{1}$ and $W_{2}$ are point-wise convolutions, $\Gamma$ denotes ReLU, and LN stands for Layer Normalization.

Attention Coefficients

The final attention-weighted features are obtained through:

F_{gca}=\sigma(W_{\psi}\cdot\Gamma(W_{qc}Q_{c}+W_{kc}K_{c}+b_{qk})+b_{\psi})\cdot V_{c} (3)

where $F_{gca}$ are the refined visual features, with $W_{qc}$, $W_{kc}$, $b_{qk}$, $W_{\psi}$, and $b_{\psi}$ as the gating parameters. This formulation combines $Q_{c}$ and $K_{c}$ to compute attention coefficients, which are applied to $V_{c}$ to focus on relevant features.



Figure 2: Architecture of Guided Context Attention that utilizes context features to compute lesion contextual attention representations; Spatial Context Formulation – selectively focuses on relevant spatial features in the initial representations and computes global context information; Channel Context Formulation – processes the computed context information and captures channel-wise correlations.
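The following Keras layer gives a hedged sketch of one plausible reading of the GCA block in Figure 2 and Eqs. 1-3; the bottleneck reduction ratio, the use of 1×1 convolutions for every learned weight, and the choice of the context-enriched features $F_{c}$ as $Q_{c}$, $K_{c}$, and $V_{c}$ are illustrative assumptions, not details confirmed by the paper.

```python
import tensorflow as tf
from tensorflow.keras import layers


class GuidedContextAttention(layers.Layer):
    """Sketch of Guided Context Attention: spatial context pooling (Eq. 1),
    channel context bottleneck (Eq. 2), additive attention gating (Eq. 3)."""

    def __init__(self, channels=1280, reduction=8, **kwargs):
        super().__init__(**kwargs)
        self.w_r = layers.Conv2D(1, 1)                      # W_r: point-wise conv for pooling logits
        self.w_1 = layers.Conv2D(channels // reduction, 1)  # W_1: channel bottleneck (assumed ratio)
        self.w_2 = layers.Conv2D(channels, 1)               # W_2: channel expansion
        self.ln = layers.LayerNormalization()
        self.w_qc = layers.Conv2D(channels, 1)              # W_qc
        self.w_kc = layers.Conv2D(channels, 1)              # W_kc
        self.w_psi = layers.Conv2D(1, 1)                    # W_psi

    def call(self, f_r):                                    # f_r: (B, 12, 12, 1280)
        b = tf.shape(f_r)[0]
        c = f_r.shape[-1]
        # Spatial context (Eq. 1): softmax-weighted global pooling over the spatial positions.
        logits = tf.reshape(self.w_r(f_r), (b, -1, 1))
        weights = tf.nn.softmax(logits, axis=1)
        flat = tf.reshape(f_r, (b, -1, c))
        f_s = tf.reshape(tf.reduce_sum(weights * flat, axis=1), (b, 1, 1, c))
        # Channel context (Eq. 2): bottleneck transform of F_s fused back into F_R.
        f_c = f_r + self.w_2(self.ln(tf.nn.relu(self.w_1(f_s))))
        # Attention coefficients (Eq. 3): additive gating; here Q_c = K_c = V_c = F_c (assumption).
        gate = tf.sigmoid(self.w_psi(tf.nn.relu(self.w_qc(f_c) + self.w_kc(f_c))))
        return gate * f_c                                   # F_gca
```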

II-B Language Encoder

The language encoder transforms processed diagnostic keywords into attention embeddings. It employs an Embedding layer to convert the processed diagnostic keyword sequence into embeddings, denoted as $KE$, encapsulating individual keyword embeddings $ke_{1},ke_{2},...,ke_{n}$. The encoder captures the semantic meaning of the input keywords, but the generated embeddings initially lack explicit relationships between keywords, making them context-free. To overcome this limitation, the embeddings undergo further processing by a Multi-Head Attention (MHA) layer. This layer employs a scaled dot-product attention mechanism, wherein three matrices are computed: Queries $(Q=KE)$, Keys $(K=KE)$, and Values $(V=KE)$. These matrices are subsequently processed to obtain context-enriched keyword embeddings $(KE_{c})$ that encapsulate meaningful relationships between the keywords.

\text{H}_{i}=\text{Self-Attention}(KE\cdot W_{q_{i}},KE\cdot W_{k_{i}},KE\cdot W_{v_{i}}) (4)
\text{Self-Attention}(Q,K,V)=\text{Softmax}\left(\frac{QK^{T}}{\sqrt{d_{k}}}\right)V (5)
K_{att}=\text{MHA}(KE)=\text{Concat}(\text{H}_{1},\text{H}_{2},...,\text{H}_{h})W_{o} (6)

Equations 4 to 6 show the computation of attention for each head in the MHA mechanism, with Equation 5 detailing the Self-Attention scores and Equation 6 defining the concatenation of attention outputs from all heads. In these operations, $KE$ represents the input embeddings; $W_{q_{i}}$, $W_{k_{i}}$, and $W_{v_{i}}$ are the weight matrices associated with the keyword embeddings for $Q$, $K$, and $V$ in the $i$-th head, respectively; $d_{k}$ is the dimension of $K$, whose square root scales the attention scores; $W_{o}$ is the output weight matrix; and $h$ denotes the number of heads. The final embeddings are derived from the fusion of the input embeddings and the context embeddings. This fusion is achieved by element-wise addition and normalization, as shown in Equation 7, where $\oplus$ denotes the element-wise addition operation.

KE_{final}=\text{LN}\left(KE\oplus\text{MHA}(KE)\right) (7)
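A compact sketch of the language encoder described by Eqs. 4-7 is shown below; the vocabulary size (5,000) and embedding width (1,024) follow Section III-A, while the number of heads and maximum keyword length are assumed values.

```python
import tensorflow as tf
from tensorflow.keras import layers


def build_language_encoder(vocab_size=5000, embed_dim=1024, num_heads=8, max_keywords=50):
    """Keyword embeddings (KE) -> multi-head self-attention (Eqs. 4-6) -> residual + LN (Eq. 7)."""
    keywords = tf.keras.Input(shape=(max_keywords,), dtype=tf.int32)
    ke = layers.Embedding(vocab_size, embed_dim)(keywords)                 # KE
    k_att = layers.MultiHeadAttention(
        num_heads=num_heads, key_dim=embed_dim // num_heads)(ke, ke, ke)   # MHA(KE)
    ke_final = layers.LayerNormalization()(ke + k_att)                     # Eq. 7: LN(KE ⊕ MHA(KE))
    return tf.keras.Model(keywords, ke_final, name="language_encoder")
```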

II-C Vision-Language TransFusion Encoder

The Vision and Language encoders generate intermediate features within their domains, and the TransFusion Encoder (TFE) attentively fuses the keyword context embeddings $(KE_{final})$ with the visual representations $(F_{gca})$ for comprehensive clinical understanding. This fusion enables the cross-modality interactions necessary for accurate caption generation. Within the TFE, we implement a Vision-Language Cross-Attention (VLA) mechanism using Multi-Head Attention. The VLA selectively incorporates features from both domains, ensuring that critical clinical context is preserved and emphasized. This mechanism dynamically adjusts the contributions of visual and textual features, assigning higher weights to informative elements while suppressing irrelevant or redundant information, thus enhancing the relevance of generated captions. The VLA mechanism is formalized in Equation 8:

Z=\text{MHA}(F_{gca},KE_{final},KE_{final}) (8)
F^{\prime}=\text{LN}\left(W_{r}\,\Gamma\left(W_{h}\,\text{LN}(Q+Z)\right)+\text{LN}(Q+Z)\right) (9)

Subsequent operations, including Residual Addition, Layer Normalization (LN), and a Feed-Forward Neural Network (FFNN) with ReLU activation $(\Gamma)$ and parameters $W_{h}$ and $W_{r}$, refine the attention representations, resulting in $F^{\prime}$.
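The sketch below illustrates one way the TransFusion Encoder of Eqs. 8-9 could be realized: flattened visual tokens from $F_{gca}$ act as queries over the keyword context embeddings, followed by residual addition, layer normalization, and a ReLU feed-forward block. The projection to a common width, the head count, and the feed-forward size are assumptions made for illustration.

```python
import tensorflow as tf
from tensorflow.keras import layers


class TransFusionEncoder(layers.Layer):
    """Vision-Language cross-attention (Eq. 8) followed by residual, LN, and FFNN (Eq. 9)."""

    def __init__(self, dim=1024, num_heads=8, ff_dim=2048, **kwargs):
        super().__init__(**kwargs)
        self.proj = layers.Dense(dim)            # project visual tokens to the text embedding width
        self.vla = layers.MultiHeadAttention(num_heads=num_heads, key_dim=dim // num_heads)
        self.ln1 = layers.LayerNormalization()
        self.ln2 = layers.LayerNormalization()
        self.w_h = layers.Dense(ff_dim, activation="relu")   # W_h with ReLU (Γ)
        self.w_r = layers.Dense(dim)                         # W_r

    def call(self, f_gca, ke_final):
        # Flatten the (12, 12, C) attention map into a sequence of visual tokens (the query Q).
        b = tf.shape(f_gca)[0]
        q = self.proj(tf.reshape(f_gca, (b, -1, f_gca.shape[-1])))
        z = self.vla(q, ke_final, ke_final)                  # Eq. 8: MHA(F_gca, KE_final, KE_final)
        h = self.ln1(q + z)                                  # LN(Q + Z)
        return self.ln2(self.w_r(self.w_h(h)) + h)           # Eq. 9 -> F'
```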

II-D Language Generation Decoder

The final component of GCS-M3VLT is the Language Generation Decoder, which translates the fused multi-modal features into clinically meaningful textual descriptions. This decoder is based on a Transformer (GPT-2) [23, 24] that incorporates both the original visual representations and the integrated Vision-Language features generated by the TransFusion Encoder. By leveraging a sequence of multi-head attention layers, the decoder constructs a contextualized representation of the input modalities, which is then used to predict the next word in the sequence until the entire caption is generated.
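For clarity, a minimal greedy-decoding loop is sketched below, assuming a hypothetical `decoder(tokens, fused_features)` callable that returns next-token logits; it abstracts away the GPT-2-style decoder stack itself and is not the authors' exact decoding procedure.

```python
import tensorflow as tf


def greedy_decode(decoder, fused_features, start_id, end_id, max_len=50):
    """Greedy caption generation; `decoder` is an assumed Transformer decoder
    returning logits of shape (1, t, vocab) for the tokens generated so far."""
    tokens = tf.constant([[start_id]], dtype=tf.int32)
    for _ in range(max_len):
        logits = decoder(tokens, fused_features)
        next_id = tf.argmax(logits[:, -1, :], axis=-1, output_type=tf.int32)  # most likely next word
        tokens = tf.concat([tokens, next_id[:, None]], axis=1)
        if int(next_id[0]) == end_id:                                         # stop at end-of-caption token
            break
    return tokens
```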

III Experiments


Figure 3: Comparison of Actual and Predicted Captions obtained using the proposed GCS-M3VLT with Existing Works [14, 11] using Retinal Image and Keywords as input. The captions generated by GCS-M3VLT closely resemble the ground truth captions, showcasing the model’s ability to accurately describe the clinical context of retinal images. In contrast, captions from existing works exhibit discrepancies and lack coherence compared to ground truth, underscoring the superior performance of the proposed approach.

III-A Dataset & Implementation Details

To evaluate GCS-M3VLT, we use the DeepEyeNet dataset [5], comprising 15,710 retinal images across various modalities: Fluorescein Angiography, Fundus Photography, Optical Coherence Tomography, and multi-modality grids. The images are annotated by expert ophthalmologists with 609 unique diagnostic keywords and clinical descriptions averaging 5-10 words. Covering 265 retinal diseases, from common to rare cases, the dataset is split into training (60%), validation (20%), and testing (20%) subsets, offering a robust foundation for high-precision diagnostic model development. All images were resized to $(356\times 356)$ pixels with three channels, and text was cleaned and standardized, with keywords limited to 5-50 words, captions to 50 words, and rare words replaced by <UNK>. The vocabulary was capped at 5,000 tokens, with 1024-dimensional embeddings. The model was trained for 100 epochs (loss: cross-entropy, optimizer: Adam, batch size: 64, learning rate: 0.0001) on an NVIDIA P100 GPU. Performance was evaluated using BLEU, CIDEr, and ROUGE metrics.
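The reported training configuration, written out as a hedged Keras sketch; `model`, `train_ds`, and `val_ds` are placeholders for the assembled GCS-M3VLT and for tf.data pipelines assumed to be pre-batched at 64 image-keyword-caption samples.

```python
import tensorflow as tf


def compile_and_train(model: tf.keras.Model, train_ds, val_ds):
    """Apply the reported setup: Adam with lr 1e-4, token-level cross-entropy, 100 epochs."""
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))
    return model.fit(train_ds, validation_data=val_ds, epochs=100)
```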

III-B Quantitative Evaluation

Table I highlights the impressive performance of our GCS-M3VLT, which surpasses state-of-the-art Vision-Language Models (VLMs) across all evaluation metrics. Our model achieves the highest BLEU scores, outshining leading models such as DeepOpht, Contextualized Keywords, and VisionGPT. It also excels in CIDEr and ROUGE, showcasing its superior ability to generate precise and contextually relevant medical captions compared to VLMs like Gated Contextual Attention Net and M3 Transformer. Additionally, the GCS-M3VLT’s lightweight design, with fewer parameters, contrasts favorably with heavier models like Expert Transformer, LlaVA-Med, and VisionGPT. Its Guided Context Self-Attention mechanism proves to be more effective than Non-local and Gated Contextual Attention, emphasizing its enhanced vision-language integration.

TABLE I: Comparative Study of Recent Best Models with Proposed Guided Context Self-Attention based Multi-modal Medical Vision Language Transformer trained on DeepEyeNet Dataset
Model B@1 B@2 B@3 B@4 CIDEr ROUGE
DeepOpht [5] 0.184 0.114 0.068 0.032 0.361 0.232
Deep Context Encoding [10] 0.219 0.134 0.074 0.035 0.398 0.252
Contextualized GPT [12] 0.203 0.142 0.100 0.073 0.389 0.211
Non-local Attention [11] 0.230 0.150 0.094 0.053 0.370 0.291
Gated Contextual Transformer [14] 0.297 0.230 0.214 0.142 0.462 0.391
VisionGPT [19] 0.353 0.280 0.261 0.182 0.491 0.412
Expert Transformer [13] 0.382 0.291 0.237 0.186 0.472 0.413
LlaVA-Med [25] 0.386 0.305 0.282 0.196 0.482 0.427
M3 Transformer [15] 0.394 0.312 0.291 0.208 0.537 0.429
GCS-M3VLT (Ours) 0.430 0.345 0.319 0.231 0.559 0.497

III-C Qualitative Evaluation


Figure 4: Visualizing heatmaps and gradient class activation maps obtained using the Vision Encoder with Guided Context Attention features for (a) OCT, (b & c) fundus + AF, and (d) OCT + fundus + AF retinal images, highlighting lesion context information.

In Figure 4, we illustrate the input images, heatmaps, and overlays of attention maps, visualized through GradCAM [26], when the input images consist of a grid of multiple scans with the same or different modalities. For instance, we present scenarios where the input includes OCT scans, fundus images, and AF images, either individually or in combination. Our model robustly captures clinical context and lesion-specific information, even when the input comprises a grid of multiple images.

Figure 3 illustrates that our GCS-M3VLT model provides accurate and contextually rich medical captions for retinal images and keywords. For various retinal conditions, including suprachoroidal hemorrhage, subretinal neovascularization (SRNV), and Goldmann-Favre syndrome, our model’s predictions closely match actual clinical descriptions, effectively capturing key diagnostic details and clinical context. Compared to existing methods [14, 11], GCS-M3VLT consistently demonstrates superior accuracy in reflecting relevant retinal pathologies and patient details, highlighting its potential for enhancing clinical image interpretation and diagnosis.

IV Conclusion

The proposed Guided Context Self-Attention based Multi-modal Medical Vision Language Transformer (GCS-M3VLT) advances retinal image captioning by integrating guided self-attention and multimodal fusion. This approach effectively captures contextual information from both retinal images and diagnostic keywords, enhancing the accuracy and coherence of clinical descriptions. Qualitative results demonstrate the model’s ability to localize lesion-specific features and adapt dynamically to various retinal abnormalities. The attention mechanisms improve focus on relevant regions, leading to more precise and relevant captions. Future work could explore aligning multi-modal embeddings in latent space while addressing missing keywords using zero-shot learning.

References

  • [1] Thomas A Ciulla, Armando G Amador and Bernard Zinman “Diabetic retinopathy and diabetic macular edema: pathophysiology, screening, and novel therapies” In Diabetes care 26.9 Am Diabetes Assoc, 2003, pp. 2653–2664
  • [2] Nagur Shareef Shaik and Teja Krishna Cherukuri “Hinge attention network: A joint model for diabetic retinopathy severity grading” In Applied Intelligence 52.13 Springer, 2022, pp. 15105–15121
  • [3] Alfred Sommer et al. “Challenges of ophthalmic care in the developing world” In JAMA ophthalmology 132.5 American Medical Association, 2014, pp. 640–644
  • [4] Louis Pizzarello et al. “VISION 2020: The Right to Sight: a global initiative to eliminate avoidable blindness” In Archives of ophthalmology 122.4 American Medical Association, 2004, pp. 615–620
  • [5] Jia-Hong Huang et al. “Deepopht: medical report generation for retinal images via deep models and visual explanation” In Proceedings of the IEEE/CVF winter conference on applications of computer vision, 2021, pp. 2442–2452
  • [6] Kelvin Xu et al. “Show, attend and tell: Neural image caption generation with visual attention” In International conference on machine learning, 2015, pp. 2048–2057 PMLR
  • [7] Quanzeng You et al. “Image captioning with semantic attention” In Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 4651–4659
  • [8] Zhenyu Zhang et al. “Sam-guided enhanced fine-grained encoding with mixed semantic learning for medical image captioning” In ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2024, pp. 1731–1735 IEEE
  • [9] Zhongzhen Huang, Xiaofan Zhang and Shaoting Zhang “Kiut: Knowledge-injected u-transformer for radiology report generation” In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 19809–19818
  • [10] Jia-Hong Huang, Ting-Wei Wu, Chao-Han Huck Yang and Marcel Worring “Deep context-encoding network for retinal image captioning” In 2021 IEEE International Conference on Image Processing (ICIP), 2021, pp. 3762–3766 IEEE
  • [11] Jia-Hong Huang et al. “Non-local attention improves description generation for retinal images” In Proceedings of the IEEE/CVF winter conference on applications of computer vision, 2022, pp. 1606–1615
  • [12] Jia-Hong Huang, Ting-Wei Wu and Marcel Worring “Contextualized keyword representations for multi-modal retinal image captioning” In Proceedings of the 2021 International Conference on Multimedia Retrieval, 2021, pp. 645–652
  • [13] Ting-Wei Wu, Jia-Hong Huang, Joseph Lin and Marcel Worring “Expert-defined Keywords Improve Interpretability of Retinal Image Captioning” In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2023, pp. 1859–1868
  • [14] Nagur Shareef Shaik and Teja Krishna Cherukuri “Gated contextual transformer network for multi-modal retinal image clinical description generation” In Image and Vision Computing Elsevier, 2024, pp. 104946
  • [15] Nagur Shareef Shaik, Teja Krishna Cherukuri and Dong Hye Ye “M3T: Multi-Modal Medical Transformer To Bridge Clinical Context With Visual Insights For Retinal Image Medical Description Generation” In 2024 IEEE International Conference on Image Processing (ICIP), 2024, pp. 3037–3043 DOI: 10.1109/ICIP51287.2024.10647584
  • [16] Yiming Cao et al. “MMTN: multi-modal memory transformer network for image-report consistent medical report generation” In Proceedings of the AAAI Conference on Artificial Intelligence 37-1, 2023, pp. 277–285
  • [17] Ruiqi Wu et al. “MM-Retinal: Knowledge-Enhanced Foundational Pretraining with Fundus Image-Text Expertise” In arXiv preprint arXiv:2405.11793, 2024
  • [18] Zhanyu Wang, Lingqiao Liu, Lei Wang and Luping Zhou “Metransformer: Radiology report generation by transformer with multiple learnable expert tokens” In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 11558–11567
  • [19] Chris Kelly et al. “Visiongpt: Vision-language understanding agent using generalized multimodal framework” In arXiv preprint arXiv:2403.09027, 2024
  • [20] Haotian Liu, Chunyuan Li, Qingyang Wu and Yong Jae Lee “Visual instruction tuning” In Advances in neural information processing systems 36, 2024
  • [21] Mingxing Tan and Quoc Le “Efficientnetv2: Smaller models and faster training” In International conference on machine learning, 2021, pp. 10096–10106 PMLR
  • [22] Teja Krishna Cherukuri, Nagur Shareef Shaik and Dong Hye Ye “Guided Context Gating: Learning To Leverage Salient Lesions in Retinal Fundus Images” In 2024 IEEE International Conference on Image Processing (ICIP), 2024, pp. 3098–3104 DOI: 10.1109/ICIP51287.2024.10647604
  • [23] Ashish Vaswani et al. “Attention is all you need” In Advances in Neural Information Processing Systems, 2017
  • [24] Alec Radford et al. “Language models are unsupervised multitask learners” In OpenAI blog 1.8, 2019, pp. 9
  • [25] Chunyuan Li et al. “Llava-med: Training a large language-and-vision assistant for biomedicine in one day” In Advances in Neural Information Processing Systems 36, 2024
  • [26] Ramprasaath R Selvaraju et al. “Grad-cam: Visual explanations from deep networks via gradient-based localization” In Proceedings of the IEEE international conference on computer vision, 2017, pp. 618–626