LaTeX Author Guidelines for CVPR Proceedings
Abstract
Diagnosing diseases through histopathology whole slide images (WSIs) is fundamental in modern pathology but is challenged by the gigapixel scale and complexity of WSIs. Trained histopathologists overcome this challenge by navigating the WSI, looking for relevant patches, taking notes, and compiling them to produce a final holistic diagnosis. Traditional AI approaches, such as multiple instance learning and transformer-based models, fall short of such a holistic, iterative, multi-scale diagnostic procedure, limiting their adoption in the real world. We introduce PathFinder, a multi-modal, multi-agent framework that emulates the decision-making process of expert pathologists. PathFinder integrates four AI agents (the Triage Agent, Navigation Agent, Description Agent, and Diagnosis Agent) that collaboratively navigate WSIs, gather evidence, and provide comprehensive diagnoses with natural language explanations. The Triage Agent classifies the WSI as benign or risky; if risky, the Navigation and Description Agents iteratively focus on significant regions, generating importance maps and descriptive insights of sampled patches. Finally, the Diagnosis Agent synthesizes the findings to determine the patient’s diagnostic classification. Our experiments show that PathFinder outperforms state-of-the-art methods in skin melanoma diagnosis by 8 percentage points while offering inherent explainability through natural language descriptions of diagnostically relevant patches. Qualitative analysis by pathologists shows that the Description Agent’s outputs are of high quality and comparable to those of GPT-4o. PathFinder is also the first AI-based system to surpass the average performance of pathologists on this challenging melanoma classification task, by 9 percentage points, setting a new record for efficient, accurate, and interpretable AI-assisted diagnostics in pathology. Data, demo, code and models are available at https://pathfinder-dx.github.io/.
1 Introduction
Diagnosing disease by examining histopathology whole slide images (WSIs) is a cornerstone of modern pathology. WSIs are high-resolution digital scans of histopathology slides, providing an extensive view of tissue architecture and cellular detail. Pathologists navigate these gigapixel-scale images to identify morphological features and spatial relationships critical for accurate diagnoses. They start at low magnification to identify suspicious regions and then zoom into image patches for detailed examination [ghezloo2022analysis, liu2024semantics]. They gather evidence across patches and combine it to make a final holistic diagnosis. This process is the gold standard, but it is labor-intensive and requires significant expertise to interpret complex visual information effectively. It is also becoming increasingly unsustainable as the number of cancer cases rises globally.
More efficient diagnostic methods are essential in medical imaging, but they must not sacrifice accuracy. Recent deep learning methods report expert-level performance and promise such a scalable approach [topol2019high]. However, current methods typically divide WSIs into smaller patches for independent analysis, making diagnoses without holistic context [li2021dual, yang2024foundation, zhou2024pathm3, seyfioglu2024quilt, sun2024pathgen, ahmed2024pathalign, xu2024whole, ikezogwo2023quilt, gu2023augmenting]. Transformer-based models attempt to capture both local and global patterns but do not scale to the high-resolution demands of WSIs [shao2021transmil, wu2021scale, guo2023higt, zheng2022graph, chen2022scaling].
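To make the patch-wise setting concrete, the following minimal sketch tiles a slide into fixed-size patches and aggregates independent patch scores with max pooling, one common multiple-instance-learning choice. It is an illustration only, not the method of any cited baseline: the function names, the patch size, and the use of max pooling are assumptions introduced here.

```python
from typing import Iterator, List, Tuple


def tile_coordinates(
    width: int, height: int, patch: int, stride: int
) -> Iterator[Tuple[int, int]]:
    """Yield top-left (x, y) corners of patches that fit fully inside the slide."""
    for y in range(0, height - patch + 1, stride):
        for x in range(0, width - patch + 1, stride):
            yield (x, y)


def max_pool_slide_score(patch_scores: List[float]) -> float:
    """Max-pooling aggregation: the slide score is its most suspicious patch.

    Each patch is scored independently, so spatial context between
    patches is ignored (the limitation discussed in the text).
    """
    if not patch_scores:
        raise ValueError("at least one patch score is required")
    return max(patch_scores)
```

For a hypothetical 100,000 x 100,000 pixel slide with non-overlapping 256-pixel patches, `tile_coordinates` yields on the order of 150,000 patches, which suggests why exhaustive patch-level processing is costly and why attending jointly over all patches is hard to scale.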
In contrast, we propose PathFinder, a multi-modal and multi-agent system designed to mimic the decision-making process of expert pathologists by integrating four AI agents: Triage, Navigation, Description, and Diagnosis. The system begins with the Triage Agent, which classifies the WSI as benign or risky; if risky, the Navigation and Description Agents iteratively examine patches, generating natural language descriptions and refining their focus with each cycle. Finally, the Diagnosis Agent integrates these detailed insights to produce an accurate and holistic diagnostic classification. Figure LABEL:fig:nav-pipeline-simple gives an overview of PathFinder’s pipeline.
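The control flow described above can be sketched as follows. This is a simplified illustration under stated assumptions, not PathFinder's implementation: the four agents are passed in as plain callables, and the `Evidence` container, the one-patch-per-cycle loop, the `max_steps` budget, and the immediate "benign" return are all hypothetical choices made for this sketch.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List, Optional, Tuple

Patch = Tuple[int, int]  # hypothetical: (x, y) location of a sampled patch


@dataclass
class Evidence:
    """Running notes gathered during navigation (hypothetical container)."""
    patches: List[Patch] = field(default_factory=list)
    descriptions: List[str] = field(default_factory=list)


def run_pipeline(
    wsi: Any,
    triage: Callable[[Any], bool],                          # True means "risky"
    navigate: Callable[[Any, Evidence], Optional[Patch]],   # next patch, or None to stop
    describe: Callable[[Any, Patch], str],                  # text description of a patch
    diagnose: Callable[[Evidence], str],                    # final label from all notes
    max_steps: int = 5,                                     # assumed navigation budget
) -> Tuple[str, Evidence]:
    """Triage first; if risky, alternate navigation and description, then diagnose."""
    evidence = Evidence()
    if not triage(wsi):
        # Assumption: a slide triaged as benign is reported without navigation.
        return "benign", evidence
    for _ in range(max_steps):
        # Navigation sees the evidence so far, so its focus can change each cycle.
        patch = navigate(wsi, evidence)
        if patch is None:
            break
        evidence.patches.append(patch)
        evidence.descriptions.append(describe(wsi, patch))
    return diagnose(evidence), evidence
```

Passing the accumulated evidence back to the navigation step is what makes the loop iterative rather than a single sampling pass, and returning the evidence alongside the label keeps the visited patches and their descriptions available as an explanation.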
Our experiments demonstrate that the proposed agentic system significantly outperforms prior state-of-the-art (SOTA) methods on WSI skin melanoma grading, achieving an accuracy of 74% on the M-Path Skin Biopsy dataset [elmore2017pathologists]. This is an improvement of 8 percentage points over the best baseline (66% accuracy) and of 9 percentage points over the 65% average performance of pathologists [elmore2017pathologists]. Our proposed system is also fully explainable, from the patches visited to the descriptions of those patches and the final diagnosis, which takes all of the patch-wise information into consideration. To the best of our knowledge, PathFinder is the first AI-based system capable of surpassing the average performance of pathologists on this challenging melanoma classification task.
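As a small worked check of the accuracies quoted above, the sketch below separates the absolute gain (in percentage points) from the relative gain. The helper name is hypothetical and the relative figures are derived here from the quoted accuracies; they are not reported in the text.

```python
from typing import Tuple


def accuracy_gains(ours_pct: int, reference_pct: int) -> Tuple[int, float]:
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute = ours_pct - reference_pct
    relative = 100.0 * absolute / reference_pct
    return absolute, relative


# Accuracies taken from the text: 74% (PathFinder), 66% (best baseline),
# 65% (average pathologist).
vs_baseline = accuracy_gains(74, 66)
vs_pathologists = accuracy_gains(74, 65)
```

Under this reading, the 8% and 9% figures are absolute differences; the corresponding relative gains are roughly 12% and 14%.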

2 Related Work
Multi-modal Histopathology Models. A series of studies in histopathology use WSI-level and patch-level images to train unimodal classifiers based on multiple instance learning with pretrained feature extractors [shao2021transmil, scatnet, hatnet]. More recently, unimodal foundation models trained with various self-supervised objectives have achieved significant improvements in downstream performance [xu2024whole, virchow, ikezogwo2022multi, uni]. The introduction of large-scale multi-modal datasets in histopathology has brought further advances, with the emergence of large language models and vision-language models for histopathology. For instance, Quilt-1M [ikezogwo2023quilt] and PathGen-1.6M [sun2024pathgen] curate large histopathology image-text paired datasets and train CLIP-based models to learn joint vision-language representations, significantly improving patch-level clinical histology downstream tasks. At the WSI level, PathAlign [ahmed2024pathalign] aligns diagnostic texts from pathology reports with the corresponding WSIs, facilitating applications such as automatic report generation and case/patient-level visual question answering, and moving towards a more clinically integrated and holistic diagnostic process. Other studies, such as Quilt-LLaVA [seyfioglu2024quilt], SlideChat [chen2024slidechat], and PathChat [pathchat], train histopathology multi-modal large language models (MLLMs) and improve on diagnostic reasoning tasks, but none of these models can effectively and automatically navigate gigapixel-scale WSIs towards a diagnosis.
The Role of Navigation in Histopathology Diagnosis. Computational pathology studies have captured and analyzed the navigation patterns of pathologists reviewing digital slide images [roa2010experimental, mercan2018characterizing, molin2015slide, ghezloo2022analysis], specifically characterizing mouse movements, zooming in and out, and panning of the field of view (FOV) as pathologists piece together morphological clues towards a diagnosis. These studies often contrast the navigation patterns of junior and senior pathologists. NaviPath [gu2023augmenting] presents a human-AI collaborative navigation system designed to integrate seamlessly into pathologists’ workflows, taking into account the domain knowledge and navigation strategies required for effective examination of pathology scans.
Multi-agent Systems. The concept of multi-agent systems has gained traction in AI research, particularly for tasks requiring dynamic behavior and contextual understanding. Recent research has demonstrated the potential of large foundation models for building interactive agent-based AI systems, including interactions between robots, environments, and humans in robotics [durante2024interactive, han2024llm, wu2023autogen]. These systems can perform complex tasks by combining the strengths of individual agents through collaboration and coordination. The potential of multi-agent systems in real-world scenarios has been demonstrated in recent studies on, among others, role-playing [li2303camel], reasoning [du2023improving], gaming [huang2024far] and software engineering [he2024llm]. In the medical domain, some studies have explored role-playing providers (clinicians) who treat patients and accumulate proficiency with increasing interactions [doctorsimul, medagentsbench]. These studies are centered around multiple providers; in medical image analysis, however, multi-agent systems can instead simulate the collaborative nature of the different sub-tasks within the human expert's decision-making process, including region navigation, understanding, and holistic diagnosis.