
GRASP: GRAph-Structured Pyramidal Whole Slide Image Representation

Ali Khajegili Mirabadi1\faEnvelope, Graham Archibald1, Amirali Darbandsari1, Alberto
Contreras-Sanz1,2, Ramin Ebrahim Nakhli1, Maryam Asadi1, Allen Zhang1,
C. Blake Gilks2, Peter Black2, Gang Wang3, Hossein Farahani1, Ali Bashashati1\faEnvelope
1School of Biomedical Engineering & Department of Pathology and Laboratory Medicine,
2Vancouver Prostate Centre, 3BC Cancer Institute,
The University of British Columbia
Abstract

Cancer subtyping is one of the most challenging tasks in digital pathology, where Multiple Instance Learning (MIL) by processing gigapixel whole slide images (WSIs) has been in the spotlight of recent research. However, MIL approaches do not take advantage of inter- and intra-magnification information contained in WSIs. In this work, we present GRASP, a novel lightweight graph-structured multi-magnification framework for processing WSIs in digital pathology. Our approach is designed to dynamically emulate the pathologist's behavior in handling WSIs and benefits from the hierarchical structure of WSIs. GRASP, which introduces a convergence-based node aggregation mechanism replacing traditional pooling mechanisms, outperforms state-of-the-art methods by a large margin in terms of balanced accuracy, while being significantly smaller than the closest-performing state-of-the-art models in terms of the number of parameters. Our results show that GRASP is dynamic in finding and consulting with different magnifications for subtyping cancers, is reliable and stable across different hyperparameters, and can generalize when using features from different backbones. The model's behavior has been evaluated by two expert pathologists, confirming the interpretability of the model's dynamics. We also provide a theoretical foundation, along with empirical evidence, for our work, explaining how GRASP interacts with different magnifications and nodes in the graph to make predictions. We believe that the strong characteristics yet simple structure of GRASP will encourage the development of interpretable, structure-based designs for WSI representation in digital pathology. Data and code can be found at https://github.com/AIMLab-UBC/GRASP

1 Introduction

Though deep learning has revolutionized computer vision in many fields, digital pathology tasks such as cancer classification remain a complex problem in the domain. For natural images, the task usually relates to assigning a label to an image with an approximate size of 256 × 256 pixels, with the label being clearly visible and well-represented in the image. Gigapixel tissue whole-slide images (WSIs) break this assumption in digital pathology as images exhibit enormous heterogeneity and can be as large as 150,000 × 150,000 pixels. Further, labels are provided at the slide level and may be descriptive of a small region of pixels occupying a minuscule portion of the total image, or they may be descriptive of complex interactions between the substructures within the entire composition of the WSI Ehteshami Bejnordi et al. (2017); Zhang et al. (2015); Pawlowski et al. (2019).

Multiple Instance Learning (MIL) has become the prominent approach to address the computational complexity of WSI; however, the majority of methods in the literature focus only on a single level of magnification, usually 20×, due to the computational cost of including other magnifications Lu et al. (2021); Ilse et al. (2018); Schirris et al. (2022); Zheng et al. (2021); Chen et al. (2021); Shao et al. (2021); Zhou et al. (2019); Guan et al. (2022). Using this magnification, a set of patches from each WSI are extracted and used as an instance-level representation. This neither captures the biological structure of the data nor does it follow the diagnostic protocols of pathologists. That is to say, WSIs at higher magnifications reveal finer details—such as the structure of the cell nucleus and the intra/extracellular matrix—whereas lower magnifications enable the identification of larger structures like blood vessels, connective tissue, or muscle fibers. Further, these structures are inconsistent from patient to patient, slide to slide, and subtype to subtype Morovic et al. (2021). To capture this variability, pathologists generally use a variety of lenses in their inspection of a tissue sample under the microscope, switching between different magnifications as needed. They generally begin with low magnifications to identify regions of interest for making preliminary decisions before increasing magnifications to confirm or rule out diagnoses Rasoolijaberi et al. (2022).

Figure 1: A chronological overview of different WSI representation methods and their performance compared to the size of the model.

To address this challenge, several multi-magnification approaches have recently been introduced for various tasks such as cancer subtyping, survival analysis, and image retrieval. However, these models often possess millions of parameters, as briefly illustrated in Figure 1, and suffer from interpretability issues due to their modular complexity Thandiackal et al. (2022); Li et al. (2021); Riasatian et al. (2021); Chen et al. (2022b); D’Alfonso et al. (2021); Hashimoto et al. (2020). Although these models have demonstrated promise across different tasks, they are not well-suited for low-resource clinical settings, where computational resources are often limited and the infrastructure may not support large-scale computational clusters. Therefore, there is a critical need to develop approaches and models specifically designed for use in smaller clinics, where the hardware may consist of small GPUs with limited memory. These lightweight models must balance accuracy with efficiency, enabling reliable deployment on standard devices while ensuring real-time performance and ease of integration within existing clinical workflows.

In this research, we aim to further the progress of deep learning in this context by introducing a pre-defined fixed structure for a lightweight model that reduces complexity while maintaining efficacy and interpretability. Our contributions are as follows:

  1. Introducing GRASP to capture pyramidal information contained in WSIs, as the first lightweight multi-magnification model in computational pathology.

  2. GRASP introduces a novel convergence-based mechanism instead of traditional pooling layers to capture intra-magnification information.

  3. We provide a solid theoretical foundation of the model's functionality and its interpretability from both technical and pathological perspectives, as well as empirical evidence for the model's efficacy concerning hyperparameters.

  4. An extensive comparison with eleven state-of-the-art models across three different cancers, ranging from two to five histotypes, using two popular backbones demonstrates the generalizability of the proposed method.

2 Related Work

2.1 Patch-Level Encoding

With recent progress in deep learning, deep features, i.e., high-level embeddings from a deep network, have advanced past handcrafted features and are considered the most robust sources for image representation. Pre-trained networks such as DenseNet Huang et al. (2017), ResNet He et al. (2016), or Swin Liu et al. (2021) draw their features from millions of non-medical and non-histopathological images, and thus cannot necessarily produce high-level embeddings for complex images, especially rare cancers Wang et al. (2023); Riasatian et al. (2021); Ciga et al. (2022); Wang et al. (2022). In this context, the use of Variational Autoencoders (VAEs) has been evaluated in Chen et al. (2022a), where the authors show that DenseNet pre-trained on ImageNet performs better for extracting semantic features from WSIs than VAEs. However, domain-specific vision encoders such as KimiaNet Riasatian et al. (2021), CTransPath Wang et al. (2022), PLIP Huang et al. (2023), UNI Chen et al. (2023), Virchow Vorontsov et al. (2023), etc., were later developed and trained on large sets of histopathology images (patches), outperforming models pre-trained on ImageNet across various tasks.

2.2 Weak Supervision in Gigapixel WSIs

MIL Approaches: Several domains of deep learning have been explored in an attempt to effectively address the task of classification in digital pathology. Models such as AB-MIL Ilse et al. (2018), CLAM Lu et al. (2021), and Trans-MIL Shao et al. (2021) have utilized MIL with promising results. Such approaches have generally focused only on instance-level feature extraction and have not yet explored modeling global, long-range interactions within and across different magnifications. In Waqas et al. (2024), a detailed overview of different MIL methods has been provided.
Graph-based Approaches: To incorporate contextual information and long-range interactions, models such as PatchGCN Chen et al. (2021) and DGCN Zheng et al. (2021) have been designed with a graph structure that can capture and learn context-aware features from interactions across the WSI. These models represent WSIs as graphs where the nodes are usually embeddings and edges are defined based on clustering or neighborhood node similarity, which in turn adds new hyperparameters and increases inference time. The similarity between nodes can be measured in terms of spatial or latent space, leading to the construction of different graphs for each WSI.
Multi-Magnification Approaches: Multiple efforts have been made to incorporate multi-magnification information in the context of gigapixel histopathology subtyping tasks. Models such as HiGT Guo et al. (2023), ZoomMIL Thandiackal et al. (2022), CSMIL Deng et al. (2023), $H^2$-MIL Hou et al. (2022), and DSMIL Li et al. (2021) address this by aggregating contextual tissue information using features from multiple magnifications in WSIs. DSMIL concatenates embeddings from different magnifications by duplicating lower-magnification features, making the model biased toward lower magnifications and unable to exploit inter-magnification information. On the other hand, ZoomMIL aggregates information from 5x to 20x in a fixed hierarchy with no interaction in the opposite direction. Chen et al. explore this in the context of vision transformers with their Hierarchical Image Pyramid Transformer (HIPT) Chen et al. (2022b). Their architecture incorporates regions of size 256×256 and 4096×4096 pixels to leverage the natural hierarchical structure of WSIs. $H^2$-MIL also adopts a graph-based approach, where it pools the nodes in each magnification using an Iterative Hierarchical Pooling module. Our proposed model, on the other hand, is designed to dynamically aggregate information within and across different magnifications without using traditional pooling layers in its intra-magnification interactions. It also employs a mechanism similar to zooming in and out through its inter-magnification interactions, from lower to higher magnifications and vice versa.

3 Method

Figure 2: Overview of our workflow beginning with WSIs and outputting slide-level subtype predictions. a) shows the WSI being tiled into patches of varying magnification, which are then embedded and assembled into a hierarchical graph. In b), graph representations are fed into a three-layer GCN Kipf & Welling (2016) and subsequently a two-layer MLP to predict graph-level (slide-level) subtypes. As shown in the message passing steps in b), nodes in the first GCN layer interact with their immediate neighbors; those in the second GCN layer can interact with their second neighbors; and nodes in the final GCN layer can interact with all nodes in the graph. Then, the intra-magnification convergence causes the nodes within each magnification to converge, which is an intrinsic property of the architecture. In the end, the three converged nodes are passed through an average readout module. This dynamic helps the model to look for important messages in the entire graph, and if a node contains important information, it will be broadcast to all other nodes in the graph. The output of the GCN layers is then averaged by the readout module and passed to the FC layers. (For the sake of illustration, $m=4$ is used to show the structure of GRASP.)

This section introduces the GRAph-Structured Pyramidal (GRASP) WSI Representation, a framework for subtype recognition using multi-magnification weakly-supervised learning, illustrated in Figure 2.

3.1 Problem Formulation

Contrary to Multiple Instance Learning (MIL) approaches, which use a bag of instances to represent a given WSI, GRASP benefits from a graph-based, multi-magnification structure to objectively represent connections between different instances across and within different magnifications. To build a graph and learn a graph-based function \mathcal{F} that predicts slide-level labels with no knowledge of patch labels, the following formulation is adopted.

For a given WSI $W_r\in\mathbb{R}^{N\times M\times 3}$ with label $\mathcal{Y}$, three sets of $m$ patches, $\{p_i\in\mathbb{R}^{n\times n\times 3}:\forall i\in[1,\dots,m]\}$, $\{p'_i\in\mathbb{R}^{n\times n\times 3}:\forall i\in[1,\dots,m]\}$, and $\{p''_i\in\mathbb{R}^{n\times n\times 3}:\forall i\in[1,\dots,m]\}$, are extracted at magnifications $\mathbf{M}_1=5x$, $\mathbf{M}_2=10x$, and $\mathbf{M}_3=20x$, respectively. It is important to note that $p''_i$ is the high-resolution window located at the center of $p'_i$, and $p'_i$ is the high-resolution window located at the center of $p_i$. These patches provide $3m$ patches in total, which are then fed into an encoder $\phi$ that maps the extracted patches into a lower-dimensional space as follows:

\phi: p_i \longrightarrow h_i \in \mathbb{R}^{d\times 1}, \quad \forall i\in[1,\dots,m]   (1)

where $h_i$ is the feature vector corresponding to the patch $p_i$. Correspondingly, $h'_i$ represents $p'_i$, and $h''_i$ represents $p''_i$. Using all the feature vectors for each $W_r$, the graph $\mathbb{G}_r$ is constructed using the transformation $\Gamma$:

\Gamma: \begin{bmatrix} \{h_1,\dots,h_m\} \\ \{h'_1,\dots,h'_m\} \\ \{h''_1,\dots,h''_m\} \end{bmatrix} \in\mathbb{R}^{3m\times d\times 1} \longrightarrow \mathbb{G}_r=(V_r,E_r)   (2)

Eventually, a classifier $\mathcal{C}$ is applied on top of the graph convolutional layers $\mathcal{G}$ to build the graph-based function $\mathcal{F}$ that predicts the slide-level label $\mathcal{Y}$ as follows:

\mathcal{Y}=\mathcal{F}(W_r)=\mathcal{C}(\mathcal{G}(V_r,E_r))   (3)

3.2 GRASP

We start by extracting multi-magnification patches as described earlier. Then, for any $i$, we use the same encoder to encode $p_i$, $p'_i$, and $p''_i$ into features $h_i$, $h'_i$, and $h''_i$, respectively. Having the instance features, we use the transformation $\Gamma$ to build $\mathbb{G}_r$ as introduced in Eq. 2.

The mechanism of connecting every two nodes in $\mathbb{G}_r$ through $\Gamma$ is premised upon an intuition of the pyramidal nature of WSIs as well as the way in which a conventional light microscope works when one switches from one magnification to another. When using a microscope, increasing magnification preserves the size of the image yet increases resolution by showing the central window of the lower magnification. This is the exact procedure we use to extract our patches in three magnifications. Therefore, for any $i$, $h_i$, $h'_i$, and $h''_i$ are connected to each other via undirected edges, where this connection represents the inter-magnification information contained in the features. On the other hand, for any $i$, all $h_i$'s contain information at $\mathbf{M}_1$, so they are connected to each other, forming a fully connected graph at $\mathbf{M}_1$ magnification to represent intra-magnification information. Similarly, all $h'_i$'s are connected to each other, and so are all $h''_i$'s, to represent the intra-magnification information contained in $\mathbf{M}_2$ and $\mathbf{M}_3$, respectively.
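A minimal sketch of this concentric extraction, assuming a region already loaded at 20x as a NumPy array; `region_20x`, the centre coordinates, and the patch size `n` are hypothetical placeholders, and the strided downsampling stands in for whatever resampling the actual pipeline uses.

```python
import numpy as np

def concentric_patches(region_20x: np.ndarray, cy: int, cx: int, n: int = 224):
    """Return (p_5x, p_10x, p_20x): three n x n patches sharing the same centre.

    The 20x patch is the central window of the 10x patch, which in turn is the
    central window of the 5x patch, mirroring how a microscope zooms in.
    """
    def crop(half):  # crop a (2*half) x (2*half) window around (cy, cx)
        return region_20x[cy - half:cy + half, cx - half:cx + half]

    p_20x = crop(n // 2)           # n x n at native 20x resolution
    p_10x = crop(n)[::2, ::2]      # 2n x 2n field of view, downsampled to n x n
    p_5x = crop(2 * n)[::4, ::4]   # 4n x 4n field of view, downsampled to n x n
    return p_5x, p_10x, p_20x


if __name__ == "__main__":
    dummy = np.random.randint(0, 255, (4096, 4096, 3), dtype=np.uint8)
    p5, p10, p20 = concentric_patches(dummy, cy=2048, cx=2048, n=224)
    print(p5.shape, p10.shape, p20.shape)  # all (224, 224, 3)
```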

Figure 2 shows a small example of such a graph for $\mathbb{G}_r=(V_r,E_r)|_{m=4}$, where blue, red, and green nodes each form a fully connected graph of size $m$; the inter-magnification relationships can also be seen via the edges between the blue & red nodes as well as the red & green nodes. So far, each WSI $W_r$ has been represented by a fixed graph $\mathbb{G}_r$ with $3m$ nodes and $\frac{(3m+1)m}{2}$ edges. These graphs are then deployed to train the GCNs and predict the label $\mathcal{Y}$ at the output.
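A minimal sketch of the graph construction $\Gamma$, assuming DGL and PyTorch; the feature tensors `h5`, `h10`, and `h20` (each $m \times d$) are placeholders for the encoder outputs, and the edge list follows the connectivity described above (fully connected within each magnification, plus the $i$-th 5x-10x and 10x-20x links).

```python
import itertools
import torch
import dgl

def build_grasp_graph(h5: torch.Tensor, h10: torch.Tensor, h20: torch.Tensor) -> dgl.DGLGraph:
    """Assemble the 3m-node GRASP graph from per-magnification features (m x d each)."""
    m = h5.shape[0]
    src, dst = [], []

    # intra-magnification: each magnification forms a fully connected subgraph
    for offset in (0, m, 2 * m):  # node ids: 5x -> [0, m), 10x -> [m, 2m), 20x -> [2m, 3m)
        for i, j in itertools.combinations(range(offset, offset + m), 2):
            src.append(i); dst.append(j)

    # inter-magnification: node i at 5x <-> node i at 10x, node i at 10x <-> node i at 20x
    for i in range(m):
        src += [i, m + i]
        dst += [m + i, 2 * m + i]

    # make the edges undirected by adding both directions
    u = torch.tensor(src + dst)
    v = torch.tensor(dst + src)
    g = dgl.graph((u, v), num_nodes=3 * m)
    g.ndata["feat"] = torch.cat([h5, h10, h20], dim=0)
    return g


# toy usage: m = 4 patches per magnification, d = 1024 features
g = build_grasp_graph(torch.randn(4, 1024), torch.randn(4, 1024), torch.randn(4, 1024))
print(g.num_nodes(), g.num_edges())  # 12 nodes; 26 undirected edges = (3m+1)m/2, stored as 52 directed
```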

3.3 Graph Convolutional Layers

Following Eq. 3, we define $\mathcal{G}$, which includes three GCN layers. The intuition behind using three layers is that as a pathologist begins to look for a tumor in a given WSI, they use an initial magnification to find the region of interest; once found, they consult other magnifications, which may require zooming in and out back and forth, to confirm their final decision. Therefore, as shown in Figure 2, all nodes in the graph interact with one another in a hierarchical fashion through the GCN layers. Consequently, each node gradually gathers information from all other nodes; therefore, if there are any important messages carried by some nodes, they are guaranteed to be broadcast to all other nodes, which is the equivalent of a zoom-in and zoom-out mechanism. This dynamic and hierarchical structure imposes theoretical properties on the model, which are discussed below. Following the graph convolutional layer introduced in Kipf & Welling (2016), the graph nodes are updated as follows:

h_i^{(l+1)}=\alpha\Big(b^{(l)}+\sum_{j\in\mathcal{N}(i)}\frac{1}{c_{ji}}h_j^{(l)}W^{(l)}\Big),   (4)

where $b^{(l)}$ is the bias; $h_i^{(l+1)}$ is the node-feature update of the graph at the $(l+1)$-th step at $\mathbf{M}_1$; $\mathcal{N}(i)$ is the set of neighbors of node $i$; $c_{ji}=\sqrt{|\mathcal{N}(j)||\mathcal{N}(i)|}$, where, given the symmetry of the graph, all $c_{ji}$'s are equal; and $\alpha(.)$ is the activation function, which is ReLU in our implementation. The expressions for $h_i'^{(l+1)}$ and $h_i''^{(l+1)}$ follow the same logic as $h_i^{(l+1)}$ in terms of the parameters mentioned above. After the last graph convolutional layer, where the intra-magnification convergence happens, the graph is passed through an average readout module to pool the three-node graph mean embedding, and the result is then fed into the two-layer classifier $\mathcal{C}$ to predict $\mathcal{Y}$.
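A minimal sketch of this architecture, assuming DGL's GraphConv and 1024-d input features; the layer widths follow the text (two 256-d layers, then 128-d), while the classifier hidden size (64) is an assumption for illustration. The global mean readout is used here because, with per-magnification convergence, the mean over all $3m$ nodes approximates the average of the three converged nodes.

```python
import torch
import torch.nn as nn
import dgl
from dgl.nn import GraphConv

class GRASP(nn.Module):
    """Sketch of GRASP: three GCN layers, a mean readout, and a two-layer classifier."""

    def __init__(self, in_dim: int = 1024, n_classes: int = 2):
        super().__init__()
        self.gcn1 = GraphConv(in_dim, 256, activation=torch.relu)
        self.gcn2 = GraphConv(256, 256, activation=torch.relu)
        self.gcn3 = GraphConv(256, 128, activation=torch.relu)
        self.classifier = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, n_classes))

    def forward(self, g: dgl.DGLGraph) -> torch.Tensor:
        g = dgl.add_self_loop(g)  # self-loops, as assumed in the analysis (Remark 1)
        h = g.ndata["feat"]
        h = self.gcn3(g, self.gcn2(g, self.gcn1(g, h)))
        g.ndata["h"] = h
        # mean over all nodes; with per-magnification convergence this approximates
        # (h* + h'* + h''*) / 3, the average of the three converged nodes
        hg = dgl.mean_nodes(g, "h")
        return self.classifier(hg)
```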

3.4 Intra-Magnification Convergence

Based on the idea of capturing the information within and across magnifications, we now show that the node features in each magnification of the graph converge to one node. With this, GRASP essentially encodes a graph of $3m$ nodes into only 3 nodes. We interpret this as the model learning the information contained in each magnification through interaction with the other magnifications, without the need for traditional pooling layers.

Theorem 1.

Suppose the graph convolutional layers have $L_2$-bounded weights and the graph node features at $l=0$ are $L_2$-bounded. Then, $\forall i,j\in[1,\dots,m]$,

\lim_{m\to\infty}\|h_i^{(3)}-h_j^{(3)}\|_2=0; \quad \lim_{m\to\infty}\|h_i'^{(3)}-h_j'^{(3)}\|_2=0; \quad \text{and} \quad \lim_{m\to\infty}\|h_i''^{(3)}-h_j''^{(3)}\|_2=0.   (5)

Proof: Please see Section 7.3 (Theoretical Analysis).

Note: It is worth mentioning that the assumptions made in Theorem 1 are minimal. The case of $L_2$-bounded weights has been further discussed in Wu et al. (2023); Cai & Wang (2020). The assumption that the graph node features at $l=0$ are bounded is also minimal, as we use a frozen feature extractor, which ideally should not generate unbounded values for a finite feature vector.
With that, we show

\|h_i^{(3)}-h_j^{(3)}\|_2 \leq \Big(\frac{1}{m+1}\Big)^3 \|h_i'^{(0)}-h_j'^{(0)}\|_2 \|W^{(2)}\|_2 \|W^{(1)}\|_2 \|W^{(0)}\|_2,

which essentially means that $\|h_i^{(3)}-h_j^{(3)}\|_2$ is upper bounded by a term proportional to the reciprocal of $(m+1)^3$. Thus, as $m$ increases, the upper bound gets tighter and eventually leads to $\lim_{m\to\infty}\|h_i^{(3)}-h_j^{(3)}\|_2=0$. As a result, we can conclude the following corollary.

Corollary 1.

$\forall i\in[1,\dots,m]$ and $m$ sufficiently large, $h_i^{(3)}\longrightarrow h^*$, $h_i'^{(3)}\longrightarrow h'^*$, and $h_i''^{(3)}\longrightarrow h''^*$, where $h^*$, $h'^*$, and $h''^*$ are functions of $m$; $h^*$, $h'^*$, and $h''^*$ are the convergence nodes for each magnification.

This is effectively equivalent to pooling the nodes at the magnification level, yet with a completely different approach from a traditional pooling layer, and without imposing further computational load on the network for pooling. Taking into account the fact that $h^*$, $h'^*$, and $h''^*$ are not necessarily equal, our model fuses node features in each magnification while it consults the other magnifications, and draws its conclusion by averaging the nodes across the three magnifications at the end of the convolutional layers by means of the readout module. This means that the final embedding of the graph is $\frac{h^*+h'^*+h''^*}{3}$. We believe that this process helps the model reduce variance and uncertainty in making predictions as $m$ grows. To support this claim, we provide empirical evidence detailed in Section 7.5.1 (Monte Carlo Test).

The structure of the graph has been designed in such a way that it does not get stuck in the bottleneck of over-smoothing, a common issue in deep GCNs Cai & Wang (2020). Our intuition is that nodes in $\mathbf{M}_1$, $\mathbf{M}_2$, and $\mathbf{M}_3$ interact via message passing, and the flow of inter-magnification information helps the model keep its balance and continue the process of learning. Nevertheless, by increasing the number of GCN layers to four or more, the so-called over-smoothing problem takes place, which can deteriorate the model's performance. On the other hand, fewer than three layers of GCNs might not be able to fully capture the inter-magnification interactions. This leaves us with three layers of GCNs, which is equal to the graph's diameter. In addition to this theoretical description, we empirically support our claim in Section 7.5.5 (Graph Depth).
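As a quick, purely illustrative sanity check on the claim that the graph's diameter is three (so three message-passing layers let any node reach any other), the snippet below rebuilds the connectivity for a small $m$ with networkx; the node ordering matches the construction sketch above.

```python
import itertools
import networkx as nx

m = 4
G = nx.Graph()
# fully connected subgraph within each magnification
for offset in (0, m, 2 * m):
    G.add_edges_from(itertools.combinations(range(offset, offset + m), 2))
# inter-magnification links: i-th 5x <-> i-th 10x, i-th 10x <-> i-th 20x
for i in range(m):
    G.add_edge(i, m + i)
    G.add_edge(m + i, 2 * m + i)

print(nx.diameter(G))  # prints 3: every node reaches every other within three hops
```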

4 Experiments

4.1 Data Preparation

We utilize three datasets: Esophageal Carcinoma (ESCA) from The Cancer Genome Atlas (TCGA), which includes 135 WSIs across two subtypes, and Ovarian Carcinoma and Bladder Cancer, where Ovarian Carcinoma consists of 948 WSIs with five histotypes, while Bladder Cancer contains 262 WSIs with two histotypes. These datasets were curated using HistoQC Janowczyk et al. (2019). A detailed breakdown of each dataset is available in Table 3.

4.2 Comparisons with State-of-the-Art

To ensure a fair comparison with state-of-the-art approaches, we trained all competing models with the same cross-validation folds and random seeds; ten random seeds were chosen to capture statistical significance and reliability. For evaluating the models, we adopt Balanced Accuracy and F1 Score, since these metrics reflect how reliably a model performs on imbalanced data, which is particularly important in clinical applications. We compare our proposed model, GRASP, against models from different families to obtain a broad spectrum of evaluation. These models include Ab-MIL Ilse et al. (2018), Trans-MIL Shao et al. (2021), CLAM-SB Lu et al. (2021), and CLAM-MB Lu et al. (2021) from the attention/transformer-based family; ZoomMIL Thandiackal et al. (2022), H2MIL Hou et al. (2022), and HiGT Guo et al. (2023) from multi-magnification approaches, since they have a hierarchical structure and are compatible with our patch extraction paradigm; and PatchGCN: latent & spatial Chen et al. (2021) and DGCN: latent & spatial Zheng et al. (2021) from graph-based learning approaches.

4.3 Subtype Prediction

Table 1 shows the comparison between our model and state-of-the-art methods based on Swin features, where GRASP outperforms all the competing methods on the Ovarian and Bladder datasets. Interestingly, Ab-MIL and CLAMs are the closest-performing methods to GRASP on these two datasets. On the ESCA dataset, however, ZoomMIL is the superior model with GRASP being the closest counterpart. Overall, GRASP is the superior model among all other models based on the average Balanced Accuracy on the three datasets.
Table 2 shows the comparison between our model and state-of-the-art methods based on KimiaNet features, where GRASP outperforms all the competing methods by a margin of 2.6%-10.7% Balanced Accuracy on the Ovarian dataset and 0.4%-10.0% on the Bladder dataset. It is worth mentioning that ZoomMIL is the closest-performing model to GRASP, although ZoomMIL has 7 times more parameters than GRASP. PatchGCN and DGCN do not perform comparably to GRASP, even though they use spatial information that GRASP does not. This implies that a multi-magnification graph structure can potentially show more capability than other state-of-the-art approaches in representing gigapixel WSIs. Moreover, single-magnification approaches are faster in terms of inference time than the other approaches, with CLAM-SB having the lowest inference time. Inference times (per slide) have been calculated on the same machine for all models.

Table 1: The average performance on 3 folds and 10 random seeds based on Swin's features. The best and second best average values are highlighted in bold and underlined, respectively. Ovarian: five subtypes; Bladder: two subtypes; ESCA: two subtypes.
Model | Params. | Inference | Ovarian Balanced Acc. | Ovarian F1 Score | Bladder Balanced Acc. | Bladder F1 Score | ESCA Balanced Acc. | ESCA F1 Score | Average Balanced Acc.
Trans-MIL | 2.672M | 0.019 sec | 0.297±0.011 | 0.244±0.011 | 0.830±0.037 | 0.819±0.030 | 0.626±0.021 | 0.611±0.021 | 0.584
Ab-MIL | 0.263M | 0.015 sec | 0.643±0.022 | 0.647±0.020 | 0.900±0.013 | 0.884±0.023 | 0.818±0.010 | 0.812±0.004 | 0.787
CLAM-SB | 0.795M | 0.014 sec | 0.546±0.062 | 0.550±0.065 | \underline{0.903±0.051} | \underline{0.902±0.044} | \underline{0.877±0.067} | 0.861±0.056 | 0.775
CLAM-MB | 0.796M | 0.015 sec | 0.558±0.044 | 0.565±0.042 | \underline{0.903±0.032} | 0.901±0.028 | 0.848±0.055 | 0.833±0.051 | 0.769
DGCN: latent | 0.790M | 0.098 sec | 0.224±0.017 | 0.146±0.017 | 0.725±0.052 | 0.655±0.108 | 0.763±0.051 | 0.736±0.059 | 0.570
DGCN: spatial | 0.790M | 0.086 sec | 0.210±0.011 | 0.133±0.012 | 0.700±0.044 | 0.620±0.074 | 0.660±0.049 | 0.606±0.053 | 0.523
PatchGCN: latent | 1.385M | 0.099 sec | 0.397±0.039 | 0.362±0.047 | 0.537±0.011 | 0.351±0.052 | 0.855±0.076 | 0.847±0.076 | 0.596
PatchGCN: spatial | 1.385M | 0.110 sec | 0.423±0.042 | 0.390±0.053 | 0.527±0.020 | 0.336±0.017 | 0.864±0.080 | 0.859±0.077 | 0.605
ZoomMIL | 2.891M | 0.024 sec | 0.640±0.018 | 0.648±0.011 | 0.899±0.046 | 0.895±0.037 | \mathbf{0.889±0.037} | \mathbf{0.895±0.040} | \underline{0.809}
HiGT | 6.388M | 0.148 sec | 0.251±0.037 | 0.184±0.049 | 0.755±0.041 | 0.717±0.023 | 0.760±0.050 | 0.744±0.061 | 0.588
H2MIL | 0.829M | 0.092 sec | \mathbf{0.671±0.008} | \mathbf{0.667±0.024} | 0.900±0.054 | 0.899±0.044 | 0.854±0.072 | 0.845±0.084 | 0.808
GRASP (ours) | 0.378M | 0.024 sec | \underline{0.669±0.029} | \underline{0.654±0.041} | \mathbf{0.905±0.058} | \mathbf{0.906±0.051} | \underline{0.877±0.111} | \underline{0.872±0.112} | \mathbf{0.817}
Table 2: The average performance on 3 folds and 10 random seeds based on KimiaNet's features. The best and second best average values are highlighted in bold and underlined, respectively. Ovarian: five subtypes; Bladder: two subtypes.
Model | Params. | Inference | Ovarian Balanced Acc. | Ovarian F1 Score | Bladder Balanced Acc. | Bladder F1 Score | Model's Average Balanced Acc.
Trans-MIL | 2.672M | 0.019 sec | 0.647±0.007 | 0.632±0.005 | 0.868±0.023 | 0.877±0.013 | 0.758
Ab-MIL | \mathbf{0.263M} | \underline{0.015} sec | 0.692±0.016 | 0.680±0.014 | 0.919±0.018 | 0.922±0.016 | 0.806
CLAM-SB | 0.795M | \mathbf{0.014} sec | 0.627±0.015 | 0.623±0.010 | 0.908±0.026 | 0.911±0.023 | 0.768
CLAM-MB | 0.796M | \underline{0.015} sec | 0.620±0.035 | 0.609±0.030 | 0.901±0.039 | 0.906±0.037 | 0.761
DGCN: latent | 0.790M | 0.098 sec | 0.654±0.017 | 0.652±0.024 | 0.835±0.034 | 0.841±0.035 | 0.745
DGCN: spatial | 0.790M | 0.086 sec | 0.654±0.009 | 0.652±0.009 | 0.867±0.015 | 0.875±0.007 | 0.761
PatchGCN: latent | 1.385M | 0.099 sec | 0.683±0.003 | 0.675±0.005 | 0.911±0.031 | 0.919±0.020 | 0.797
PatchGCN: spatial | 1.385M | 0.110 sec | 0.672±0.002 | 0.662±0.005 | 0.896±0.033 | 0.905±0.021 | 0.784
ZoomMIL | 2.891M | 0.024 sec | \underline{0.701±0.020} | \mathbf{0.690±0.021} | \underline{0.931±0.008} | \underline{0.933±0.009} | \underline{0.816}
HiGT | 6.388M | 0.148 sec | 0.337±0.044 | 0.288±0.054 | 0.847±0.067 | 0.842±0.055 | 0.592
H2MIL | 0.829M | 0.092 sec | 0.653±0.018 | 0.658±0.032 | 0.876±0.054 | 0.876±0.048 | 0.764
GRASP (ours) | \underline{0.378M} | 0.024 sec | \mathbf{0.727±0.036} | \underline{0.689±0.040} | \mathbf{0.935±0.011} | \mathbf{0.937±0.014} | \mathbf{0.831}

Comparing Tables 1 and 2, all the models performed better with KimiaNet embeddings than with Swin embeddings, which is mostly because KimiaNet has domain knowledge and can provide more contextual features than Swin. Furthermore, GRASP, H2MIL, ZoomMIL, Ab-MIL, and CLAMs showcase robust generalization and effective performance even when utilizing features from different backbones, especially with GRASP being the most robust model.

Although Deng et al. (2023) used attention score distributions to show their model is reliable across different magnifications, we go a step further and adopt a logic similar to that first introduced in Selvaraju et al. (2017) to define the concept of the energy of gradients for graph nodes (please see Graph-Based Visualization 7.6 in the Appendix). Therefore, for the first time in the field, we show that an AI model such as GRASP can learn the concept of magnification and behave according to the subtype and slide characteristics. To this end, we formulate an experiment to obtain a sense of each magnification's influence on the model, which leads to Figure 3. The main takeaway of this experiment is that, depending on the subtype, the distribution of magnifications referenced by GRASP is different. From a pathological point of view, this finding fits our knowledge of the biological properties of each subtype. As an example, we conducted a case study on the Bladder dataset, where the micropapillary subtype is known to be generally diagnosable at lower magnification owing to its morphological properties and the structure of micropapillary tumors, whereas UCC needs to be examined at higher magnifications due to its cell- and texture-dependent structure.

On the Ovarian dataset, for the subtype ENOC, endometrioids can often be recognized and identified at low power as they tend to have characteristic glandular architecture occupying contiguous, large areas. Low-grade serous carcinomas (LGSC) can be very difficult at low power due to the necessity of confirming low-grade cytology at high power. For CCOC, clear cell carcinomas have characteristic low-power architectural patterns but can also require high-power examinations to exclude high-grade serous carcinoma with clear cell features, meaning that important information is distributed across all magnifications. The other subtypes, MUC and HGSC, may either show pathognomonic architectural features at low power or require high-power examination on a case-by-case basis. According to Figure 3, GRASP collects the information from all three magnifications for MUC and CCOC.

Furthermore, to examine whether GRASP understands the biological meaning of the data, i.e., differentiating between tumor vs non-tumor regions, we conduct a visualization experiment to plot the pixel-level heatmap of patches in multiple magnifications as depicted in Figure 4.

4.4 Ablation Study

Here, we design five experiments to evaluate our proposed model. Firstly, a Monte Carlo test on graph size, i.e., the number of nodes, to investigate the impact of the number of nodes on the model's performance (Section Monte Carlo Test 7.5.1). Secondly, an analysis of model performance on individual or pairs of magnifications to study the effectiveness of the multi-magnification representation (Section Magnification Test 7.5.2). Thirdly, we investigate the effect of different graph convolution types (Section Graph Convolutions 7.5.3). In the fourth experiment, we study how different models perform when all the patches from a WSI are used (Section Patch Number 7.5.4). Lastly, we empirically show that a graph depth of $d=3$ is the appropriate choice for our design (Section Graph Depth 7.5.5).

5 Conclusion

In this work, we developed GRASP, the first lightweight multi-magnification framework for processing gigapixel WSIs. GRASP is a fixed-structure model that learns multi-magnification interactions in the data based on the idea of capturing both inter- and intra-magnification information. This relies on the theoretical property of the model, whereby it benefits from intra-magnification convergence to pool the nodes rather than conventional pooling layers. GRASP, with its pre-defined fixed structure, has considerably fewer parameters than other state-of-the-art multi-magnification models in the field and outperforms the competing models in terms of average Balanced Accuracy over three complex cancer datasets using two different backbones. For the first time in the field, and confirmed by two expert genitourinary pathologists, we showed that our model is dynamic in finding and consulting different magnifications for subtyping two challenging cancers. We also evaluated the model's decision-making to show that the model is learning semantics by highlighting tumorous regions in patches. Furthermore, we not only ran extensive experiments to show the model's reliability and stability with respect to its different hyperparameters, but also provided the theoretical foundation of our work to shed light on the dynamics of GRASP in interacting with different nodes and magnifications in the graph. To conclude, we hope that the strong characteristics of GRASP and its straightforward structure, along with the theoretical basis, will encourage lightweight, structure-based designs in the field of digital pathology for WSI representation.

6 MEANINGFULNESS STATEMENT

In digital pathology, a meaningful representation of life involves capturing the intricate, multi-scale structures of biological tissues, similar to how pathologists operate. GRASP (GRAph-Structured Pyramidal Whole Slide Image Representation) aids this process by modeling whole slide images as hierarchical graphs that integrate information across different microscopic magnification levels. This method, though lightweight, improves cancer subtyping accuracy and aligns computational analysis with human diagnostic processes, promoting deeper insights into tissue architecture and disease mechanisms.

7 Appendix

Figure 3: The histogram of consultations conducted by GRASP with different magnifications. First, this shows GRASP is actively dynamic in terms of capturing information from different magnifications, benefiting from its multi-magnification structure. Second, information is distributed differently over magnifications depending on the subtype and slide, and there is no single optimal magnification for a subtype. For example, in the Bladder dataset, '(5x & 10x & 20x)' shows that the model needed to consult all three magnifications for 19.3% and 39.4% of slides for MicroP and UCC, respectively; '(5x)' shows that the model mostly focused on only the 5x magnification for 43.2% and 5.6% of slides for MicroP and UCC, respectively. This behavior is similar to that of pathologists, who can diagnose massive MicroP tumors at lower magnifications, while they need to consult higher magnifications to confirm a minuscule mass of MicroP tumor. On the other hand, UCC is hard to diagnose at lower magnifications and requires careful examination at different magnifications due to its morphological complexity, which fits the model's proclivity to highlight more than one magnification for the majority of cases.
Figure 4: A case study on the Bladder dataset using KimiaNet features. a) Graph-based visualization: a random case of the MicroP subtype in the test data was selected to visualize its magnification heatmap, where we show the absolute gradient for each node. The 5x magnification contributes 66.01% of the whole energy the model spent on this slide, meaning GRASP overall places more emphasis on 5x for this slide. Patch-based visualization: GRASP highlights patches of the three magnifications of a region of interest. In the second row, highlighted regions show the model has identified those areas as important while paying minimal attention to other regions. As confirmed by an expert pathologist, the model's highlights on the three patches are tumors. The model can thus differentiate MicroP tumors from other tissue textures despite being trained only to separate MicroP vs UCC. b) shows a similar case, yet for the subtype UCC from a random slide in the test data. In this case, GRASP focuses on both 5x (47.45%) and 10x (33.38%) but is more interested in 5x. As confirmed by the expert pathologist, the regions highlighted by the model (yellowish areas in the second row) are tumorous neighborhoods. Therefore, GRASP can differentiate UCC tumors from other textures and healthy cells across multiple magnifications.

7.1 Datasets

Table 3: Summary of the datasets used in this study.
Dataset Source No. of WSIs Histotypes/Subtypes
Ovarian Carcinoma Private Dataset 948 High-Grade Serous Carcinoma (HGSC): 410
Clear Cell Ovarian Carcinoma (CCOC): 167
Endometrioid Carcinoma (ENOC): 237
Low-Grade Serous Carcinoma (LGSC): 69
Mucinous Carcinoma (MUC): 65
Esophageal Carcinoma (ESCA) TCGA 135 Adenocarcinoma: 86
Squamous Cell Carcinoma: 49
Bladder Cancer Private Dataset 262 Micropapillary (MicroP): 128
Conventional Urothelial Carcinomas (UCC): 134

A total of 1,133,388 patches of size 1000×1000 pixels for the Ovarian dataset, 602,874 patches of size 224×224 pixels for the ESCA dataset, and 313,191 patches of size 1000×1000 pixels for the Bladder dataset were extracted in a multi-magnification setting (approximately 2 TB of gigapixel WSIs). Patches are extracted such that they do not overlap at $\mathbf{M}_3$, while overlap at $\mathbf{M}_2$ and $\mathbf{M}_1$ is inevitable. From each magnification, $m\leq 400$ patches (note that we use a large field of view, meaning this number eventually covers much of the tissue regions) have been extracted per slide in both the Ovarian and Bladder datasets, as it has been shown in Chen et al. (2022a); Wang et al. (2023); Rasoolijaberi et al. (2022) that a subset of patches is enough to represent WSIs.
For patch-level feature extraction, we utilized two backbones: KimiaNet and Swin_base. Given that KimiaNet was trained on TCGA data in a supervised fashion, we intentionally refrained from extracting features from the ESCA dataset using this backbone to ensure an unbiased and leakage-free comparison. Conversely, we employed Swin, pre-trained on ImageNet, to extract features from all three datasets.
For each cancer dataset, we trained our proposed method with 3-fold cross-validation and repeated the experiments ten times with randomly generated seeds to ensure a rigorous comparison. To prevent data leakage in our cross-validation splits, we split slides by patient, since some patients have more than one slide, so that all slides from the same patient remain in the same set.
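A minimal sketch of such a leakage-free split, assuming scikit-learn; `slide_labels` and `patient_ids` are placeholder arrays, and StratifiedGroupKFold keeps all slides of a patient in the same fold while roughly preserving subtype proportions (the exact splitting utility used in the original experiments is not specified here).

```python
import numpy as np
from sklearn.model_selection import StratifiedGroupKFold

# placeholder data: one subtype label and one patient id per slide
slide_labels = np.array([0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1])
patient_ids  = np.array([0, 0, 1, 2, 2, 3, 4, 5, 6, 7, 8, 9])

cv = StratifiedGroupKFold(n_splits=3, shuffle=True, random_state=0)
for fold, (train_idx, test_idx) in enumerate(cv.split(slide_labels, slide_labels, groups=patient_ids)):
    # no patient appears in both the train and test sets
    assert not set(patient_ids[train_idx]) & set(patient_ids[test_idx])
    print(f"fold {fold}: {len(train_idx)} train slides, {len(test_idx)} test slides")
```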

7.2 Training and Inference

To tackle the data imbalance problem, we deployed a weighted cross-entropy loss for all models in the study. A learning rate of 0.001 and a weight decay of 0.01 for the Adam optimizer were adopted, and in cases where competing models were not converging, a learning rate of 0.0001 resolved the problem. Models were trained for 100, 50, and 10 epochs for the Ovarian, ESCA, and Bladder datasets, respectively. Specific to GRASP, the first two layers are of size 256 and the last layer output is of size 128. For all training and testing, the GPU hardware used was either a GeForce RTX 3090 Ti-24 GB (Nvidia), a Quadro RTX 5000-16 GB (Nvidia), or an RTX 6000-48 GB (Nvidia), based on availability. The Deep Graph Library (DGL), PyTorch, NumPy, SciPy, PyTorch Geometric, and Scikit-Learn libraries were used to perform the experiments.
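A minimal sketch of this training setup, reusing the GRASP module sketched after Section 3.3; the inverse-frequency class weighting and the placeholder class counts are assumptions for illustration, while the Adam hyperparameters follow the text.

```python
import torch
import torch.nn as nn

# per-fold class counts are placeholders; inverse-frequency weights are one common
# choice for a weighted cross-entropy -- the exact weighting scheme is an assumption here
class_counts = torch.tensor([128.0, 134.0])
class_weights = class_counts.sum() / (len(class_counts) * class_counts)

model = GRASP(in_dim=1024, n_classes=2)      # the GRASP sketch above (any slide-level classifier works)
criterion = nn.CrossEntropyLoss(weight=class_weights)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-2)

def train_step(graph, label):
    optimizer.zero_grad()
    logits = model(graph)                    # (1, n_classes) for a single slide graph
    loss = criterion(logits, label.view(1))
    loss.backward()
    optimizer.step()
    return loss.item()
```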

7.3 Theoretical Analysis

Here, we prove Theorem 1 for any $h_i^{(3)}$ and $h_j^{(3)}$; the cases for the $h_i'^{(3)}$'s and $h_i''^{(3)}$'s follow similarly. To start, we establish Lemma 1.

Lemma 1.

For any given vectors $x$ and $y$, with $\|.\|_2$ denoting the $L_2$ norm, the following inequality holds,

\|\alpha(x)-\alpha(y)\|_2 \leq \|x-y\|_2,   (6)

where $\alpha(.)=\operatorname{ReLU}(.)$.

Proof.

Let us reformulate $\operatorname{ReLU}(x)$ as $\frac{x+|x|}{2}$, where the operator $|x|$ is the element-wise absolute value of the vector $x$. Thus,

\|\alpha(x)-\alpha(y)\|_2 = \Big\|\frac{x+|x|}{2}-\frac{y+|y|}{2}\Big\|_2 = \Big\|\frac{x-y}{2}+\frac{|x|-|y|}{2}\Big\|_2 \leq \Big\|\frac{x-y}{2}\Big\|_2 + \Big\|\frac{|x|-|y|}{2}\Big\|_2   (7)

Using the reverse triangle inequality, $\big\|\frac{|x|-|y|}{2}\big\|_2 \leq \big\|\frac{x-y}{2}\big\|_2$, which yields

\|\alpha(x)-\alpha(y)\|_2 \leq \Big\|\frac{x-y}{2}\Big\|_2 + \Big\|\frac{x-y}{2}\Big\|_2 = \|x-y\|_2   (8)

Figure 5: The structure of our hierarchical graph and the relationship between two given nodes $h_i$ and $h_k$ within and across different magnifications.

Theorem 1:

Proof.

Recalling the main GCN formula in Eq. 4 for any $l$, with $\mathcal{N}(i)$ the set of neighbors of node $i$, and taking self-loops (see Remark 1) into account, $\forall i\in[1,\dots,m]$, $|\mathcal{N}(i)|=m+1$. With the graph being symmetric, we deduce that $\forall j\in[1,\dots,m]$, $|\mathcal{N}(j)|=m+1$. These result in

c_{ji}=\sqrt{|\mathcal{N}(j)||\mathcal{N}(i)|}=m+1

hence, Eq. 4 is simplified as follows,

h_i^{(l+1)}=\alpha\Big(b^{(l)}+\frac{1}{m+1}\sum_{j\in\mathcal{N}(i)}h_j^{(l)}W^{(l)}\Big).   (9)

Now, for any $i$, we can partition the set of all nodes in $\mathcal{N}(i)$ into two parts, $\{h_1^{(l)},\dots,h_m^{(l)}\}$ and $\{h_i'^{(l)}\}$, based on the relationship between nodes as shown in Figure 5. Therefore, $\sum_{j\in\mathcal{N}(i)}h_j^{(l)}$ can be rewritten as,

\sum_{j\in\mathcal{N}(i)}h_j^{(l)} = \Big(\sum_{j\in[1,\dots,m]}h_j^{(l)}\Big) + h_i'^{(l)}.   (10)

The first term in Eq. 10 is common among all nodes at a given magnification, so we denote it $\mathcal{H}^{(l)}$, leading to Eq. 11,

\sum_{j\in\mathcal{N}(i)}h_j^{(l)} = \mathcal{H}^{(l)} + h_i'^{(l)}   (11)

As a result, we combine Eq. 9 with Eq. 11, which yields

h_i^{(l+1)}=\alpha\Big(b^{(l)}+\frac{1}{m+1}\big(\mathcal{H}^{(l)}+h_i'^{(l)}\big)W^{(l)}\Big)   (12)

and similarly for any $j\neq i$,

h_j^{(l+1)}=\alpha\Big(b^{(l)}+\frac{1}{m+1}\big(\mathcal{H}^{(l)}+h_j'^{(l)}\big)W^{(l)}\Big).   (13)

By using Lemma 1 in combination with Eq. 12 and Eq. 13,

\|h_i^{(l+1)}-h_j^{(l+1)}\|_2 \leq \Big\|\frac{1}{m+1}h_i'^{(l)}W^{(l)}-\frac{1}{m+1}h_j'^{(l)}W^{(l)}\Big\|_2 = \frac{1}{m+1}\big\|\big(h_i'^{(l)}-h_j'^{(l)}\big)W^{(l)}\big\|_2 \leq \frac{1}{m+1}\|h_i'^{(l)}-h_j'^{(l)}\|_2\|W^{(l)}\|_2.   (14)

Therefore, we reach the inequality below,

\|h_i^{(l+1)}-h_j^{(l+1)}\|_2 \leq \frac{1}{m+1}\|h_i'^{(l)}-h_j'^{(l)}\|_2\|W^{(l)}\|_2.   (15)

Now, by applying this recursively over $l=0,1,2$, we have

\|h_i^{(3)}-h_j^{(3)}\|_2 \leq \Big(\frac{1}{m+1}\Big)^3\|h_i'^{(0)}-h_j'^{(0)}\|_2\|W^{(2)}\|_2\|W^{(1)}\|_2\|W^{(0)}\|_2.   (16)

$\|W^{(2)}\|_2$, $\|W^{(1)}\|_2$, and $\|W^{(0)}\|_2$ are $L_2$-bounded based on our assumption. Also, $\|h_i'^{(0)}-h_j'^{(0)}\|_2$ is $L_2$-bounded based on our assumption (the input image data is bounded, and our encoder $\phi$ is a bounded encoder: features are not scattered in an infinite space, but rather encoded in a finite space). Given these, as $m\rightarrow\infty$ (see Remark 2), the right-hand side of Eq. 16 approaches 0. Therefore,

\lim_{m\to\infty}\|h_i^{(3)}-h_j^{(3)}\|_2=0   (17)

Similarly, it can be proved that

\lim_{m\to\infty}\|h_i'^{(3)}-h_j'^{(3)}\|_2=0   (18)
\lim_{m\to\infty}\|h_i''^{(3)}-h_j''^{(3)}\|_2=0   (19)

\square

Remark 1.

To implement GCNs, self-loops are considered to represent the relationship of each node with itself, and this is also part of the technical implementation of the models. Thus, we account for this fact in our theoretical discussion.

Remark 2.

Empirically, reaching a sufficiently large $m$ can guarantee the convergence. For example, $m=10$ affects the upper bound in Eq. 16 with an order of $\frac{1}{10^3}$, while $m=100$ affects the upper bound with an order of $\frac{1}{10^6}$. In our experiments, $m=400$ has been adopted, which guarantees the convergence with an order of $\frac{1}{64\times 10^6}$. Therefore, the larger $m$, the tighter together the node features at the last GCN layer.

Corollary 2.

$\forall i\in[1,\dots,m]$ and $m$ sufficiently large, $h_i^{(3)}\longrightarrow h^*$, $h_i'^{(3)}\longrightarrow h'^*$, and $h_i''^{(3)}\longrightarrow h''^*$, where $h^*$, $h'^*$, and $h''^*$ are functions of $m$; $h^*$, $h'^*$, and $h''^*$ are the convergence nodes for each magnification.

Description: Given Eq. 17, every two arbitrary nodes $h_i^{(3)}$ and $h_j^{(3)}$ in one magnification level converge to each other. This means that all nodes converge to the same value, which we name $h^*$. Thus, $\forall i\in[1,\dots,m]$, $\lim_{m\to\infty}\|h_i^{(3)}-h^*\|_2=0$, or equivalently $h_i^{(3)}\longrightarrow h^*$. Using the same logic, one can conclude $h_i'^{(3)}\longrightarrow h'^*$ and $h_i''^{(3)}\longrightarrow h''^*$. Since each of $h^*$, $h'^*$, and $h''^*$ is a function of $m$, increasing $m$ results in them being a better estimation/representation of the intra-magnification information.

7.4 Empirical Proof

In addition to the theoretical analysis in Section 7.3, we empirically demonstrate the intra-magnification convergence in Figure 6. In this experiment, we plot the mean squared error between all nodes in a magnification and the corresponding convergence node. As shown, the mean squared error at the third layer ($\ell=3$) is nearly zero, providing empirical evidence for the convergence of the nodes, i.e., the nodes being pooled without the need for a pooling layer.
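A sketch of how such a curve could be computed, assuming the convergence node of each magnification is estimated by the per-magnification mean of the third-layer features; `h3` is a placeholder for the node features after the last GCN layer, ordered 5x, 10x, 20x as in the construction sketch.

```python
import torch

def per_magnification_mse(h3: torch.Tensor, m: int):
    """h3: (3m x d) node features after the third GCN layer, ordered 5x, 10x, 20x.

    Returns the mean squared distance of each magnification's nodes to their
    estimated converged node (the per-magnification mean).
    """
    errors = []
    for block in h3.split(m, dim=0):                  # one (m x d) block per magnification
        centre = block.mean(dim=0, keepdim=True)      # estimate of h*, h'*, h''*
        errors.append(((block - centre) ** 2).mean().item())
    return errors  # [mse_5x, mse_10x, mse_20x]; all near zero when convergence holds
```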

Figure 6: Empirical proof for intra-magnification convergence at $\ell=3$.

7.5 Ablation Study

7.5.1 Monte Carlo Test

As described in Algorithm 1, this test requires $m=400$. Therefore, we removed four slides that did not have 400 non-overlapping patches at 20x from the Bladder dataset and ran this test. Because these four slides were removed from the dataset, the results in Figure 7 are not directly comparable with any of the tables in the paper. In this test, a subset of the nodes is randomly dropped from each magnification layer (the corresponding nodes in each magnification are removed) to create independent graphs. Then, training and inference are performed, and the results are reported at the end.

Algorithm 1 Monte Carlo Test on Graph Size
1: m ← 400
2: step ← 10
3: Load DATASET
4: D ← {}    ▷ Node indices to be dropped
5: for iter ← 1 to 10 do
6:   for G_r in DATASET do
7:     for count ← 10 to 390 step step do
8:       D ← RANDOM([1, ..., m], count)
9:       Q_{r,count} ← G_r
10:      for index in D do
11:        Q_{r,count} ← DROP(Q_{r,count}, h_index)
12:        Q_{r,count} ← DROP(Q_{r,count}, h'_index)
13:        Q_{r,count} ← DROP(Q_{r,count}, h''_index)
14:      end for
15:      STORE(Q_{r,count})
16:    end for
17:  end for
18: end for
19: Use Q_{r,count} for cross-validation and inference
20: Report: Balanced Accuracy and Standard Deviation
Figure 7: Monte Carlo experiments on the graph size. As the number of nodes increases, the uncertainty decreases and the model stabilizes.
Table 4: Average performance on 3 folds and 10 random seeds based on KimiaNet's features (Bladder Cancer).
Model | Balanced Acc. | F1 Score
Graph on $\mathbf{M}_1$ | 0.898±0.052 | 0.890±0.047
Graph on $\mathbf{M}_2$ | 0.927±0.057 | 0.928±0.051
Graph on $\mathbf{M}_3$ | 0.905±0.035 | 0.913±0.023
Graph on $\mathbf{M}_1\&\mathbf{M}_2$ | 0.919±0.032 | 0.919±0.030
Graph on $\mathbf{M}_1\&\mathbf{M}_3$ | 0.917±0.031 | 0.922±0.033
Graph on $\mathbf{M}_2\&\mathbf{M}_3$ | 0.926±0.024 | 0.934±0.022
GRASP (ours) | \mathbf{0.935±0.011} | \mathbf{0.937±0.014}

In this experiment, we take a graph of size $3m=1200$ and randomly drop a set of its nodes, along with their multi-magnification correspondences, to build a new graph of smaller size. For example, when $count=10$, we randomly drop 10 triplets of $(h_{index}, h'_{index}, h''_{index})$ from the graph to create a new graph $Q_r$ of size 1170. Similarly, when $count=200$, we randomly drop 200 triplets of $(h_{index}, h'_{index}, h''_{index})$ from the graph to create a new graph $Q_r$ of size 600. This is an aggressive way to create statistically independent graphs of smaller sizes. To capture statistical variance, we repeat the experiment 10 times to create 40 graphs with different sizes and report the model performance in Figure 7. To accomplish this, the same 3-fold cross-validation sets and 10 random seeds have been used for all repetitions. Taking the 10 repetitions of 40 different graph sizes into account, we performed 12,000 independent training and inference experiments. Algorithm 1 demonstrates this experiment.
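A minimal sketch of the triplet-dropping step, assuming DGL and the node ordering of the construction sketch above (5x nodes first, then 10x, then 20x); `count` plays the same role as in Algorithm 1.

```python
import torch
import dgl

def drop_triplets(g: dgl.DGLGraph, m: int, count: int, generator=None) -> dgl.DGLGraph:
    """Randomly drop `count` patch triplets (h_i, h'_i, h''_i) from a 3m-node GRASP graph.

    Assumes nodes are ordered 5x: [0, m), 10x: [m, 2m), 20x: [2m, 3m).
    """
    idx = torch.randperm(m, generator=generator)[:count]   # indices i of the triplets to drop
    to_drop = torch.cat([idx, idx + m, idx + 2 * m])        # the three corresponding nodes
    return dgl.remove_nodes(g, to_drop)


# e.g. with m = 400, dropping 200 triplets leaves a graph of 600 nodes:
# q = drop_triplets(g, m=400, count=200)
```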

As can be seen in Figure 7, the performance of GRASP increases and stabilizes as the number of nodes increases. Since the standard deviation decreases as the number of nodes increases, this brings to light the concept of variance convergence, meaning that the model with $m\geq 200$ is fairly generalizable over different cross-validation folds and is statistically reliable in terms of performance. This is also in agreement with our theoretical expectation based on intra-magnification convergence: as $m$ grows, the model has better convergence, resulting in more stability.

7.5.2 Magnification Test

To confirm that the idea of multi-magnification is valid and that multi-magnification is the cause of the model's performance, we design 6 different experiments (repeated on 10 random seeds and 3 folds) on the Bladder dataset, with KimiaNet as the backbone, as our empirical evidence. These include evaluating the same model on only the $\mathbf{M}_1$, $\mathbf{M}_2$, and $\mathbf{M}_3$ fully connected graphs and on the pairs $\mathbf{M}_1\&\mathbf{M}_2$, $\mathbf{M}_1\&\mathbf{M}_3$, and $\mathbf{M}_2\&\mathbf{M}_3$. The results in Table 4 show that GRASP is superior to all single- and paired-magnification variants. One possible explanation is that for those single and paired graphs, three layers of GCNs most likely cause the aforementioned over-smoothing problem; this shows that GRASP can effectively capture the information contained in different magnifications and boost its performance.

7.5.3 Graph Convolutions

To study the impact of different graph convolutions on the performance of GRASP, we designed this experiment, in which we replaced the GCN layers in the original architecture with newer graph convolution variants. As seen in Figure 8, GAT and SAGEConv improve the performance of the primary GCN-based GRASP in terms of Balanced Accuracy over different batch sizes.

Figure 8: Ablation on the effect of different GNN structures benchmarked on 3 folds and 10 random seeds with different batch sizes. GNNs were taken from Deep Graph Library.
Table 5: The average performance on 3 folds and 10 random seeds, based on Swin’s features and the setting where all the patches were extracted from each WSI.
Model   Bladder: Two Subtypes
        Balanced Acc.   F1 Score   AUC
ZoomMIL   $0.879 \pm 0.065$   $0.872 \pm 0.060$   $0.951 \pm 0.031$
HiGT   $0.720 \pm 0.049$   $0.658 \pm 0.042$   $0.819 \pm 0.066$
H2MIL   $0.877 \pm 0.050$   $0.871 \pm 0.035$   $0.966 \pm 0.022$
GRASP (GCN)   $\mathbf{0.883 \pm 0.069}$   $\mathbf{0.879 \pm 0.065}$   $\mathbf{0.953 \pm 0.031}$
GRASP (GAT)   $\mathbf{0.917 \pm 0.013}$   $\mathbf{0.907 \pm 0.017}$   $\mathbf{0.978 \pm 0.007}$
GRASP (SAGEConv)   $\mathbf{0.936 \pm 0.023}$   $\mathbf{0.932 \pm 0.015}$   $\mathbf{0.988 \pm 0.008}$
Figure 9: Our model's performance as the number of GCN layers increases. Increasing the number of layers from three to four causes a relatively large drop, indicating that the model is being over-smoothed. The model recovers this loss at five or more layers, yet with a relatively larger standard deviation compared to three layers.

7.5.4 Patch Number

To compare the models' performance when all the patches from each slide are extracted, we designed this experiment and benchmarked the baselines against different variations of GRASP. GCN-based GRASP is superior to the other multi-magnification methods, ZoomMIL, H2MIL, and HiGT. Compared with the default patch-sampling setting, the increased number of patches helps GRASP improve, yet it degrades the performance of the other three models; single-magnification methods, however, become more competitive at this higher patch density. With this in mind, we followed the ablation study in Section 7.5.3 and compared GRASP's performance with newer graph convolutions. Consequently, GRASP with SAGEConv outperforms all other models. This further emphasizes the flexibility of the graph structure, which can easily employ different graph convolutions and which we believe is a unique advantage of GRASP.

7.5.5 Graph Depth

We experimented with GRASP using different numbers of GCN layers, as shown in Figure 9. First, three layers of GCNs achieve the same performance as two layers but with a lower standard deviation. Second, four layers of GCNs show a sudden drop in performance and an increase in standard deviation, which can be attributed to the over-smoothing problem Cai & Wang (2020); Wu et al. (2023). With five or more layers, the network recovers the performance, but with a slightly higher standard deviation and, clearly, an increased number of parameters. This shows that our original three-layer architecture is the best choice in the trade-off between average accuracy and the model's reliability. Based on the discussion in Wu et al. (2023), we expect the same phenomenon to occur for other graph convolution types.
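As an illustrative diagnostic of this effect (not an experiment from the paper), one can track how similar node embeddings become as untrained GraphConv layers are stacked; rising pairwise similarity is a common symptom of over-smoothing.

import torch
import torch.nn.functional as F
import dgl
from dgl.nn import GraphConv

def mean_pairwise_cosine(h):
    # Average cosine similarity between all pairs of node embeddings (diagonal excluded).
    h = F.normalize(h, dim=1)
    sim = h @ h.t()
    n = h.shape[0]
    return ((sim.sum() - sim.diag().sum()) / (n * (n - 1))).item()

g = dgl.add_self_loop(dgl.rand_graph(300, 3000))  # toy graph
h = torch.randn(300, 64)
for depth in range(1, 7):
    h = torch.relu(GraphConv(64, 64)(g, h))       # apply one more untrained GCN layer
    print(f"depth {depth}: mean pairwise cosine similarity = {mean_pairwise_cosine(h):.3f}")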

7.6 Graph-Based Visualization

Let us denote the output of the classifier by $S$, with $S_c$ being the logit for correctly classified slides/graphs. To visualize the importance of magnifications, we compute the magnitude of the gradient of a graph's output with respect to its node features at $l=0$ ($h_i^{(0)}$; for the sake of brevity, we drop the superscript $(0)$ from $h_i^{(0)}$ and write it as $h_i$), which is $\left|\frac{\partial S_{c}}{\partial h_{i}}\right|$ for magnification $\mathbf{M}_1$. Each $\left|\frac{\partial S_{c}}{\partial h_{i}}\right|$ is a vector of size $1024 \times 1$, and with $m$ nodes we obtain $m$ such vectors. Likewise, we can define $\left|\frac{\partial S_{c}}{\partial h_{i}^{\prime}}\right|$ and $\left|\frac{\partial S_{c}}{\partial h_{i}^{\prime\prime}}\right|$ for $\mathbf{M}_2$ and $\mathbf{M}_3$, respectively. We arrange these absolute gradients for each magnification in a matrix of size $m \times 1024$ as follows,

\textit{Heatmap}_{\mathbf{M}_{1}}=\left[\begin{array}{c}\left|\frac{\partial S_{c}}{\partial h_{1}}\right|\\ \vdots\\ \left|\frac{\partial S_{c}}{\partial h_{m}}\right|\end{array}\right] (20)
\textit{Heatmap}_{\mathbf{M}_{2}}=\left[\begin{array}{c}\left|\frac{\partial S_{c}}{\partial h_{1}^{\prime}}\right|\\ \vdots\\ \left|\frac{\partial S_{c}}{\partial h_{m}^{\prime}}\right|\end{array}\right] (21)
\textit{Heatmap}_{\mathbf{M}_{3}}=\left[\begin{array}{c}\left|\frac{\partial S_{c}}{\partial h_{1}^{\prime\prime}}\right|\\ \vdots\\ \left|\frac{\partial S_{c}}{\partial h_{m}^{\prime\prime}}\right|\end{array}\right] (22)

As such, stacking the matrices in Equations 20, 21, and 22 gives the overall heatmap for the graph, of size $3m \times 1024$, which allows comparing the influence of each magnification:

\textit{Heatmap}=\left[\begin{array}{c}\textit{Heatmap}_{\mathbf{M}_{1}}\\ \textit{Heatmap}_{\mathbf{M}_{2}}\\ \textit{Heatmap}_{\mathbf{M}_{3}}\end{array}\right] (23)

This is the heatmap depicted in Figure 9 as a graph-based heatmap, which shows how the model focuses on different magnifications.
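A hedged sketch of this gradient-based heatmap computation is given below; DummyGraphClassifier is a stand-in for GRASP, and only the autograd pattern mirrors Equations 20-23.

import torch
import torch.nn as nn

class DummyGraphClassifier(nn.Module):
    # Stand-in for GRASP: mean-pools node features and maps them to class logits.
    def __init__(self, dim=1024, n_classes=2):
        super().__init__()
        self.fc = nn.Linear(dim, n_classes)

    def forward(self, feats):
        return self.fc(feats.mean(dim=0))

def gradient_heatmaps(model, feats, target_class, m):
    # `feats` holds the layer-0 node features h_i, ordered as [M1 | M2 | M3].
    feats = feats.clone().requires_grad_(True)
    s_c = model(feats)[target_class]               # correct-class logit S_c
    (grad,) = torch.autograd.grad(s_c, feats)      # dS_c/dh_i for all 3m nodes at once
    heat = grad.abs()                              # |dS_c/dh_i|, shape (3m, 1024)
    return heat[:m], heat[m:2 * m], heat[2 * m:]   # Heatmap_M1, Heatmap_M2, Heatmap_M3

m = 400
hm1, hm2, hm3 = gradient_heatmaps(DummyGraphClassifier(), torch.randn(3 * m, 1024), target_class=1, m=m)
overall = torch.cat([hm1, hm2, hm3], dim=0)        # Equation 23: stacked heatmap of size 3m x 1024
print(overall.shape)                               # torch.Size([1200, 1024])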

Having the gradient for each node in the graph, we introduce the notion of gradient energy to determine which magnification(s) play a more important role in GRASP's final decision. To do so, we start by defining $\mathcal{E}_{\mathbf{M}_1}$ as follows,

\mathcal{E}_{\mathbf{M}_{1}}=\sum_{i\in[1,\dots,m]}\left\|\frac{\partial S_{c}}{\partial h_{i}}\right\|_{2}^{2} (24)

and similarly for $\mathbf{M}_2$ and $\mathbf{M}_3$,

\mathcal{E}_{\mathbf{M}_{2}}=\sum_{i\in[1,\dots,m]}\left\|\frac{\partial S_{c}}{\partial h_{i}^{\prime}}\right\|_{2}^{2} (25)
\mathcal{E}_{\mathbf{M}_{3}}=\sum_{i\in[1,\dots,m]}\left\|\frac{\partial S_{c}}{\partial h_{i}^{\prime\prime}}\right\|_{2}^{2} (26)

Having these energies, the contribution of each magnification is calculated as its relative share of the total gradient energy of the graph:

\mathbf{M}_{1}\textit{'s contribution}=\frac{\mathcal{E}_{\mathbf{M}_{1}}}{\mathcal{E}_{\mathbf{M}_{1}}+\mathcal{E}_{\mathbf{M}_{2}}+\mathcal{E}_{\mathbf{M}_{3}}} (27)
\mathbf{M}_{2}\textit{'s contribution}=\frac{\mathcal{E}_{\mathbf{M}_{2}}}{\mathcal{E}_{\mathbf{M}_{1}}+\mathcal{E}_{\mathbf{M}_{2}}+\mathcal{E}_{\mathbf{M}_{3}}} (28)
\mathbf{M}_{3}\textit{'s contribution}=\frac{\mathcal{E}_{\mathbf{M}_{3}}}{\mathcal{E}_{\mathbf{M}_{1}}+\mathcal{E}_{\mathbf{M}_{2}}+\mathcal{E}_{\mathbf{M}_{3}}} (29)

Accordingly, the importance of each magnification can be quantified for further investigation. More samples are provided in Figure 3.
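Continuing the sketch above, the gradient energies and relative contributions of Equations 24-29 can be computed directly from the per-magnification heatmaps; all names here are illustrative.

import torch

def magnification_contributions(hm1, hm2, hm3):
    # Each energy is the sum over nodes of the squared L2 norm of that node's
    # gradient (Eqs. 24-26); normalizing by the total energy yields the
    # relative contributions of Eqs. 27-29, which sum to 1.
    energies = torch.stack([(hm1 ** 2).sum(), (hm2 ** 2).sum(), (hm3 ** 2).sum()])
    return energies / energies.sum()

# Example with random heatmaps of the same shape as in the previous sketch:
hm1, hm2, hm3 = (torch.rand(400, 1024) for _ in range(3))
c1, c2, c3 = magnification_contributions(hm1, hm2, hm3).tolist()
print(f"M1: {c1:.3f}  M2: {c2:.3f}  M3: {c3:.3f}")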

References

  • Cai & Wang (2020) Chen Cai and Yusu Wang. A note on over-smoothing for graph neural networks. arXiv preprint arXiv:2006.13318, 2020.
  • Chen et al. (2022a) Chen Chen, Ming-Yuan Lu, David F. K. Williamson, Andrew D. Trister, Ravi G. Krishnan, and Faisal Mahmood. Fast and scalable search of whole-slide images via self-supervised deep learning. Nature biomedical engineering, 6(12):1420–1434, 2022a.
  • Chen et al. (2021) Richard J Chen, Ming-Yuan Lu, Muhammad Shaban, Chi Chen, Ting-Yun Chen, David F Williamson, and Faisal Mahmood. Whole slide images are 2d point clouds: context-aware survival prediction using patch-based graph convolutional networks. In Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part VIII, pp.  339–349. Springer, 2021.
  • Chen et al. (2022b) Richard J Chen, Cheng Chen, Yuanfang Li, Tsung-Ying Chen, Andrew D Trister, Ravi G Krishnan, and Faisal Mahmood. Scaling vision transformers to gigapixel images via hierarchical self-supervised learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  16144–16155, 2022b.
  • Chen et al. (2023) Richard J Chen, Tong Ding, Ming Y Lu, Drew FK Williamson, Guillaume Jaume, Bowen Chen, Andrew Zhang, Daniel Shao, Andrew H Song, Muhammad Shaban, et al. A general-purpose self-supervised model for computational pathology. arXiv preprint arXiv:2308.15474, 2023.
  • Ciga et al. (2022) Ozan Ciga, Tony Xu, and Anne Louise Martel. Self supervised contrastive learning for digital histopathology. Machine Learning with Applications, 7:100198, 2022. ISSN 2666-8270. doi: https://doi.org/10.1016/j.mlwa.2021.100198. URL https://www.sciencedirect.com/science/article/pii/S2666827021000992.
  • Deng et al. (2023) Ruiqi Deng, Chun Cui, Lester W. Remedios, Shunxing Bao, Ryan M. Womick, Sylvain Chiron, Jialun Li, Joseph T. Roland, Keith S. Lau, Qing Liu, and Keith T. Wilson. Cross-scale multi-instance learning for pathological image diagnosis. arXiv preprint arXiv:2304.00216, April 2023.
  • D’Alfonso et al. (2021) Timothy M D’Alfonso, David J Ho, Matthew G Hanna, Zhi Li, Hongyan Li, Limei Tang, Lei Zhang, Ziyang Li, Ruiyang Liu, Yiming Zheng, et al. Multi-magnification-based machine learning as an ancillary tool for the pathologic assessment of shaved margins for breast carcinoma lumpectomy specimens. Mod Pathol, 34:1487–1494, 2021. doi: 10.1038/s41379-021-00807-9.
  • Ehteshami Bejnordi et al. (2017) Babak Ehteshami Bejnordi, Mitko Veta, Paul Johannes van Diest, and et al. Diagnostic assessment of deep learning algorithms for detection of lymph node metastases in women with breast cancer. JAMA, 318(22):2199–2210, 2017. doi: 10.1001/jama.2017.14585.
  • Guan et al. (2022) Yonghang Guan, Jun Zhang, Kuan Tian, Sen Yang, Pei Dong, Jinxi Xiang, Wei Yang, Junzhou Huang, Yuyao Zhang, and Xiao Han. Node-aligned graph convolutional network for whole-slide image representation and classification. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp.  18791–18801, 2022. doi: 10.1109/CVPR52688.2022.01825.
  • Guo et al. (2023) Ziyu Guo, Weiqin Zhao, Shujun Wang, and Lequan Yu. Higt: Hierarchical interaction graph-transformer for whole slide image analysis. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp.  755–764. Springer, 2023.
  • Hashimoto et al. (2020) Noriaki Hashimoto, Daisuke Fukushima, Ryoichi Koga, Yusuke Takagi, Kaho Ko, Kei Kohno, Masato Nakaguro, Shigeo Nakamura, Hidekata Hontani, and Ichiro Takeuchi. Multi-scale domain-adversarial multiple-instance cnn for cancer subtype classification with unannotated histopathological images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.
  • He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp.  770–778, 2016.
  • Hou et al. (2022) Wenjin Hou, Lu Yu, Chao Lin, Heng Huang, Rongtao Yu, Jing Qin, and Liang Wang. H2-mil: Exploring hierarchical representation with heterogeneous multiple instance learning for whole slide image analysis. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pp.  933–941, 2022.
  • Huang et al. (2017) Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q. Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp.  4700–4708. IEEE, 2017.
  • Huang et al. (2023) Zhi Huang, Federico Bianchi, Mert Yuksekgonul, Thomas Montine, and James Zou. Leveraging medical twitter to build a visual–language foundation model for pathology ai. bioRxiv, pp.  2023–03, 2023.
  • Ilse et al. (2018) Maximilian Ilse, Jakub Tomczak, and Max Welling. Attention-based deep multiple instance learning. In International Conference on Machine Learning, pp.  2123–2132. PMLR, 2018. URL http://proceedings.mlr.press/v80/ilse18a.html.
  • Janowczyk et al. (2019) Andrew Janowczyk, Ren Zuo, Hannah Gilmore, Michael Feldman, and Anant Madabhushi. Histoqc: An open-source quality control tool for digital pathology slides. JCO Clinical Cancer Informatics, (3):1–7, 2019. doi: 10.1200/CCI.18.00157. URL https://doi.org/10.1200/CCI.18.00157. PMID: 30990737.
  • Kipf & Welling (2016) Thomas N. Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016. URL https://arxiv.org/abs/1609.02907.
  • Li et al. (2021) Bin Li, Yin Li, and Kevin W. Eliceiri. Dual-stream multiple instance learning network for whole slide image classification with self-supervised contrastive learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  14318–14328, 2021.
  • Liu et al. (2021) Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF international conference on computer vision, pp.  10012–10022, 2021.
  • Lu et al. (2021) Michael Y. Lu, David F.K. Williamson, Tai-Yen Chen, and et al. Data-efficient and weakly supervised computational pathology on whole-slide images. Nature Biomedical Engineering, 5(6):555–570, 2021. doi: 10.1038/s41551-020-00682-w.
  • Morovic et al. (2021) Anamarija Morovic, Perry Damjanov, and Kyle Perry. Pathology for the Health Professions-Sixth Edition. Elsevier Health Sciences, 2021.
  • Pawlowski et al. (2019) Nick Pawlowski, Saurabh Bhooshan, Nicolas Ballas, Francesco Ciompi, Ben Glocker, and Michal Drozdzal. Needles in haystacks: On classifying tiny objects in large images. arXiv preprint arXiv:1908.06037, 2019.
  • Rasoolijaberi et al. (2022) Maral Rasoolijaberi, Morteza Babaei, Abtin Riasatian, Sobhan Hemati, Parsa Ashrafi, Ricardo Gonzalez, and Hamid R. Tizhoosh. Multi-magnification image search in digital pathology. IEEE Journal of Biomedical and Health Informatics, 26(9):4611–4622, 2022. doi: 10.1109/JBHI.2022.3181531.
  • Riasatian et al. (2021) Abtin Riasatian, Morteza Babaie, Danial Maleki, Shivam Kalra, Mojtaba Valipour, Sobhan Hemati, Manit Zaveri, Amir Safarpoor, Sobhan Shafiei, Mehdi Afshari, Maral Rasoolijaberi, Milad Sikaroudi, Mohd Adnan, Sultaan Shah, Charles Choi, Savvas Damaskinos, Clinton JV Campbell, Phedias Diamandis, Liron Pantanowitz, Hany Kashani, Ali Ghodsi, and H. R. Tizhoosh. Fine-tuning and training of densenet for histopathology image representation using tcga diagnostic slides, 2021. URL https://arxiv.org/abs/2101.07903.
  • Schirris et al. (2022) Yoni Schirris, Efstratios Gavves, Iris Nederlof, Hugo Mark Horlings, and Jonas Teuwen. Deepsmile: Contrastive self-supervised pre-training benefits msi and hrd classification directly from h&e whole-slide images in colorectal and breast cancer. Medical Image Analysis, 79:102464, 2022. ISSN 1361-8415. doi: https://doi.org/10.1016/j.media.2022.102464. URL https://www.sciencedirect.com/science/article/pii/S1361841522001116.
  • Selvaraju et al. (2017) Ramprasaath R. Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. Grad-cam: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Oct 2017.
  • Shao et al. (2021) Zhuchen Shao, Hao Bian, Yang Chen, Yifeng Wang, Jian Zhang, Xiangyang Ji, and Yongbing Zhang. TransMIL: Transformer based correlated multiple instance learning for whole slide image classification. In A. Beygelzimer, Y. Dauphin, P. Liang, and J. Wortman Vaughan (eds.), Advances in Neural Information Processing Systems, 2021. URL https://openreview.net/forum?id=LKUfuWxajHc.
  • Thandiackal et al. (2022) Kevin Thandiackal, Boqi Chen, Pushpak Pati, Guillaume Jaume, Drew F. K. Williamson, Maria Gabrani, and Orcun Goksel. Differentiable zooming for multiple instance learning on whole-slide images. In Shai Avidan, Gabriel Brostow, Moustapha Cissé, Giovanni Maria Farinella, and Tal Hassner (eds.), Computer Vision – ECCV 2022, pp.  699–715, Cham, 2022. Springer Nature Switzerland. ISBN 978-3-031-19803-8.
  • Vorontsov et al. (2023) Eugene Vorontsov, Alican Bozkurt, Adam Casson, George Shaikovski, Michal Zelechowski, Siqi Liu, Philippe Mathieu, Alexander van Eck, Donghun Lee, Julian Viret, et al. Virchow: A million-slide digital pathology foundation model. arXiv preprint arXiv:2309.07778, 2023.
  • Wang et al. (2022) Xiyue Wang, Sen Yang, Jun Zhang, Minghui Wang, Jing Zhang, Wei Yang, Junzhou Huang, and Xiao Han. Transformer-based unsupervised contrastive learning for histopathological image classification. Medical image analysis, 81:102559, 2022.
  • Wang et al. (2023) Xiyue Wang, Yuexi Du, Sen Yang, Jun Zhang, Minghui Wang, Jing Zhang, Wei Yang, Junzhou Huang, and Xiao Han. Retccl: Clustering-guided contrastive learning for whole-slide image retrieval. Medical Image Analysis, 83:102645, 2023. ISSN 1361-8415. doi: https://doi.org/10.1016/j.media.2022.102645. URL https://www.sciencedirect.com/science/article/pii/S1361841522002730.
  • Waqas et al. (2024) Muhammad Waqas, Syed Umaid Ahmed, Muhammad Atif Tahir, Jia Wu, and Rizwan Qureshi. Exploring multiple instance learning (mil): A brief survey. Expert Systems with Applications, 250:123893, 2024. ISSN 0957-4174. doi: https://doi.org/10.1016/j.eswa.2024.123893. URL https://www.sciencedirect.com/science/article/pii/S0957417424007590.
  • Wu et al. (2023) Xinyi Wu, Amir Ajorlou, Zihui Wu, and Ali Jadbabaie. Demystifying oversmoothing in attention-based graph neural networks. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine (eds.), Advances in Neural Information Processing Systems, volume 36, pp.  35084–35106. Curran Associates, Inc., 2023. URL https://proceedings.neurips.cc/paper_files/paper/2023/file/6e4cdfdd909ea4e34bfc85a12774cba0-Paper-Conference.pdf.
  • Zhang et al. (2015) Xingyuan Zhang, Hang Su, Lin Yang, and Shuicheng Zhang. Fine-grained histopathological image analysis via robust segmentation and large-scale retrieval. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp.  5361–5368, Boston, MA, USA, 2015. doi: 10.1109/CVPR.2015.7299174.
  • Zheng et al. (2021) Yu Zheng, Chen Gao, Liang Chen, Depeng Jin, and Yong Li. Dgcn: Diversified recommendation with graph convolutional networks. In Proceedings of the Web Conference 2021, WWW ’21, pp.  401–412, New York, NY, USA, 2021. Association for Computing Machinery. ISBN 9781450383127. doi: 10.1145/3442381.3449835. URL https://doi.org/10.1145/3442381.3449835.
  • Zhou et al. (2019) Yanning Zhou, Simon Graham, Navid Alemi Koohbanani, Muhammad Shaban, Pheng-Ann Heng, and Nasir Rajpoot. Cgc-net: Cell graph convolutional network for grading of colorectal cancer histology images. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops, Oct 2019.